Closed: davidwilemski closed this issue 5 years ago.
Hi @davidwilemski, I'm more than happy to receive a PR for this. Your options sound very feasible, and I agree that adding some kind of retry logic before closing the client entirely sounds good.
And thank you for the idea of using a Docker image for testing; I will change the CI build and examples to use Docker (at the time I implemented the original version of this library I didn't have access to a Docker engine :) )
I've dug a little deeper into this crash and the `zookeeper::io` source code and have some findings. I have some code up for discussion, but I'm not 100% convinced it's ready to merge yet and want to bang on it a little more first. I didn't have any experience with Mio before digging into this issue, so I've leaned heavily on reading code and docs, and am happy to defer to more experienced opinions. That said, I have some thoughts.
To start with, and this is speculation, the error I was seeing on the `Poll::deregister` call might be related to carllerche/mio#911. I am running on a Mac, and with extra debug logging I was able to see that the error type was for a file (descriptor) not found. We should still fix the panic within this library. (At the very least I'd really like to remove that `unwrap()` on the `Poll::deregister` call.)
Next, I found that `Poll::deregister` might not need to be called in the first place:

> Evented handles are automatically deregistered when they are dropped. It is common to never need to explicitly call deregister.

https://docs.rs/mio/0.6.16/mio/struct.Poll.html#readiness-operations
After some experimentation, I can confirm that removing the deregister call entirely does appear to work and triggers the existing reconnect logic. On Mac, since I was previously getting an error on the FD, it seems unlikely that we're leaking any resources. I'm somewhat convinced that this is safe based on the documentation but have not experimented on other platforms.
However, this change comes with a catch: we fall through to the `ready.is_hup()` check in the next conditional after `ready.is_readable()`: https://github.com/bonifaido/rust-zookeeper/blob/e25f2a0ee6cc2667430054f08c8c69fca1c8c4e9/src/io.rs#L414-L430
Since we've just reconnected, I don't think we need to evaluate this block at all, so I added returns after the reconnect. We don't need to reset the timers, because that is also done on reconnect, and we don't need to `Poll::reregister` our interest, because the reconnect already registers the new socket.
Moreover, the Mio docs describe the hangup check as an "optimization", although I'm not entirely sure of what. A semi-educated guess is that it's an optimization over just waiting for the poll/socket to time out. It seems like that block may or may not be needed, but I've tested with and without it and things seem to work fine either way.
Two other semi-related changes (as they're reconnect-logic related) I found to make are:

- The `std::io::Error` that we get should be checked for `ErrorKind::WouldBlock`, and we should not reconnect on that kind of error, as it indicates a spurious wakeup:

> If operation fails with WouldBlock, then the caller should not treat this as an error and wait until another readiness event is received.

https://docs.rs/mio/0.6.16/mio/struct.Poll.html#spurious-events
I'm having some Maven/Java issues on my local machine, so I haven't been able to run the test cases that depend on a ZK server, but I have been testing heavily against the program I used in the original report of this issue. #57 will be really nice for avoiding this situation :)
Lastly, I have logs showing that, with the changes described above, the client successfully reconnects: https://gist.github.com/davidwilemski/caa2edb9dd08dd1115d9813a0a63602f
Of course, on reconnect, the application must still observe the client state change and re-add any ephemeral znodes it was maintaining, but I believe this is expected behavior.
I'll get a PR up shortly with some of this description replicated in the comments but wanted to detail my debugging process while it was fresh.
https://crates.io/crates/zookeeper/0.5.6 has been released containing your fix @davidwilemski, thanks for the detailed explanation!
Great, thanks for the quick release!
I have a use case where I am using a `zookeeper::ZooKeeper` client instance to maintain an ephemeral znode while my application does other work. I've found that the client panics in its reconnection logic on an internal thread when I kill the ZooKeeper server that I am testing with. This leaves my application running, but without the client connection in a functional state. The backtrace that I see is the following:
I believe this is due to the `unwrap()` call at this line: https://github.com/bonifaido/rust-zookeeper/blob/e25f2a0ee6cc2667430054f08c8c69fca1c8c4e9/src/io.rs#L326

I also have a listener on the connection that right now just logs the state transitions of the client. I see the client go through the `Connected` -> `NotConnected` and `NotConnected` -> `Connecting` state transitions before the panic happens.

In order to reproduce this behavior, I've been using Docker to start and stop a local ZK server with the official ZooKeeper image from Docker Hub. To run the server and expose a port, run `docker run --rm -p 2181:2181 --name test-zookeeper -d zookeeper` on a machine with `docker` installed.

I could handle the disconnect from within my application by watching for the
`NotConnected` event and taking action from there (either exiting the rest of the application or trying to rebuild the client), but I think it would be nice to resolve some of this within the client library as well. It doesn't seem like the client's internal thread should panic, leaving `Connecting` as the last client state event the caller receives.

Two options that come to mind for handling this situation are:

- `ZkState::Closed` might already fit the situation and could potentially be published in this case.

What do you think about these options? Would you be amenable to a PR to at the least handle the case where the reconnect fails and we publish a `ZkState::Closed` event to the listeners?