bonifaido / rust-zookeeper

Pure Rust library for Apache ZooKeeper built on MIO
MIT License

IO thread panic on Zookeeper client reconnect failure #56

Closed davidwilemski closed 5 years ago

davidwilemski commented 5 years ago

I have a use case where I am utilizing a zookeeper::ZooKeeper client instance to maintain an ephemeral znode while my application does other work. I've found that the client panics in its reconnection logic on an internal thread when I kill the ZooKeeper server that I am testing with. This leaves my application running, but without a functional client connection.

The backtrace that I see is the following:

thread 'io' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/libcore/result.rs:1009:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:211
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:227
   4: <std::panicking::begin_panic::PanicPayload<A> as core::panic::BoxMeUp>::get
             at src/libstd/panicking.rs:491
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:398
   6: std::panicking::try::do_call
             at src/libstd/panicking.rs:325
   7: core::char::methods::<impl char>::escape_debug
             at src/libcore/panicking.rs:95
   8: core::alloc::Layout::repeat
             at /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b/src/libcore/macros.rs:26
   9: <zookeeper::acl::Acl as core::clone::Clone>::clone
             at /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b/src/libcore/result.rs:808
  10: zookeeper::io::ZkIo::reconnect
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:326
  11: zookeeper::io::ZkIo::ready_zk
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:429
  12: zookeeper::io::ZkIo::ready
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:366
  13: zookeeper::io::ZkIo::ready_timer
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:549
  14: zookeeper::zookeeper::ZooKeeper::connect::{{closure}}
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/zookeeper.rs:78

I believe this is due to the `unwrap()` call at this line: https://github.com/bonifaido/rust-zookeeper/blob/e25f2a0ee6cc2667430054f08c8c69fca1c8c4e9/src/io.rs#L326

I also have a listener on the connection that right now just logs the state transitions of the client. I see the client go through the Connected -> NotConnected and NotConnected -> Connecting state transitions before the panic happens.
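(For reference, a stripped-down version of that setup looks roughly like the following; the address, timeout, and watcher below are placeholders rather than my actual application code.)

```rust
use std::time::Duration;
use zookeeper::{Watcher, WatchedEvent, ZooKeeper};

struct LoggingWatcher;
impl Watcher for LoggingWatcher {
    fn handle(&self, event: WatchedEvent) {
        println!("watched event: {:?}", event);
    }
}

fn main() {
    // Connect to a local server and log every client state transition.
    let zk = ZooKeeper::connect("127.0.0.1:2181", Duration::from_secs(15), LoggingWatcher)
        .expect("failed to connect");
    zk.add_listener(|state| println!("client state changed to {:?}", state));
    // ... maintain the ephemeral znode and do other work ...
}
```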

To reproduce this behavior I've been using Docker to start and stop a local ZK server with the official ZooKeeper image from Docker Hub. To run the server and expose its port, you can run `docker run --rm -p 2181:2181 --name test-zookeeper -d zookeeper` on a machine with Docker installed.

I could handle the disconnect from within my application by watching for the NotConnected event and taking action from there (either exiting the rest of the application or trying to rebuild the client), but I think it would be nice to resolve some of this within the client library as well. The client's internal thread shouldn't panic and leave Connecting as the last state event the caller ever receives.

Two options that come to mind for handling this situation are:

  1. Instead of panicking, publish some sort of client state indicating the client is permanently failed/not connected. It looks like ZkState::Closed might already fit the situation and could be published in this case.
  2. Add more logic to the reconnect routine to retry continually, or perhaps a configurable retry policy that tries a number of times before entering the state described in option one.

What do you think about these options? Would you be amenable to a PR that, at the least, handles the case where the reconnect fails by publishing a ZkState::Closed event to the listeners?
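To make the first option concrete, here is a rough sketch of what the failing call site could do instead of unwrapping; the field and helper names (`poll`, `sock`, `notify_state`) are illustrative assumptions, not the crate's actual internals:

```rust
use mio::{net::TcpStream, Poll};
use zookeeper::ZkState;

// Illustrative skeleton only; the real ZkIo has more fields and state.
struct ZkIoSketch {
    poll: Poll,
    sock: TcpStream,
}

impl ZkIoSketch {
    fn notify_state(&self, _old: ZkState, _new: ZkState) {
        // publish the transition to registered listeners
    }

    fn reconnect(&mut self) {
        // Tolerate a failed deregister instead of panicking, and surface a
        // terminal state to listeners if the reconnect cannot proceed.
        if let Err(e) = self.poll.deregister(&self.sock) {
            eprintln!("deregister failed during reconnect: {}", e);
            self.notify_state(ZkState::Connecting, ZkState::Closed);
            return;
        }
        // ... establish a new connection and register it with the poll ...
    }
}
```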

bonifaido commented 5 years ago

Hi @davidwilemski, I'm more than happy to receive a PR for this; both of your options sound very feasible. I agree that adding some kind of retry logic before closing the client entirely sounds good.

bonifaido commented 5 years ago

And thank you for the idea of using a Docker image for testing; I will change the CI build and examples to use Docker (at the time I implemented the original version of this library I didn't have access to a Docker engine :) )

davidwilemski commented 5 years ago

I've dug a bit deeper into this crash and the zookeeper::io source code and have some findings. I have some code up for discussion, but I'm not 100% convinced it's ready for merge yet and want to exercise it a bit more first. I didn't have any experience with Mio before digging into this issue, so I've leaned heavily on reading code and docs, and I'm happy to defer to more experienced opinions. That said, I have some thoughts.

To start with, and this is speculation, the error I was seeing on the Poll::deregister call might be related to carllerche/mio#911. I am running on a Mac, and with extra debug logging I was able to see that the error was a file (descriptor) not found. We should still fix the panic within this library; at the very least I'd really like to remove the unwrap on the Poll::deregister call.

Next, I found that Poll::deregister might not need to be called in the first place:

Evented handles are automatically deregistered when they are dropped. It is common to never need to explicitly call deregister.

https://docs.rs/mio/0.6.16/mio/struct.Poll.html#readiness-operations

After some experimentation, I can confirm that removing the deregister call entirely does appear to work and triggers the existing reconnect logic. On Mac, since I was previously getting an error on the FD, it seems unlikely that we're leaking any resources. I'm somewhat convinced that this is safe based on the documentation but have not experimented on other platforms.
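As a minimal illustration of what this relies on, here is a mio 0.6 sketch (not the crate's code) of replacing a dead socket without an explicit deregister; the token and interest set are arbitrary for the example:

```rust
use std::io;
use std::net::SocketAddr;
use mio::{net::TcpStream, Poll, PollOpt, Ready, Token};

// Dropping the old Evented handle deregisters it implicitly, so we only need
// to connect and register the replacement socket.
fn replace_socket(poll: &Poll, old: TcpStream, addr: &SocketAddr) -> io::Result<TcpStream> {
    drop(old);
    let new = TcpStream::connect(addr)?;
    poll.register(&new, Token(0), Ready::readable() | Ready::writable(), PollOpt::edge())?;
    Ok(new)
}
```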

However, this change comes with a catch: after reconnecting we fall through to the ready.is_hup() check in the next conditional down, after ready.is_readable(): https://github.com/bonifaido/rust-zookeeper/blob/e25f2a0ee6cc2667430054f08c8c69fca1c8c4e9/src/io.rs#L414-L430

Since we've just reconnected, I don't think we really need to evaluate this block at all. Therefore, I added returns after the reconnect. We don't need to reset the timers because that is also done on reconnect and we don't need to Poll::reregister our interest because the reconnect already registers the new socket.

Moreover, the Mio docs describe hangup checks as an "optimization", although I'm not entirely sure what they optimize; a semi-educated guess is that they avoid waiting for the poll/socket to time out. That block may or may not be needed, but I've tested with and without it and things work fine either way.

Two other semi-related changes (both touch the reconnect logic) that I found to make are:

  1. Any std::io::Error we get should be checked for ErrorKind::WouldBlock, and we should not reconnect on that kind of error since it indicates a spurious wakeup (see the sketch after this list).

If operation fails with WouldBlock, then the caller should not treat this as an error and wait until another readiness event is received.

https://docs.rs/mio/0.6.16/mio/struct.Poll.html#spurious-events

  2. A reconnect should be triggered when a read/write returns 0 bytes. There is already a log message here noting that the socket has disconnected, so we should make the disconnect explicit and trigger the reconnect.
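A standalone sketch of the read-path decision described by these two items (illustrative only, not the actual io.rs code; the write side would be handled analogously):

```rust
use std::io::{ErrorKind, Read};
use mio::net::TcpStream;

enum ReadOutcome {
    Progress(usize), // got data, keep going
    Spurious,        // WouldBlock: not an error, wait for the next readiness event
    Reconnect,       // 0 bytes or a real I/O error: the socket is gone
}

fn handle_readable(sock: &mut TcpStream, buf: &mut [u8]) -> ReadOutcome {
    match sock.read(buf) {
        Ok(0) => ReadOutcome::Reconnect,
        Ok(n) => ReadOutcome::Progress(n),
        Err(ref e) if e.kind() == ErrorKind::WouldBlock => ReadOutcome::Spurious,
        Err(_) => ReadOutcome::Reconnect,
    }
}
```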

I'm having some Maven/Java issues on my local machine, so I haven't been able to run the test cases that depend on a ZK server, but I have been testing heavily against the program I used in the original report of this issue. #57 will be really nice for avoiding this situation :)

Lastly, I have logs that show with the changes described above, the client is successfully reconnecting: https://gist.github.com/davidwilemski/caa2edb9dd08dd1115d9813a0a63602f

Of course, on reconnect the application must still observe the client state change and re-add any ephemeral znodes it was maintaining, but I believe this is expected behavior.

I'll get a PR up shortly with some of this description replicated in the comments but wanted to detail my debugging process while it was fresh.

bonifaido commented 5 years ago

https://crates.io/crates/zookeeper/0.5.6 has been released containing your fix @davidwilemski, thanks for the detailed explanation!

davidwilemski commented 5 years ago

Great, thanks for the quick release!