benashford / redis-async-rs

A Rust client for Redis, using Tokio
Apache License 2.0

PoisonError in reconnect.rs #60

Closed davidpdrsn closed 4 years ago

davidpdrsn commented 4 years ago

Hey! I'm seeing errors like this:

panic: Cannot obtain read lock: "PoisonError { inner: .. }", file: /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/redis-async-0.6.5/src/reconnect.rs:131:45

I cannot quite figure out under which circumstances that can happen. Do you have any hints?

Note that I'm on version 0.6.5. Do you think updating would impact this issue?

benashford commented 4 years ago

Thanks for this report.

It's not something I've seen myself; I have a large private repository that uses 0.6.5 and have not seen this error. I would expect the 0.7.x releases to behave the same way: the logic is the same, it just has a different Tokio dependency.

Could you provide more information about what's happening at the same time?

That line of code is called for every Redis command; it checks the connection state and tries to reconnect if necessary. Does this happen all the time? Occasionally? Or only after a connection has dropped?

The error message implies the lock has become poisoned, which normally happens when a panic occurs while the lock is held: https://doc.rust-lang.org/std/sync/struct.RwLock.html#poisoning. So there may be a different panic shortly before this one? That might be the root cause, with this error being a downstream side-effect.
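
As an illustration (a minimal standalone sketch, not the crate's actual code), a panic while an RwLock guard is held poisons the lock, and every later lock attempt then returns a PoisonError much like the one reported above:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let lock = Arc::new(RwLock::new(0_u32));

    // A thread panics while holding the write guard; this poisons the lock.
    let writer = Arc::clone(&lock);
    let handle = thread::spawn(move || {
        let _guard = writer.write().unwrap();
        panic!("something went wrong while the lock was held");
    });
    let _ = handle.join(); // the panic surfaces here as an Err

    // Every subsequent lock attempt now returns Err(PoisonError { .. });
    // unwrapping/expecting it produces a panic message like the one reported.
    match lock.read() {
        Ok(value) => println!("lock is healthy, value = {}", *value),
        Err(err) => println!("Cannot obtain read lock: {:?}", err),
    }
}
```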

davidpdrsn commented 4 years ago

Could you provide more information about what's happening at the same time?

From what I can gather, we saw a panic like this immediately prior in one instance:

panic: A connected state shouldn't be attempting to reconnect, file: /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/redis-async-0.6.5/src/reconnect.rs:213:33

davidpdrsn commented 4 years ago

I have done some digging through our logs and I'm now confident that the panic I posted above is indeed what poisoned the lock.

benashford commented 4 years ago

Thanks for this.

That first panic is caused by a race condition that ought to be impossible. But upon closer inspection, it is potentially possible for two attempts to reconnect the same connection to be kicked off in parallel.

I haven't seen this happen in other uses of the library, but we must have different usage patterns and yours is more likely to trigger the problem.

There are a few potential fixes:

  1. The race condition was caused because the read side of an RwLock was relinquished before the attempt to reconnect the connection (which requires a write lock). If we use a Mutex instead, we avoid this; that is what #61 fixes. The duration of the lock is still quite short in practice, as it only needs to be held long enough to check the state. (See the sketch after this list.)

  2. Not panicking if there are concurrent reconnection attempts, but silently dropping the second one if the connection has already been re-established by the first. It panics at the moment because that should (in theory) be unreachable code.

  3. Both 1 and 2.
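
To make the trade-off concrete, here is a simplified sketch of the two locking patterns. The `State` enum and function names are illustrative only, not the actual reconnect.rs code:

```rust
use std::sync::{Mutex, RwLock};

// Hypothetical connection state, only for illustration.
enum State {
    Connected,
    NotConnected,
}

// Racy pattern (what option 1 fixes): the read guard is dropped before the
// write lock is taken, so two tasks can both observe `NotConnected` and both
// attempt to reconnect.
fn maybe_reconnect_racy(state: &RwLock<State>) {
    let needs_reconnect = matches!(*state.read().unwrap(), State::NotConnected);
    // <-- another task can run here and start its own reconnection
    if needs_reconnect {
        let mut guard = state.write().unwrap();
        // Option 2: instead of panicking, just return if someone else
        // already reconnected while we were waiting for the write lock.
        if matches!(*guard, State::Connected) {
            return;
        }
        *guard = State::Connected; // stand-in for the real reconnect logic
    }
}

// Mutex pattern (option 1): check and transition under a single lock, so the
// "connected state attempting to reconnect" case cannot arise.
fn maybe_reconnect_locked(state: &Mutex<State>) {
    let mut guard = state.lock().unwrap();
    if matches!(*guard, State::NotConnected) {
        *guard = State::Connected; // stand-in for the real reconnect logic
    }
}

fn main() {
    maybe_reconnect_racy(&RwLock::new(State::NotConnected));
    maybe_reconnect_locked(&Mutex::new(State::NotConnected));
}
```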

davidpdrsn commented 4 years ago

Sounds good! Thanks for looking into it. I guess the first solution sounds like the safest choice 😊

Is there any chance you would consider a backport of the fix in a 0.6.6 release? We can't quite use 0.7 yet because we're waiting on hyper 0.14.

benashford commented 4 years ago

Yes, I'll do that. I need to review it again to make sure my assumptions are still sound, and test it a bit more, but when it's ready I'll include it in both the 0.6 and 0.7 branches.

davidpdrsn commented 4 years ago

Awesome! 🙏

benashford commented 4 years ago

This is now released in 0.6.6 and 0.7.3.

davidpdrsn commented 4 years ago

Thank you for addressing this issue so quickly.