Closed: davidpdrsn closed this issue 4 years ago.
Thanks for this report.
It's not something I've seen myself: I have a large private repository that uses 0.6.5 and have not seen this error there. I would expect the 0.7.x releases to behave the same way; the logic is the same, they just have a different Tokio dependency.
Could you provide more information about what's happening at the same time?
That line of code is called for every Redis command: it checks the connection state and tries to reconnect if necessary. Does this happen all the time? Occasionally? Or only after a connection has dropped?
The error message implies the lock has become poisoned, which normally happens if a panic occurs while the lock is held: https://doc.rust-lang.org/std/sync/struct.RwLock.html#poisoning. So there may have been a different panic shortly before this one? That might be the root cause, with this error being a downstream side-effect.
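For anyone unfamiliar with poisoning, here is a minimal, self-contained sketch (not code from this library) showing how a panic while a `std::sync::RwLock` guard is held produces exactly this kind of error on the next access:

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0_u32);

    // Simulate a thread panicking while it holds the write guard.
    let _ = panic::catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.write().unwrap();
        panic!("simulated panic while the lock is held");
    }));

    // Every later lock attempt now returns a PoisonError, which is the
    // "poisoned lock" failure reported in this issue.
    match lock.read() {
        Ok(value) => println!("value: {}", value),
        Err(err) => println!("lock poisoned: {}", err),
    }
}
```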
> Could you provide more information about what's happening at the same time?
From what I can gather, we saw a panic like this immediately prior in one instance:

```
panic: A connected state shouldn't be attempting to reconnect, file: /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/redis-async-0.6.5/src/reconnect.rs:213:33
```
I have done some digging through our logs and I'm now confident that the panic I posted above is indeed what triggered the poisoned mutex.
Thanks for this.
That first panic is caused by a race condition that ought to be impossible. But upon closer inspection, it is possible for two attempts to reconnect the same connection to be kicked off in parallel.
I haven't seen this happen in other uses of the library, but we must have different usage patterns and yours is more likely to trigger the problem.
There are a few potential fixes:

1. The race condition arises because the read side of a RwLock was relinquished before the attempt to reconnect the connection (which requires a write lock). So if we use a Mutex instead, we avoid this; that is what #61 does. The duration of the lock is still quite short in practice, as it only needs to be held long enough to check the state. (See the sketch after this list.)
2. Don't panic if there are concurrent reconnections; silently drop the second attempt instead, since the connection has already been re-established by the first. It panics at the moment because this should (in theory) be unreachable code.
3. Both 1 and 2.
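To make the race concrete, here is a hypothetical sketch of the check-then-reconnect pattern described in point 1, along with the Mutex-based shape of the fix. The type and function names are mine for illustration, not the actual redis-async internals:

```rust
use std::sync::{Mutex, RwLock};

// Hypothetical connection state, loosely modelled on the description
// above; these are not the actual redis-async types.
enum State {
    Connected,
    NotConnected,
}

struct RacyConn {
    state: RwLock<State>,
}

impl RacyConn {
    fn ensure_connected(&self) {
        // The state is checked under a read lock, and the read guard is
        // dropped at the end of this statement...
        let needs_reconnect = matches!(*self.state.read().unwrap(), State::NotConnected);

        // ...which opens a window: two threads can both observe
        // NotConnected and both arrive here. The second one to acquire
        // the write lock finds the state already Connected, i.e. the
        // "impossible" branch that panicked.
        if needs_reconnect {
            let mut state = self.state.write().unwrap();
            match *state {
                State::NotConnected => *state = State::Connected, // reconnect here
                State::Connected => {
                    panic!("A connected state shouldn't be attempting to reconnect")
                }
            }
        }
    }
}

struct FixedConn {
    state: Mutex<State>,
}

impl FixedConn {
    fn ensure_connected(&self) {
        // With a single Mutex held across both the check and the
        // transition, no other thread can act on a stale observation,
        // so the window above disappears.
        let mut state = self.state.lock().unwrap();
        if matches!(*state, State::NotConnected) {
            *state = State::Connected; // reconnect here
        }
    }
}

fn main() {
    let conn = FixedConn { state: Mutex::new(State::NotConnected) };
    conn.ensure_connected();
}
```

In the Mutex version, a second concurrent caller simply observes Connected and does nothing, which is also the behaviour point 2 aims for.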
Sounds good! Thanks for looking into it. I guess the first solution sounds like the safest choice 😊
Is there any chance you would consider a backport of the fix in a 0.6.6 release? We can't quite use 0.7 yet because we're waiting on hyper 0.14.
Yes, I'll do that. I need to review it again to make sure my assumptions are still sound, and test it a bit more, but when it's ready I'll include it in both the 0.6 and 0.7 branches.
Awesome! 🙏
This is now released in 0.6.6 and 0.7.3.
Thank you for addressing this issue so quickly.
Hey! I'm seeing errors like this:
I cannot quite figure out under which circumstances that can happen. Do you have any hints?
Note that I'm on version 0.6.5. Do you think updating would impact this issue?