Closed: sandepudi closed this issue 4 years ago.
Would it be possible for your functions to not panic when Redis is unavailable? Especially in goroutines that are started by functions like queue.StartConsuming. Since there's no way to recover from a panic in a "child" goroutine, we are unable to handle cases when our Redis instance becomes temporarily unavailable and our consumers begin dying.
Maybe queue.consume() could recover from redisErrIsNil's panic and return an error to StartConsuming (through a channel)? Or better yet, redisErrIsNil doesn't panic at all and simply returns errors that your users can handle :)
Or if you could let me know how you guys use rmq and handle such temporary Redis downtimes, that would be great too.
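To illustrate what I mean, here is a minimal sketch of the pattern (consumeLoop and doConsume are placeholder names, not rmq's actual internals): a worker goroutine recovers from a panic and forwards it as an error over a channel, so the caller can decide how to react instead of the whole process crashing.

```go
package main

import "fmt"

// consumeLoop stands in for the internal consume goroutine: instead of letting
// a panic kill the process, it recovers and reports the failure over a channel.
func consumeLoop(errChan chan<- error) {
	defer func() {
		if r := recover(); r != nil {
			errChan <- fmt.Errorf("consume loop failed: %v", r)
		}
	}()
	doConsume()
}

// doConsume simulates the real consume work panicking on a Redis error.
func doConsume() {
	panic("redis: connection refused")
}

func main() {
	errChan := make(chan error, 1)
	go consumeLoop(errChan)
	if err := <-errChan; err != nil {
		fmt.Println("caller decides what to do: back off, retry, alert:", err)
	}
}
```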
@sandepudi: Thanks for bringing this up. I started looking into the implementation and it looks like a mess. Check it out yourself and feel free to move it forward: https://github.com/adjust/rmq/compare/8-dont-panic
I just started going through everything that panics and returning errors where possible. I haven't even introduced channels yet to bring the errors back from distant goroutines. While fixing up the tests I gave up for now; there is error handling everywhere.
And for us this is mostly noise, because we just want to panic in all of those cases. What is a producer good for if it can't produce because the queue system is unreachable? What is a consumer good for if it can't reach Redis to consume packages? And if the cleaner still has access but the consumer doesn't, it will clean up the consumer because the consumer can't maintain its heartbeat.
I don't like how much this changes the interface. What do you think?
@wellle Thanks for the detailed response.
I understand your philosophy for using panics. But I feel that there are advantages to returning errors: the client can choose to handle them by, say, waiting and retrying, or sending an email saying that the Redis queue is down, etc. We lose all that ability if the program just exits.
Generally this approach of returning explicit errors is recommended. http://blog.golang.org/defer-panic-and-recover ("The convention in the Go libraries is that even when a package uses panic internally, its external API still presents explicit error return values."). However, as noted above, I understand that this would require a lot of code change and in your setup it doesn't add a lot of value.
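For example, if Publish returned an error, a caller could do something like this (a rough sketch; publishWithRetry is a made-up helper, not part of rmq):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// publishWithRetry shows how a caller could react to errors: wait and retry,
// and only give up (and alert) once Redis has stayed unreachable for a while.
func publishWithRetry(publish func() error, attempts int) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = publish(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i) * time.Second) // back off before retrying
	}
	return fmt.Errorf("queue still unreachable after %d attempts: %w", attempts, err)
}

func main() {
	// Simulated publish that always fails, standing in for a Redis outage.
	failing := func() error { return errors.New("redis: connection refused") }
	if err := publishWithRetry(failing, 3); err != nil {
		fmt.Println("time to send that email:", err)
	}
}
```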
We really loved the concept of rmq and all the nifty utilities you provided for stats, cleaning and purging. However, in our current system we have multiple producers and multiple consumers running on multiple VMs all interacting with a single Redis cluster. So whenever Redis becomes temporarily unreachable (network issues, upgrading etc), all the programs will die without any notification and we will have to restart all of them - which is a bit painful :(
@sandepudi: Actually I just did some tests and learned that there seems to be some sort of automatic Redis reconnect I wasn't aware of. I updated example/producer.go and example/consumer.go to check if Publish, Ack and Reject succeeded and made the producer sleep on error.
While both were running, I shut down my local Redis (with launchctl on Mac OS), observed the error logs, then turned Redis on again and saw that both producer and consumer did in fact recover.
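In essence the updated producer now does something like this (a simplified sketch, not the exact contents of example/producer.go; it relies on Publish reporting success with a bool):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/adjust/rmq"
)

func main() {
	// Connection parameters as in the examples; adjust to your setup.
	connection := rmq.OpenConnection("producer", "tcp", "localhost:6379", 1)
	queue := connection.OpenQueue("things")

	for i := 0; ; i++ {
		if !queue.Publish(fmt.Sprintf("delivery %d", i)) {
			// Redis unreachable: log, back off and try again instead of dying.
			log.Printf("failed to publish, sleeping before retry")
			time.Sleep(time.Second)
			continue
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```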
While this sounds nice in general, it can lead to the following problem: say a consumer loses its Redis connection for longer than the heartbeat duration. The cleaner considers that connection dead and cleans it up. If the connection then recovers, the consumer just keeps consuming. But the cleaner already removed some meta information about the connection and the queues it consumed, and that information is not recreated in this case. So we basically have a zombie connection now. It might still work, but if it dies now the cleaner can't clean it up properly, so unacked packages might be lost.
So for a safe cleanup it is currently important that connections whose heartbeat was lost don't just come back at random.
We could still try the following: when updating the heartbeat (increasing the TTL of the heartbeat key) we always first check if the heartbeat is currently alive (the key has a positive TTL). So when we find ourselves wanting to update the heartbeat but see that we have been dead (and were possibly cleaned up), we don't set the key again (we stay dead).
This means we would be able to recover from Redis connection errors as long as Redis comes back before the heartbeat duration passes (currently one minute). Any longer connection errors are still unrecoverable (unless we spend much more effort to properly repair the cleaner meta data). We could consider panicking in that unrecoverable case, or just keep failing forever.
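Roughly like this (only a sketch against go-redis, not the actual rmq implementation; the key and function names are made up):

```go
package heartbeat

import (
	"context"
	"errors"
	"time"

	"github.com/go-redis/redis/v8"
)

var errHeartbeatLost = errors.New("heartbeat expired, connection considered dead")

// updateHeartbeat only refreshes the heartbeat key if it is still alive.
// If the key already expired we may have been cleaned up, so we must not
// come back as a zombie connection: we report the loss instead.
func updateHeartbeat(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) error {
	remaining, err := rdb.TTL(ctx, key).Result()
	if err != nil {
		return err // Redis unreachable: report and try again on the next tick
	}
	if remaining <= 0 {
		return errHeartbeatLost // key missing or expired: stay dead
	}
	return rdb.Expire(ctx, key, ttl).Err() // still alive: extend the TTL
}
```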
Please let me know what you think about all that.
@wellle Your idea sounds great. If we can make the heartbeat duration user configurable, panicking in the final unrecoverable case would be fine I guess.
Hi dudes - wondering where/if you landed anywhere on this. Like @sandepudi we want to use a Redis cluster so that queues can be sharded (no single queue is bigger than a single server, but we'd like to increase overall throughput to a single queue because we read/write quite a bit).
I think Redis Cluster answers it nicely, but I saw this issue when searching for "redis cluster".
PS: We ship everything in docker, so I think the ultimate worst case of a panic is "docker restarts the container" which isn't really a huge deal (though not perfect!) Any thoughts would be mucho-appreciated :P
@genexp: Thanks for your interest! We haven't really made any progress here since then, so we're still in the same situation. Personally we didn't have any issues related to this, so I'm not aware of the impact here. If you use rmq and see your application crash because of such short-term Redis connection drops, then please let us know.
I think the strategy above is still generally feasible (check existence of heartbeat key before updating its TTL, or only allow a limited number of consecutive heartbeat update errors), but I'm still a bit reluctant to change the interface everywhere to allow returning errors in cases where we currently panic (on every Redis error basically).
Not to mention the case where we fail to ack a delivery because Redis is unavailable. That means you successfully handled a delivery, but can't reflect that in Redis. So if we end up panicking later on (because the connection didn't recover), then you'd get a double delivery. Currently you still get that double delivery, but it should only be a few (as we currently panic soon). If instead we postponed panicking, that might mean we process more packages that will end up being delivered twice.
So in summary it seems like some effort to change this with potentially dangerous side effects.
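For completeness, the failure counter variant could look roughly like this (again only a sketch, none of these names exist in rmq): tolerate a limited number of consecutive heartbeat update failures before treating the connection as lost.

```go
package heartbeat

import "log"

// heartbeatTracker is a made-up helper: it counts consecutive heartbeat
// update failures and only gives up once maxFailures is reached.
type heartbeatTracker struct {
	failures    int
	maxFailures int
	onLost      func() // e.g. stop consuming, or panic as before
}

func (t *heartbeatTracker) record(err error) {
	if err == nil {
		t.failures = 0 // recovered: start counting from scratch
		return
	}
	t.failures++
	log.Printf("heartbeat update failed (%d/%d): %s", t.failures, t.maxFailures, err)
	if t.failures >= t.maxFailures {
		t.onLost()
	}
}
```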
@eightnoteight: Hey, thanks for the kind words! I'd very much invite you to work on this. Did you read all of the above and have a clear picture of what needs to be taken care of? Especially the heartbeat seems somewhat challenging to get right. Maybe you want to describe here how exactly you're planning to address those points before going all in?
re: https://github.com/adjust/rmq/blob/master/redis_wrapper.go#L96
log.Panicf("rmq redis error is not nil %s", err)
This code should not panic and crash the whole app. We are getting multiple panics on any single network timeout with a remote Redis. When we change this to
log.Printf("rmq redis error is not nil %s", err)
it has worked perfectly fine for us for a long time.
To whom it may concern: A PR has been opened to finally address this issue: https://github.com/adjust/rmq/pull/88
Would appreciate any feedback on the chosen approach 🙏