Are you sure you have access to your message broker? I got the same error when my broker was down.
Yes, Redis was working. Before and after that, I queued tasks in Redis and everything worked. I cannot reproduce the error, but I will keep an eye on it.
I've seen this pop up after worker reloads with HUP since 1.3.0 using Redis broker, but it seems no messages are being dropped.
Well, this error also occurred for me while Redis was working, so I'm subscribing to this issue.
I think this is a duplicate of #224. Please give 1.8.0 a try (it will be released later today) and let me know in that issue if you still encounter this problem.
I was seeing this on 1.7.0 and am still seeing it on 1.8.0.
We're running Dramatiq on Heroku, whenever a new release is deployed we get one instance of this crash per Dramatiq worker.
Let me know if there's anything else that would be helpful to debug this.
Some follow-up on this: I think this bug is actually fixed in 1.8.0.
Sorry for the hazy details, but we had some events that were in a weird stuck state from this bug. A broker would hold on to them forever. I ended up re-queuing them and everything is good now - not seeing this exception anymore during deploys.
Thank you for following up! Closing this again. If anyone else runs into this problem on 1.8, please let me know.
We started seeing this exact issue recently with both 1.8 and 1.9, after initially running without any problems for several months.
I suspect it's related to worker reloads re-enqueuing messages.
A possible clue: our Sentry Redis integration indicates that the last Redis command prior to the exception sets the "do maintenance" flag to 1, not sure if it's related (see the 2nd-to-last column below):
redis | EVALSHA 'e9668bc413bd4a2d63c8108b124f5b7df0d01263' 1 'api-broker' 'fetch' 1594306262596 'default.DQ' '43422391-f71b-4493-82b5-25ca6f01534a' 60000 604800000 1 7992
Is there any command we could run to purge/clear everything in Redis, including any messages held on to by workers? Or will these stale messages eventually time out in a week (if I read the expiration code correctly)?
Thanks for the report, @mparent61 . Was there any other relevant information in your Redis log around that time? How often does this occur?
As far as clearing out Redis goes, the safest approach would be to stop all the workers and then delete all the keys in Redis prefixed by `dramatiq`. If you only use Redis with Dramatiq, then it would also be safe to just run `flushall`.
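For example, a rough sketch of that cleanup using redis-py might look like the following. The connection settings and the `dramatiq*` pattern are assumptions (the pattern matches the default key prefix mentioned above); stop all workers before running it.

```python
# Sketch: delete all Dramatiq-related keys after stopping the workers.
# Connection settings and the "dramatiq*" pattern are assumptions.
import redis

client = redis.Redis(host="localhost", port=6379, db=0)

# SCAN iterates incrementally, so this stays safe on large databases
# (unlike KEYS, which blocks the server while it enumerates everything).
for key in client.scan_iter(match="dramatiq*", count=1000):
    client.delete(key)
```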
It occurs on every deployment (stop workers, start new workers), usually 1-10 minutes after the new workers start up, and only 1 new worker has the error each time, then restarts and seems to be fine.
This only occurs in 1 of our 2 separate Redis/Dramatiq clusters, so seems like our cluster is somehow stuck in a bad state.
I ran `flushall` via the Redis CLI and that seems to have fixed the problem - thanks for the suggestion.
Unfortunately I don't have any Redis logs available (we're using AWS ElastiCache and they're not on by default) - but will check logs if this starts happening again.
For what it's worth, I'm seeing this in 1.10.0 as well. I don't recall it ever happening before, but now it's happening. Curious to know if it has anything to do with the `redis` dependency version or something.
I have lots of these errors per week in one project running 1.9.0, so the bug is at least that old. Not sure what version it was introduced in, however.
We run using the Redis broker; maybe that's the common factor here (as `flushall` seemed to resolve it in a previous comment). Is bad data possibly being requeued? We also use Sentry (a previous post that discusses the requeue theory has this setup too).
We use Sentry as well, so +1 on that possibility I guess?
I'm using Bugsnag, not Sentry, and I'm seeing those as well
The exact same exception is raised whenever one of the Redis message IDs is not present in the `.msgs` HSET. This is because the `HMGET` call returns empty results (later translated to `None`) if the message ID is not found. IIUC, this causes all Redis messages in the cache (`RedisConsumer.message_cache`) to be lost and not requeued. One can reproduce this by pushing a bogus Redis ID onto the queue (`RPUSH dramatiq:default this_id_does_not_exist`).
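A quick sketch of that reproduction with redis-py could look like the following; the `.msgs` key name is an assumption based on the default `dramatiq` namespace, so only run it against a throwaway queue.

```python
# Minimal sketch of the reproduction described above, against a local Redis.
import redis

client = redis.Redis()

# Push an id that has no corresponding payload in the .msgs hash.
client.rpush("dramatiq:default", "this_id_does_not_exist")

# HMGET returns None for any field missing from the hash, which is the value
# the consumer ends up holding for that bogus id.
print(client.hmget("dramatiq:default.msgs", "this_id_does_not_exist"))  # [None]
```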
IIUC, duplicate message IDs can be pushed to the queue when two workers independently perform maintenance on the same dead worker.
@Bogdanp the simplest fix for this would be to skip over the `None` messages returned from Redis in the `__next__` call in the RedisBroker. However, I'm not sure whether this would just make the error even harder to debug in the case of a race condition (the Redis ID is first put in the queue, then added to the `.msgs` HSET). I'm happy to prepare a pull request with such an implementation. Any thoughts here?
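As a rough illustration of what that skip might look like (this is not Dramatiq's actual consumer code; the helper and key names are hypothetical):

```python
# Illustration of the proposed guard: given a batch of ids pulled off the
# queue, drop (and log) the ones whose payload is missing from the .msgs hash
# instead of failing the whole batch.
import logging

import redis

logger = logging.getLogger(__name__)
client = redis.Redis()


def fetch_payloads(queue_name, message_ids):
    """Return (id, payload) pairs, skipping ids with no stored payload."""
    payloads = client.hmget(f"dramatiq:{queue_name}.msgs", message_ids)
    results = []
    for message_id, payload in zip(message_ids, payloads):
        if payload is None:
            # Proposed behaviour: warn and move on, so one stale id does not
            # take the rest of the cached messages down with it.
            logger.warning("Dropping message id %r with no stored payload.", message_id)
            continue
        results.append((message_id, payload))
    return results
```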
> IIUC, duplicate message IDs can be pushed to the queue when two workers independently perform maintenance on the same dead worker.
That's not possible. When that Lua script runs, it's the only thing that can run on the Redis DB. Think of it kind of like a serializable transaction; other scripts have to wait for it to finish before they can run.
For that same reason, it seems unlikely that you'd have an id in the queue without a corresponding entry in the `.msgs` hash, unless you have key expiration turned on. Then again, there could definitely be some kind of weird interaction where the state of things ends up that way.
Maybe the right fix here is to make the Lua script a little more defensive (a rough sketch follows the list):

- `ack` and `nack` should check that `srem` succeeds before doing anything else
- the `rpush` in the `do_maintenance` block should only push if the message `hexists` in `queue_messages`
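As a sketch of the second bullet, expressed as a small server-side script loaded with redis-py; the key names and message id are placeholders, and the real maintenance script is much larger than this.

```python
# Illustrative only: requeue a message id only if its payload still exists.
import redis

DEFENSIVE_REQUEUE = """
-- Requeue a message id only if its payload is still present in the .msgs hash.
local queue = KEYS[1]
local msgs = KEYS[2]
local message_id = ARGV[1]
if redis.call("hexists", msgs, message_id) == 1 then
    redis.call("rpush", queue, message_id)
    return 1
end
return 0
"""

client = redis.Redis()
defensive_requeue = client.register_script(DEFENSIVE_REQUEUE)

# Returns 1 if the id was requeued, 0 if it was skipped as stale.
requeued = defensive_requeue(
    keys=["dramatiq:default", "dramatiq:default.msgs"],
    args=["some-message-id"],
)
```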
Would love to see this patch go out in a release. I'm using the Dramatiq Redis broker and ran into this error. It started happening when I was testing the worker's behavior after an unexpected shutdown while running a long task.
We are noticing this error frequently in our production environment too. A patch/fix would be really helpful, as it is flooding our Sentry with errors.
@azhard4int are you running v1.11?
Closing for now since no new issues have been reported in a while. Feel free to reopen if you're still impacted by this.
dramatiq 1.7.0
I copied the worker, reloaded it with `kill -HUP 17861`, and sent one task to the queue (repeated several times; this can be seen from the log). At some point, the error appeared in the log. After this error, tasks ran and finished normally.