Are you sure you have access to your message broker? I got the same error when my broker was down.
Yes, Redis was working. Before and after that, I queued tasks in Redis and everything worked. I cannot reproduce the error, but I will keep an eye on it.
I've seen this pop up after worker reloads with HUP since 1.3.0 using Redis broker, but it seems no messages are being dropped.
Well, this error also occurred for me while Redis was working, so I'm subscribing to this issue.
I think this is a duplicate of #224. Please give 1.8.0 a try (it will be released later today) and let me know in that issue if you still encounter this problem.
I was seeing this on 1.7.0 and am still seeing it on 1.8.0.
We're running Dramatiq on Heroku, whenever a new release is deployed we get one instance of this crash per Dramatiq worker.
Let me know if there's anything else that would be helpful to debug this.
Some follow-up on this: I think this bug is actually fixed in 1.8.0.
Sorry for the hazy details, but we had some events that were in a weird stuck state from this bug. A broker would hold on to them forever. I ended up re-queuing them and everything is good now - not seeing this exception anymore during deploys.
Thank you for following up! Closing this again. If anyone else runs into this problem on 1.8, please let me know.
We started seeing this exact issue recently with both 1.8 and 1.9, after initially running without any problems for several months.
I suspect it's related to worker reloads re-enqueuing messages.
A possible clue: our Sentry Redis integration indicates that the last Redis command prior to the exception sets the "do maintenance" flag to 1, not sure if it's related (see the 2nd-to-last column below):
redis | EVALSHA 'e9668bc413bd4a2d63c8108b124f5b7df0d01263' 1 'api-broker' 'fetch' 1594306262596 'default.DQ' '43422391-f71b-4493-82b5-25ca6f01534a' 60000 604800000 1 7992
Is there any command we could run to purge/clear everything in Redis, including any messages held on to by workers? Or will these stale messages eventually time out in a week (if I read the expiration code correctly)?
Thanks for the report, @mparent61 . Was there any other relevant information in your Redis log around that time? How often does this occur?
As far as clearing out Redis goes, the safest approach would be to stop all the workers and then delete all the keys in Redis prefixed by `dramatiq`. If you only use Redis with Dramatiq, then it would also be safe to just run `flushall`.
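For example, a rough sketch of that cleanup using redis-py might look like the following. The connection settings and the `dramatiq*` pattern are assumptions (the pattern matches the default key prefix mentioned above); stop all workers before running it.

```python
# Sketch: delete all Dramatiq-related keys after stopping the workers.
# Connection settings and the "dramatiq*" pattern are assumptions.
import redis

client = redis.Redis(host="localhost", port=6379, db=0)

# SCAN iterates incrementally, so this stays safe on large databases
# (unlike KEYS, which blocks the server while it enumerates everything).
for key in client.scan_iter(match="dramatiq*", count=1000):
    client.delete(key)
```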
It occurs on every deployment (stop workers, start new workers), usually 1-10 minutes after the new workers start up, and only 1 new worker has the error each time, then restarts and seems to be fine.
This only occurs in 1 of our 2 separate Redis/Dramatiq clusters, so seems like our cluster is somehow stuck in a bad state.
I ran `flushall` via the Redis CLI and that seems to have fixed the problem - thanks for the suggestion.
Unfortunately I don't have any Redis logs available (we're using AWS ElastiCache and they're not on by default) - but will check logs if this starts happening again.
For what it's worth, I'm seeing this in 1.10.0 as well. I don't recall it ever happening before, but now it's happening. Curious to know if it has anything to do with the `redis` dependency version or something.
I have lots of these errors per week in one project running 1.9.0, so the bug is at least that old. Not sure what version it was introduced in, however.
We run using the Redis broker; maybe that's the common factor here (as `flushall` seemed to resolve it in a previous comment). Is bad data possibly being requeued? We also use Sentry (a previous post that discusses the requeue theory has this setup too).
We use Sentry as well, so +1 on that possibility I guess?
I'm using Bugsnag, not Sentry, and I'm seeing those as well
The exact same exception is raised whenever one of the Redis message IDs is not present in the `.msgs` HSET. This is because the `HMGET` call returns empty results (later translated to `None`) if the message ID is not found. IIUC, this causes all Redis messages in the cache (`RedisConsumer.message_cache`) to be lost and not requeued. One can reproduce this by pushing a bogus Redis ID onto the queue (`RPUSH dramatiq:default this_id_does_not_exist`).
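A quick sketch of that reproduction with redis-py could look like the following; the `.msgs` key name is an assumption based on the default `dramatiq` namespace, so only run it against a throwaway queue.

```python
# Minimal sketch of the reproduction described above, against a local Redis.
import redis

client = redis.Redis()

# Push an id that has no corresponding payload in the .msgs hash.
client.rpush("dramatiq:default", "this_id_does_not_exist")

# HMGET returns None for any field missing from the hash, which is the value
# the consumer ends up holding for that bogus id.
print(client.hmget("dramatiq:default.msgs", "this_id_does_not_exist"))  # [None]
```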
IIUC, duplicate message IDs can be pushed to the queue when two workers independently perform maintenance on the same dead worker.
@Bogdanp the simplest fix for this would be to skip over the `None` messages returned from Redis in the `__next__` call in the RedisBroker. However, I'm not sure whether this would just make the error even harder to debug in the case of a race condition (the Redis ID is first put in the queue, then added to the `.msgs` HSET). I'm happy to prepare a pull request with such an implementation. Any thoughts here?
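As a rough illustration of what that skip might look like (this is not Dramatiq's actual consumer code; the helper and key names are hypothetical):

```python
# Illustration of the proposed guard: given a batch of ids pulled off the
# queue, drop (and log) the ones whose payload is missing from the .msgs hash
# instead of failing the whole batch.
import logging

import redis

logger = logging.getLogger(__name__)
client = redis.Redis()


def fetch_payloads(queue_name, message_ids):
    """Return (id, payload) pairs, skipping ids with no stored payload."""
    payloads = client.hmget(f"dramatiq:{queue_name}.msgs", message_ids)
    results = []
    for message_id, payload in zip(message_ids, payloads):
        if payload is None:
            # Proposed behaviour: warn and move on, so one stale id does not
            # take the rest of the cached messages down with it.
            logger.warning("Dropping message id %r with no stored payload.", message_id)
            continue
        results.append((message_id, payload))
    return results
```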
> IIUC, duplicate message IDs can be pushed to the queue when two workers independently perform maintenance on the same dead worker.
That's not possible. When that Lua script runs, it's the only thing that can run on the Redis DB. Think of it kind of like a serializable transaction; other scripts have to wait for it to finish before they can run.
For that same reason, it seems unlikely that you'd have an id in the queue without a corresponding entry in the `.msgs` hash, unless you have key expiration turned on. Then again, there could definitely be some kind of weird interaction where the state of things ends up that way.
Maybe the right fix here is to make the Lua script a little more defensive (a rough sketch follows the list):

- `ack` and `nack` should check that `srem` succeeds before doing anything else
- the `rpush` in the `do_maintenance` block should only push if the message `hexists` in `queue_messages`
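As a sketch of the second bullet, expressed as a small server-side script loaded with redis-py; the key names and message id are placeholders, and the real maintenance script is much larger than this.

```python
# Illustrative only: requeue a message id only if its payload still exists.
import redis

DEFENSIVE_REQUEUE = """
-- Requeue a message id only if its payload is still present in the .msgs hash.
local queue = KEYS[1]
local msgs = KEYS[2]
local message_id = ARGV[1]
if redis.call("hexists", msgs, message_id) == 1 then
    redis.call("rpush", queue, message_id)
    return 1
end
return 0
"""

client = redis.Redis()
defensive_requeue = client.register_script(DEFENSIVE_REQUEUE)

# Returns 1 if the id was requeued, 0 if it was skipped as stale.
requeued = defensive_requeue(
    keys=["dramatiq:default", "dramatiq:default.msgs"],
    args=["some-message-id"],
)
```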
Would love to see this patch go out in a release. I'm using the Dramatiq Redis broker and ran into this error. It started happening when I was testing the worker's behavior after an unexpected shutdown while running a long task.
We are noticing this error frequently in our production environment too. A patch/fix would be really helpful, as it is flooding our Sentry with errors.
@azhard4int are you running v1.11?
Closing for now since no new issues have been reported in a while. Feel free to reopen if you're still impacted by this.
dramatiq 1.7.0
I copied the worker, reloaded it with `kill -HUP 17861`, and sent one task to the queue (repeated several times; this can be seen from the log). At some point, the error appeared in the log. After this error, tasks ran and finished normally.