celery / kombu

Messaging library for Python.
http://kombu.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Table empty or key no longer exists #1063

Closed sreedharbukya closed 3 years ago

sreedharbukya commented 5 years ago

The issue is that a Redis key gets evicted every time. I read through an old issue on this. I have confirmed that my Redis instance has not been hacked; in fact, we are using a secured Redis instance.

OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)

kombu==4.5.0 celery==4.3.0 redis==3.2.1

Is this an issue with Redis?

tbolis commented 5 years ago

Same issue here. Is there any workaround for this? The Celery workers freeze after that and we need to restart them.

thedrow commented 5 years ago

This can happen if you set Redis to LRU mode or something similar. Please configure Redis correctly and increase the memory of your Redis instance.

Feel free to comment if you find this is still an issue with Kombu.

ra-coder commented 5 years ago

What do you mean by 'configure Redis correctly'?

I have the same problem in a Flask app with the following config.py:

# redis
REDIS_URL = os.environ['REDIS_URL']

# flask-caching
CACHE_TYPE = 'redis'
CACHE_KEY_PREFIX = 'glue_flask_cache_'
CACHE_REDIS_URL = REDIS_URL

thedrow commented 5 years ago

Check your redis.conf, specifically maxmemory-policy. If it's set to noeviction or does not have a value, we may have a problem in Celery.
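
For reference, a minimal sketch of checking these settings with the redis-py client already used in this stack (the host/port/db here are assumptions; adjust them to match your broker URL):

import redis

# Connect to the broker's Redis instance (hypothetical local defaults).
r = redis.Redis(host='localhost', port=6379, db=0)

# CONFIG GET returns a dict of setting name -> value.
print(r.config_get('maxmemory-policy'))  # e.g. {'maxmemory-policy': 'noeviction'}
print(r.config_get('maxmemory'))         # '0' means no memory limit is set
print(r.config_get('timeout'))           # '0' means idle clients are never disconnected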

danleyb2 commented 5 years ago

I faced this same issue on the first queue whenever I started a second (or more) queues.

Fixed it by downgrading from kombu==4.6.5 to kombu==4.5.0.

It had nothing to do with Redis; the missing key _kombu.binding.reply.celery.pidbox is simply never created, as you can see if you watch with redis-cli monitor.
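
A minimal sketch (assuming a local Redis on the default broker database) of checking whether that binding key exists, using the redis-py client instead of redis-cli monitor:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)  # adjust to your broker URL
key = '_kombu.binding.reply.celery.pidbox'

# EXISTS returns 0 if the key was never created or has been removed.
print(r.exists(key))
# SMEMBERS shows the reply-queue bindings stored in the set, if any.
print(r.smembers(key))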

LuRsT commented 5 years ago

I hit the same issue. @danleyb2, did you figure out what the problem is with the current version?

Update: Downgrading to v4.5.0 solved the issue. Thanks @danleyb2

auvipy commented 5 years ago

This is present in Celery's Redis integration tests as well!

LuRsT commented 5 years ago

I noticed @auvipy, any plans on fixing it? Do you need any help?

auvipy commented 5 years ago

Yes, if you have time!

StingyJack commented 5 years ago

I was having this problem with kombu 4.5.0 when using Celery as a service in a docker-compose pod that included a Redis server image and a few app images. When I used up -d <serviceName> and started the services individually, beginning with Redis, the error would show up in the logs repeatedly. When I used up -d without a service name, the problem seemed to go away.

Edit: the version I named is likely incorrect. Our project's setup.py was missing a comma between version specifiers, so it was resolving and applying whatever version satisfied the concatenation of the min and max constraints, which at times would have been an affected package version.
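
A hypothetical illustration (not our actual setup.py) of the missing-comma problem described above: Python silently concatenates adjacent string literals, so two version constraints collapse into a single garbled requirement string.

# Intended: two separate constraints for the same dependency.
install_requires = [
    "kombu>=4.5.0"   # <-- missing comma here
    "kombu<4.7.0",
]
print(install_requires)  # ['kombu>=4.5.0kombu<4.7.0'] -- one bogus requirement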

killthekitten commented 5 years ago

Looks like the reason is #1087. The bug showed up last week, after 4.6.4 -> 4.6.5 migration.

killthekitten commented 5 years ago

@auvipy could you point to the failing integration test, please? I couldn't reproduce the bug locally, so I just pinned the version to 4.6.4 blindly.

boomxy commented 5 years ago

Looks like the reason is #1087. The bug showed up last week, after 4.6.4 -> 4.6.5 migration.

Thank you, 4.6.4 works!

jorijinnall commented 5 years ago

Had the same issue. I fixed it by downgrading kombu from 4.6.5 to 4.6.3. I still had the bug in version 4.6.4.

travishen commented 5 years ago

same issue here

celery==4.3.0
redis==3.2.1
kombu==4.6.3  # downgraded because of a flower issue https://github.com/mher/flower/issues/909

I found that the error starts to occur when a worker is recreated (e.g. during k8s pod scaling) and then affects all the other workers. The workers have additional settings: concurrency (prefork) and --max-memory-per-child.

kravietz commented 5 years ago

kombu==4.6.3 fixed it for me -- had the same issue with Celery worker crashing.

auvipy commented 5 years ago

what about kombu==4.6.4?

chris-griffin commented 5 years ago

Downgrading from 4.6.5 to 4.6.4 worked for us @auvipy when using celery 4.4.0rc3 (with https://github.com/celery/celery/commit/8e34a67bdb95009df759d45c7c0d725c9c46e0f4 cherry picked on top to address a different issue)

killthekitten commented 4 years ago

@auvipy why was this closed?

auvipy commented 4 years ago

Isn't it fixed with 4.6.6?

killthekitten commented 4 years ago

@auvipy it was impossible to tell from this thread (I follow every comment). Thanks!

LuRsT commented 4 years ago

@killthekitten It seems to be fixed; last month we stopped pinning kombu and it seems to be working with 4.6.6.

We use it with celery btw.

jison commented 4 years ago

I had the same issue.

redis==3.2.1 celery==4.4.0 kombu==4.6.7

LuRsT commented 4 years ago

@Jison I got redis==3.3.11 over here, not sure if that's the cause of the issue, but it won't hurt to upgrade it.

jison commented 4 years ago

@Jison I got redis==3.3.11 over here, not sure if that's the cause of the issue, but it won't hurt to upgrade it.

I got this:

InconsistencyError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

File "kombu/connection.py", line 439, in _reraise_as_library_errors yield File "kombu/connection.py", line 518, in _ensured return fun(*args, kwargs) File "kombu/messaging.py", line 203, in _publish mandatory=mandatory, immediate=immediate, File "kombu/transport/virtual/base.py", line 605, in basic_publish message, exchange, routing_key, kwargs File "kombu/transport/virtual/exchange.py", line 70, in deliver for queue in _lookup(exchange, routing_key): File "kombu/transport/virtual/base.py", line 714, in _lookup self.get_table(exchange), File "kombu/transport/redis.py", line 839, in get_table raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key)) OperationalError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

File "celery/worker/pidbox.py", line 46, in on_message self.node.handle_message(body, message) File "kombu/pidbox.py", line 145, in handle_message return self.dispatch(body) File "kombu/pidbox.py", line 115, in dispatch ticket=ticket) File "kombu/pidbox.py", line 151, in reply serializer=self.mailbox.serializer) File "kombu/pidbox.py", line 285, in _publish_reply opts File "kombu/messaging.py", line 181, in publish exchange_name, declare, File "kombu/connection.py", line 551, in _ensured errback and errback(exc, 0) File "python3.6/contextlib.py", line 99, in exit self.gen.throw(type, value, traceback) File "kombu/connection.py", line 444, in _reraise_as_library_errors sys.exc_info()[2]) File "vine/five.py", line 194, in reraise raise value.with_traceback(tb) File "kombu/connection.py", line 439, in _reraise_as_library_errors yield File "kombu/connection.py", line 518, in _ensured return fun(*args, kwargs) File "kombu/messaging.py", line 203, in _publish mandatory=mandatory, immediate=immediate, File "kombu/transport/virtual/base.py", line 605, in basic_publish message, exchange, routing_key, kwargs File "kombu/transport/virtual/exchange.py", line 70, in deliver for queue in _lookup(exchange, routing_key): File "kombu/transport/virtual/base.py", line 714, in _lookup self.get_table(exchange), File "kombu/transport/redis.py", line 839, in get_table raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key))

staticfox commented 4 years ago

I'm still seeing this issue with 4.6.7.

celery==4.4.0 hiredis==1.0.1 kombu==4.6.7 redis==3.4.1


Edit: I've ensured timeout is 0 and the memory policy is noeviction. I've also set my workers with --without-heartbeat --without-mingle --without-gossip and we're still seeing the errors. The only thing that comes to mind is that if that particular set becomes empty, the key gets deleted regardless of settings, as per the Redis spec: https://redis.io/topics/data-types-intro#automatic-creation-and-removal-of-keys.
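
That automatic-removal behaviour is easy to demonstrate with the redis-py client; a minimal sketch (the demo key name is made up, and the connection assumes a local Redis):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

r.sadd('demo.binding', 'only-member')
print(r.exists('demo.binding'))   # 1 -- the key exists while the set has members
r.srem('demo.binding', 'only-member')
print(r.exists('demo.binding'))   # 0 -- Redis removed the now-empty set itself

This is why an empty '_kombu.binding.reply.celery.pidbox' set would simply vanish rather than stay around as an empty key.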

ryancesiel commented 4 years ago

We have also seen this with: celery==4.4.0 kombu==4.6.7 redis==3.4.1

and

kombu==4.5.0 celery==4.3.0 redis==3.2.1

Our experience has been that this runs successfully for a period of time (anywhere from ~6 days to 28 days) before a worker fails and stops consuming tasks. We've ruled those causes out: timeout is 0 and the memory policy is allkeys-lru.


Today, I was inspecting the "_kombu.binding.reply.celery.pidbox" key and noticed it is transient: I only see it in Redis while workers are processing tasks. When it exists, it is a set with no expiration:

> TTL "_kombu.binding.reply.celery.pidbox"
(integer) -1
> TYPE "_kombu.binding.reply.celery.pidbox"
set

This would suggest that the key is explicitly created and deleted OR, as @staticfox noted, the set is losing all members and being deleted by Redis, but Celery expects it to exist.

I also found this old issue log, https://github.com/celery/kombu/issues/226, which pointed to fanout_prefix and fanout_patterns in broker_transport_options. I believe this only affects shared Redis clusters for multiple Celery apps (we are the only tenant on ours)?

This does not appear to be set in our app when initializing via celery.config_from_object:

print(celery_ctx.celery.conf.humanize(with_defaults=True))
...
broker_transport_options: {
 }
...
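
For comparison with the empty dict printed above, a minimal sketch of what setting those transport options would look like (the app name and broker URL are placeholders):

from celery import Celery

app = Celery('app', broker='redis://localhost:6379/0')  # placeholder app/broker

# Documented Celery settings for the Redis transport, referenced from kombu#226.
app.conf.broker_transport_options = {
    'fanout_prefix': True,    # prefix fanout (broadcast) messages so they are scoped to this app's virtual host
    'fanout_patterns': True,  # workers subscribe only to the worker-related fanout patterns
}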

@auvipy - should this be re-opened based on recent reports?

auvipy commented 4 years ago

I am reopening, but can you try the latest celery==4.4.2 and reproduce this again?

staticfox commented 4 years ago

Same issue here after bumping to celery 4.4.2. EDIT: maybe not? EDIT 2: Nope, still same issue.

ryancesiel commented 4 years ago

We're deploying celery==4.4.2 and kombu==4.6.8 today, but I don't expect this will manifest right away (for us, it's not reliably reproducible and usually takes some time).

staticfox commented 4 years ago

Yeah, we're still seeing this pretty regularly with everything updated. I put this together during a lunch break; let me know if it helps or if I can provide any additional information or testing.

https://gist.github.com/staticfox/ee78380ff131487e0cc8175cc785330f

ryancesiel commented 4 years ago

This has reproduced twice since deploying celery 4.4.2 and kombu 4.6.8 for us. I'll update here if I find more information.

auvipy commented 4 years ago

Is this a kombu issue or your Redis conf? Can you dig deeper?

ryancesiel commented 4 years ago

Is this a kombu issue or your Redis conf? Can you dig deeper?

There's nothing odd in our redis conf based on everything I've reviewed from this thread and others: timeout is 0 and memory policy is allkeys-lru. Although we have an LRU policy, we never come close to our peak memory capacity so the LRU policy shouldn't be invoked.

I'm assuming this is a kombu issue since the exception trace originates from kombu, but I have no evidence beyond that:

kombu.exceptions.OperationalError: 
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

Other notes on our configuration

This started happening after upgrading to Celery 4.x

We upgraded from Celery 3.1.25 to Celery 4.3, kombu 4.6.3 in December 2019 and noticed this error manifest 28 days after the upgrade.

We downgraded to Celery 4.2.1, kombu 4.5.0 and redis 3.2 and had this manifest multiple times.

We recently upgraded to Celery 4.4.0 and later Celery 4.4.2, and each time this occurred several more times.

We use autoscaling

We do use autoscaling, which various issue logs have said is pseudo-deprecated in Celery 4.x (maybe coming back in 4.5/4.6/5.x). This OperationalError exception tends to occur during peak periods when autoscaling scales us up, but this isn't always the case.

Other than autoscaling, our configuration is fairly basic: 3 workers for ad-hoc jobs with --autoscale=25,5 and 3 workers processing periodic, scheduled jobs with --autoscale=5,1 (6 worker nodes total), with low utilization outside of a few daily spikes.

I'll continue investigating for patterns or anomalies.

fspot commented 4 years ago

I don't know if this helps, but we encounter the same issue with celery 4.4.0, kombu 4.6.3 or 4.6.7, and redis 3.4.1. We also use autoscaling.

auvipy commented 4 years ago

Edit: I've ensured timeout is 0 and the memory policy is noeviction. I've also set my workers with --without-heartbeat --without-mingle --without-gossip and we're still seeing the errors. The only thing that comes to mind is that if that particular set becomes empty, the key gets deleted regardless of settings, as per the Redis spec: https://redis.io/topics/data-types-intro#automatic-creation-and-removal-of-keys

the01 commented 4 years ago

We have the same issue. No autoscaling, but a long-running task in a Docker container. It does not occur for the actual task, but for the Docker health check command (celery -A worker inspect ping -d celery@celery_host -t 15).

x-7 commented 4 years ago

It seems there are many bugs with the Redis broker. Is RabbitMQ a better choice than Redis?

auvipy commented 4 years ago

As a broker, RabbitMQ is definitely better than Redis in most cases!

staticfox commented 4 years ago

I'm wondering if https://github.com/celery/celery/issues/6009 could inadvertently be shining light on this particular issue. We're seeing memory ballooning when we use inspect ping, so perhaps the worker stalling could cause the reply to lag to the point where another thread has already replied and the first thread replies later, after the key has been removed (relating to Redis deleting empty keys). I'm not too familiar with Celery's inspect internals, but I'm starting to think this particular issue is only happening as a side effect of the memory leak. Could anyone else chime in if they are noticing a large consumption of memory prior to workers freezing?
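
For context, a rough sketch of how an inspect ping round-trips through the pidbox exchanges (the app name and broker URL below are placeholders):

from celery import Celery

app = Celery('app', broker='redis://localhost:6379/0')  # placeholder app/broker

# The ping is broadcast over the 'celery.pidbox' fanout exchange; every worker
# that receives it publishes its reply to 'reply.celery.pidbox'. If the
# '_kombu.binding.reply.celery.pidbox' key has vanished by the time a slow
# worker replies, the publish fails with the error seen in this thread.
replies = app.control.inspect(timeout=5).ping()
print(replies)  # e.g. {'celery@hostname': {'ok': 'pong'}}, or None if no worker replied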


hsabiu commented 4 years ago

I started noticing this error after upgrading Celery to 4.4.2 and kombu to 4.6.8. I read through most of the suggestions in this thread to downgrade kombu to previous versions, but that did not work for me.

What eventually ended up working for me was upgrading the Redis server from version 3.2.11 to 5.0.8. Since the upgrade, I have not seen this error again and my Celery worker systemd service is no longer going into the failed state.

auvipy commented 4 years ago

:D

staticfox commented 4 years ago

We have upgraded to 5.0.6 as well and we're still seeing this issue... @hsabiu, can you clarify what changed between Redis versions that caused the problem to go away? @auvipy closed the issue, so I must be missing something here.

hsabiu commented 4 years ago

We have upgraded to 5.0.6 as well and we're still seeing this issue... @hsabiu, can you clarify what changed between Redis versions that caused the problem to go away? @auvipy closed the issue, so I must be missing something here.

@staticfox I'm not sure what changed between Redis versions; I'm merely stating what worked in my case. I tried downgrading to previous versions of Celery and kombu, but that didn't seem to fix the issue. Bumping Redis to 5.0.8 with Celery 4.4.2 and kombu 4.6.8 is what worked for me.

the01 commented 4 years ago

I have seen it with Redis 5.0.8 as well. Sometimes it happens, other times it works. Usually, once I started investigating the issue, it would not come up in my dev environment...

staticfox commented 4 years ago

@hsabiu that's understandable. @auvipy, could you elaborate on why you believe upgrading Redis resolves the issue when others have stated that they are still facing it after upgrading to the latest Redis? You closed this issue, so I'm simply still trying to find the resolution that you found.

ryancesiel commented 4 years ago

I've been running Redis server 5.0.2 with Celery 3.1.25 and then upgraded to Celery 4.3.0, 4.4.0, and 4.4.2, and experienced this error on each 4.x release. Similar to @the01, this issue doesn't reproduce reliably.

Unfortunately, I can't upgrade the Redis server version we use, but I would be surprised if a patch update resolved this, especially since we did not encounter this with Celery 3.x.

auvipy commented 4 years ago

I've been running Redis server 5.0.2 with Celery 3.1.25 and then upgraded to Celery 4.3.0, 4.4.0, and 4.4.2, and experienced this error on each 4.x release. Similar to @the01, this issue doesn't reproduce reliably.

Unfortunately, I can't upgrade the Redis server version we use, but I would be surprised if a patch update resolved this, especially since we did not encounter this with Celery 3.x.

You need to find out what your problem is.

wanaryytel commented 4 years ago

I agree that it's confusing that this issue is closed even though no reliable solution has been proposed and it is still manifesting. We're seeing this with Redis 5.0.3 and Celery 4.3.0, but it seems that the specific versions are not very helpful in this case.

auvipy commented 4 years ago

Try the latest Celery (4.4.2+) and report back.