Same issue here, is there any workaround for this? Celery workers will freeze after that and we need a restart.
This can happen if you set Redis to LRU mode or something similar. Please configure Redis correctly and increase the memory of your Redis instance.
Feel free to comment if you find this is still an issue with Kombu.
What do you mean by 'configure Redis correctly'?
I have the same problem in a Flask app with the following config.py:
import os

# redis
REDIS_URL = os.environ['REDIS_URL']

# flask-caching
CACHE_TYPE = 'redis'
CACHE_KEY_PREFIX = 'glue_flask_cache_'
CACHE_REDIS_URL = REDIS_URL
Check your redis.conf, specifically maxmemory-policy. If it's set to noeviction, or does not have a value, then we may have a problem in Celery.
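If it's easier than digging through redis.conf, the same settings can be read from the running server with redis-py; a minimal sketch (the connection URL is a placeholder for your broker):

# Minimal sketch (not from this thread): read the eviction policy and client
# timeout from the running Redis server. Replace the URL with your broker URL.
import redis

r = redis.Redis.from_url("redis://localhost:6379/0", decode_responses=True)

policy = r.config_get("maxmemory-policy")["maxmemory-policy"]
timeout = r.config_get("timeout")["timeout"]
print(f"maxmemory-policy={policy} timeout={timeout}")

# If this already prints 'noeviction' with timeout 0 and the error still
# appears, Redis-side eviction is probably not the culprit.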
Faced this same issue on the first queue whenever I started a second or more queues. Fixed by downgrading from kombu==4.6.5 to kombu==4.5.0. It had nothing to do with Redis, just the missing key _kombu.binding.reply.celery.pidbox, which is never created if you watch with redis-cli monitor.
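For anyone wanting to confirm the same thing on their broker, a small redis-py sketch that checks whether the binding key exists and what it contains (host/DB are placeholders):

# Sketch: check whether the pidbox reply binding key exists at all.
# The connection URL is a placeholder for your actual broker database.
import redis

r = redis.Redis.from_url("redis://localhost:6379/0", decode_responses=True)

key = "_kombu.binding.reply.celery.pidbox"
if r.exists(key):
    print(f"{key} ({r.type(key)}): {r.smembers(key)}")
else:
    print(f"{key} is missing -- pidbox replies cannot be routed")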
I found the same issue, @danleyb2, did you figure out what the problem was with the current version?
Update: Downgrading to v4.5.0 solved the issue. Thanks @danleyb2
This is present in celery integration Redis tests as well!
I noticed @auvipy, any plans on fixing it? Do you need any help?
yes if you have time!
I was having this problem with kombu 4.5.0 when using Celery as a service in a docker-compose pod that included a Redis server image and a few app images. When I used up -d <serviceName> and started services individually, starting with Redis, the error would show up in the logs repeatedly. When I used up -d without a service name, the problem seemed to go away.
Edit: the version I named is likely incorrect. Our project's setup.py was missing a comma between version ranges, so it was applying whatever version came out of the concatenation of the min and max versions, which at times could have been the affected package version.
Looks like the reason is #1087. The bug showed up last week, after the 4.6.4 -> 4.6.5 migration.
@auvipy could you point to the failing integration test, please? I couldn't reproduce the bug locally, so I just pinned the version to 4.6.4 blindly.
Thank you, pinning to 4.6.4 works!
Had the same issue. I fixed it by downgrading kombu from 4.6.5 to 4.6.3; I still had the bug in version 4.6.4.
Same issue here:
celery==4.3.0
redis==3.2.1
kombu==4.6.3  # downgrade meant for a flower issue https://github.com/mher/flower/issues/909
I found the error starts to occur when a worker is recreated (e.g. k8s pod scaling) and then affects all the other workers. The worker has additional settings: concurrency (prefork) and max-memory-per-child.
kombu==4.6.3 fixed it for me -- had the same issue with the Celery worker crashing.
what about kombu==4.6.4?
Downgrading from 4.6.5 to 4.6.4 worked for us @auvipy when using celery 4.4.0rc3 (with https://github.com/celery/celery/commit/8e34a67bdb95009df759d45c7c0d725c9c46e0f4 cherry picked on top to address a different issue)
@auvipy why was this closed?
Isn't it fixed with 4.6.6?
@auvipy it was impossible to tell from this thread (I follow every comment). Thanks!
@killthekitten It seems to be fixed, last month we stopped freezing kombu and it seems to be working with 4.6.6.
We use it with celery btw.
I had the same issue.
redis==3.2.1 celery==4.4.0 kombu==4.6.7
@Jison I got redis==3.3.11 over here, not sure if that's the cause of the issue, but it won't hurt to upgrade it.
I got this:
InconsistencyError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
File "kombu/connection.py", line 439, in _reraise_as_library_errors yield File "kombu/connection.py", line 518, in _ensured return fun(*args, kwargs) File "kombu/messaging.py", line 203, in _publish mandatory=mandatory, immediate=immediate, File "kombu/transport/virtual/base.py", line 605, in basic_publish message, exchange, routing_key, kwargs File "kombu/transport/virtual/exchange.py", line 70, in deliver for queue in _lookup(exchange, routing_key): File "kombu/transport/virtual/base.py", line 714, in _lookup self.get_table(exchange), File "kombu/transport/redis.py", line 839, in get_table raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key)) OperationalError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
File "celery/worker/pidbox.py", line 46, in on_message self.node.handle_message(body, message) File "kombu/pidbox.py", line 145, in handle_message return self.dispatch(body) File "kombu/pidbox.py", line 115, in dispatch ticket=ticket) File "kombu/pidbox.py", line 151, in reply serializer=self.mailbox.serializer) File "kombu/pidbox.py", line 285, in _publish_reply opts File "kombu/messaging.py", line 181, in publish exchange_name, declare, File "kombu/connection.py", line 551, in _ensured errback and errback(exc, 0) File "python3.6/contextlib.py", line 99, in exit self.gen.throw(type, value, traceback) File "kombu/connection.py", line 444, in _reraise_as_library_errors sys.exc_info()[2]) File "vine/five.py", line 194, in reraise raise value.with_traceback(tb) File "kombu/connection.py", line 439, in _reraise_as_library_errors yield File "kombu/connection.py", line 518, in _ensured return fun(*args, kwargs) File "kombu/messaging.py", line 203, in _publish mandatory=mandatory, immediate=immediate, File "kombu/transport/virtual/base.py", line 605, in basic_publish message, exchange, routing_key, kwargs File "kombu/transport/virtual/exchange.py", line 70, in deliver for queue in _lookup(exchange, routing_key): File "kombu/transport/virtual/base.py", line 714, in _lookup self.get_table(exchange), File "kombu/transport/redis.py", line 839, in get_table raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key))
I'm still seeing this issue with 4.6.7.
celery==4.4.0 hiredis==1.0.1 kombu==4.6.7 redis==3.4.1
Edit: I've ensured timeout is 0 and the memory policy is noeviction. I've also set my workers with --without-heartbeat --without-mingle --without-gossip and we're still seeing the errors. The only thing that comes to mind is that if that particular set is empty, the key gets deleted regardless of settings, as per the Redis spec: https://redis.io/topics/data-types-intro#automatic-creation-and-removal-of-keys
We have also seen this with celery==4.4.0 kombu==4.6.7 redis==3.4.1 and with kombu==4.5.0 celery==4.3.0 redis==3.2.1.
Our experience has been that this runs successfully for a period of time (anywhere from ~6 days to 28 days) before a worker fails out and stops consuming tasks. We've ruled out the usual configuration issues: timeout is 0 and the memory policy is allkeys-lru.
Today, I was inspecting the "_kombu.binding.reply.celery.pidbox" key and noticed it is transient: it only exists in Redis while workers are processing tasks. When it exists, I observe it has no expiration and is a set:
> TTL "_kombu.binding.reply.celery.pidbox"
(integer) -1
> TYPE "_kombu.binding.reply.celery.pidbox"
set
This would suggest that the key is explicitly created and deleted OR, as @staticfox noted, the set is losing all members and being deleted by Redis, but Celery expects it to exist.
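To illustrate the second possibility, a tiny sketch of Redis' automatic key removal, which is exactly what the linked docs describe (throwaway key on a scratch database, nothing from our setup):

# Demo: a Redis set key disappears as soon as its last member is removed,
# which would leave Celery expecting a binding key that no longer exists.
# Uses a throwaway key on a scratch DB -- adjust the URL before running.
import redis

r = redis.Redis.from_url("redis://localhost:6379/15", decode_responses=True)

key = "demo.binding.set"
r.sadd(key, "only-member")
print(r.exists(key), r.ttl(key))   # 1 -1 -> key present, no expiration

r.srem(key, "only-member")
print(r.exists(key))               # 0 -> Redis deleted the now-empty set itself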
I also found this old issue log, https://github.com/celery/kombu/issues/226, which pointed to fanout_prefix and fanout_patterns in broker_transport_options. I believe this only affects shared Redis instances hosting multiple Celery apps (we are the only tenant on ours)? Neither appears to be set in our app when initializing via celery.config_from_object:
print(celery_ctx.celery.conf.humanize(with_defaults=True))
...
broker_transport_options: {
}
...
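For reference, if someone did want to try those transport options anyway, a minimal sketch of how they could be set (the app name and broker URL are made up, not from our setup):

# Sketch of enabling the transport options mentioned in kombu#226.
# "myapp" and the broker URL are placeholders.
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")

app.conf.broker_transport_options = {
    # Namespace fanout exchanges (such as the pidbox broadcasts) per app;
    # mainly relevant when several Celery apps share one Redis database.
    "fanout_prefix": True,
    "fanout_patterns": True,
}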
@auvipy - should this be re-opened based on recent reports?
I am reopening, but can you try the latest celery==4.4.2 and reproduce this again?
Same issue here after bumping to celery 4.4.2. EDIT: maybe not? EDIT 2: Nope, still same issue.
We're deploying celery==4.4.2 and kombu==4.6.8 today, but I don't expect this will manifest right away (for us, it's not reliably reproducible and usually takes some time).
Yeah, we're still seeing this pretty regularly with everything updated. I put this together during a lunch break, let me know if it helps or if I can provide any additional information or testing.
https://gist.github.com/staticfox/ee78380ff131487e0cc8175cc785330f
This has reproduced twice since deploying celery 4.4.2 and kombu 4.6.8 for us. I'll update here if I find more information.
Is it a kombu issue or your Redis conf? Can you dig deeper?
There's nothing odd in our redis conf based on everything I've reviewed from this thread and others: timeout is 0 and memory policy is allkeys-lru. Although we have an LRU policy, we never come close to our peak memory capacity so the LRU policy shouldn't be invoked.
I'm assuming this is a kombu issue since the exception trace originates from kombu, but I have no evidence beyond that:
kombu.exceptions.OperationalError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
We upgraded from Celery 3.1.25 to Celery 4.3, kombu 4.6.3 in December 2019 and noticed this error manifest 28 days after the upgrade.
We downgraded to Celery 4.2.1, kombu 4.5.0 and redis 3.2 and had this manifest multiple times.
We recently upgraded to Celery 4.4.0 and later to Celery 4.4.2, and each time this occurred several more times.
We do use autoscaling, which various issue logs have said is pseudo-deprecated in Celery 4.x (maybe coming back in 4.5/4.6/5.x). This OperationalError exception tends to occur during peak periods when autoscaling scales us up, but this isn't always the case.
Other than autoscaling, our configuration is fairly basic: 3 workers for ad-hoc jobs with --autoscale=25,5 and 3 workers processing periodic, scheduled jobs with --autoscale=5,1 (6 total worker nodes), with low utilization outside of a few daily spikes.
I'll continue investigating for patterns or anomalies.
I don't know if this helps, but we encounter the same issue with celery 4.4.0, kombu 4.6.3 or 4.6.7, and redis 3.4.1. We use autoscaling, too.
We have the same issue. No autoscaling, but a long-running task in a Docker container. It does not occur for the actual task, but for the Docker health check command (celery -A worker inspect ping -d celery@celery_host -t 15).
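In case it helps with reproducing, the same check can also be issued from Python so the reply (or its absence) can be logged; a minimal sketch, assuming the worker name quoted above (app name and broker URL are placeholders):

# Same liveness check as the CLI health check, from Python, so the reply
# (or lack of one) can be logged. App name and broker URL are placeholders.
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")

# Roughly equivalent to: celery inspect ping -d celery@celery_host -t 15
replies = app.control.ping(destination=["celery@celery_host"], timeout=15)
print(replies or "no reply -- worker stuck or pidbox binding key missing?")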
It seems there are many bugs with the Redis broker; is RabbitMQ a better choice than Redis?
As a broker, RabbitMQ is definitely better than Redis in most cases!
I'm wondering if https://github.com/celery/celery/issues/6009 could inadvertently be shining light on this particular issue... We're seeing memory ballooning when we use inspect ping, so perhaps the worker stalling could cause the reply to lag, to the point where another thread might have already replied and the first thread replies later, after the key has been removed (relating to Redis deleting empty keys). I'm not too familiar with Celery's inspect internals, but I'm starting to think this particular issue is only happening as a side effect of the memory leak. Could anyone else chime in if they are noticing a large consumption of memory prior to workers freezing?
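For anyone who wants to gather that data, one rough way is to log each Celery process's RSS periodically and see whether it balloons before a freeze; a minimal sketch using psutil (the process match is an assumption about your setup):

# Rough sketch: periodically log the RSS of every process whose command line
# mentions "celery". Assumes psutil is installed; adjust the match and the
# interval for your own workers.
import time
import psutil

while True:
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "celery" in cmdline:
            rss_mb = proc.info["memory_info"].rss / 1024 / 1024
            print(f"{proc.info['pid']}: {rss_mb:.1f} MiB")
    time.sleep(60)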
I started noticing this error after upgrading Celery to 4.4.2 and kombu to 4.6.8. I read through most of the suggestions in this thread to downgrade Kombu to previous versions but that did not work for me.
What eventually ended up working for me was upgrading the Redis server from version 3.2.11 to 5.0.8. Since the upgrade, I have not seen this error again and my Celery worker systemd service no longer goes into a failed state.
:D
We have upgraded to 5.0.6 as well and we're still seeing this issue... @hsabiu can you clarify what was changed between Redis versions that caused the problem to go away? @auvipy closed the issue, so I must be missing something here.
@staticfox I'm not sure what changed between Redis versions; I'm merely stating what worked in my case. I tried downgrading to previous versions of Celery and Kombu, but that didn't seem to fix the issue. Bumping Redis to 5.0.8 with Celery 4.4.2 and Kombu 4.6.8 is what worked for me.
I have seen it with Redis at 5.0.8 as well. Sometimes it is there, but other times it works. Typically, once I started to investigate the issue, it would not come up in my dev environment.
@hsabiu that's understandable. @auvipy, could you elaborate on why you believe upgrading Redis resolves the issue when others have stated that they are still facing it after upgrading to latest Redis? You closed this issue, so I'm simply still trying to find the resolution that you have found.
I've been running Redis server 5.0.2 with Celery 3.1.25, then upgraded to Celery 4.3.0, 4.4.0 and 4.4.2, and experienced this error on each 4.x release. Similar to @the01, this issue doesn't reproduce reliably.
Unfortunately, I can't upgrade the Redis server version we use, but I would be surprised if a patch update resolved this, especially since we did not encounter it with Celery 3.x.
You need to find out what your actual problem is.
I agree that it's confusing that this issue is closed, although no reliable solution has been proposed and this is still manifesting. We're seeing this in Redis 5.0.3 and Celery 4.3.0, but it seems that the specific versions are not very helpful in this case.
Try the latest celery==4.4.2+ and report again.
The issue is that the Redis key keeps getting evicted. I read an old issue link and have confirmed that my Redis instance is not hacked; in fact, we are using a secured Redis.
OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)
kombu==4.5.0 celery==4.3.0 redis==3.2.1
Is this some issue with redis?