celery / kombu

Messaging library for Python.
http://kombu.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Latest working celery/redis cannot inspect: Error: No nodes replied within time constraint. #1087

Open jslusher opened 5 years ago

jslusher commented 5 years ago

Mandatory Debugging Information

Related Issues

Possible Duplicates

Environment & Settings

Celery version: 4.3.0

celery report Output:

```
software -> celery:4.3.0 (rhubarb) kombu:4.6.4 py:2.7.16
            billiard:3.6.1.0 redis:3.2.1
platform -> system:Linux arch:64bit
            kernel version:3.10.0-957.27.2.el7.x86_64 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:sentinel results:disabled

CELERY_QUEUES: (celery, fast, slow, mp-fast, mp-slow)
BROKER_TRANSPORT_OPTIONS: {
    'master_name': 'staging'}
BROKER_URL: u'sentinel://redis-s1.example.domain.com:26379//'
CELERY_ALWAYS_EAGER: False
CELERY_DISABLE_RATE_LIMITS: True
CELERY_ACCEPT_CONTENT: ['json']
CELERYD_MAX_TASKS_PER_CHILD: 2000
CELERY_IMPORTS: ('tasks',)
CELERY_EAGER_PROPAGATES_EXCEPTIONS: True
CELERY_STORE_ERRORS_EVEN_IF_IGNORED: True
CELERY_IGNORE_RESULT: True
CELERY_TASK_SERIALIZER: 'json'
```
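
For readers unfamiliar with the sentinel transport settings in the report above, they correspond roughly to the following app setup. This is only a hedged sketch: the module name `tasks` comes from `CELERY_IMPORTS` and the broker URL / master name from the report; nothing else is from the reporter's actual code.

```python
# Minimal sketch of the sentinel broker configuration implied by the report.
from celery import Celery

app = Celery(
    "tasks",
    broker="sentinel://redis-s1.example.domain.com:26379//",
)
# The sentinel transport needs to know which Redis master to follow.
app.conf.broker_transport_options = {"master_name": "staging"}
```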

Steps to Reproduce

Required Dependencies

Python Packages

pip freeze Output:

```
ABN==0.4.2
address==0.1.1
akismet==1.0.1
amqp==2.5.1
asn1crypto==0.24.0
attrs==19.1.0
Authlib==0.11
Authomatic==0.0.13
awesome-slugify==1.6.2
Babel==2.6.0
backports.functools-lru-cache==1.5
billiard==3.6.1.0
bleach==1.5.0
boto==2.38.0
cachetools==3.1.1
cas-client==1.0.0
celery==4.3.0
certifi==2017.7.27.1
cffi==1.12.3
chardet==3.0.4
click==6.7
configparser==3.8.1
contextlib2==0.5.5
coverage==4.5.4
cryptography==2.0.3
cssselect==0.9.2
cycler==0.10.0
datadog==0.11.0
ddtrace==0.25.0
decorator==4.4.0
dnspython==1.16.0
docopt==0.4.0
docutils==0.15.2
elasticsearch==6.3.1
enum34==1.1.6
filelock==3.0.12
funcsigs==1.0.2
future==0.17.1
google-auth==1.6.2
hiredis==0.2.0
html5lib==0.9999999
httplib2==0.13.1
idna==2.8
importlib-metadata==0.19
ipaddress==1.0.22
isodate==0.5.4
itsdangerous==0.24
Jinja2==2.7.1
kafka-python==1.4.6
kiwisolver==1.1.0
kombu==4.6.4
lmtpd==6.0.0
lockfile==0.12.2
loginpass==0.2.1
lxml==3.6.1
mandrill==1.0.57
Markdown==2.2.1
MarkupSafe==0.18
matplotlib==2.2.4
mock==1.0.1
more-itertools==5.0.0
mysqlclient==1.3.9
netaddr==0.7.19
numpy==1.16.4
oauth2==1.9.0.post1
packaging==19.1
passlib==1.6.1
pathlib2==2.3.4
paypalrestsdk==0.6.2
Pillow==2.8.1
pluggy==0.6.0
psutil==5.6.3
py==1.8.0
pyasn1==0.4.6
pyasn1-modules==0.2.6
PyBabel-json==0.2.0
pybreaker==0.5.0
pycountry==18.2.23
pycparser==2.19
pycryptodome==3.8.2
PyJWT==0.4.1
pylibmc==1.6.0
pyparsing==2.4.2
pytest==3.5.0
pytest-cov==2.4.0
python-daemon==2.1.2
python-dateutil==2.1
pytz==2014.4
PyYAML==3.12
raven==5.31.0
redis==3.2.1
regex==2018.11.3
requests==2.7.0
rsa==4.0
salmon-mail==3.0.0
scandir==1.10.0
simple-db-migrate==3.0.0
simplejson==3.10.0
six==1.11.0
SQLAlchemy==1.0.6
subprocess32==3.5.4
sudz==1.0.3
termcolor==1.1.0
toml==0.10.0
tox==3.13.2
Unidecode==0.4.21
urllib3==1.25.3
uWSGI==2.0.17.1
vine==1.3.0
virtualenv==16.7.2
Werkzeug==0.11.15
WTForms==1.0.5
zipp==0.5.2
```

Other Dependencies

N/A

Minimally Reproducible Test Case

```python ```

Expected Behavior

I expect celery -A app inspect ping (as well as other subcommands of celery inspect) to return output.

Actual Behavior

This configuration and version of celery/redis/sentinel had been working fine until just recently, and I'm not sure what might have changed. I'm guessing it has something to do with conflicting packages (given how many there are in this Python env 👀), but I'm not sure what else to check. I can verify, by looking at the keys in Redis and by using tcpdump, that celery is definitely able to reach the Redis servers via the sentinel brokers. The celery deployment is also serving tasks and otherwise seems to be working normally. Yet for some reason I can't run any of the inspect-style commands without getting Error: No nodes replied within time constraint.

The only thing I see in the debug logs is more proof that the celery workers are getting the message, yet still nothing comes back:

[2019-08-20 16:34:23,472: DEBUG/MainProcess] pidbox received method ping() [reply_to:{'routing_key': 'dbc97d66-fe94-3d6d-aa6a-bb965893ae2b', 'exchange': 'reply.celery.pidbox'} ticket:19949cbb-6bf0-4b36-89f7-d5851c0bddd0]

We also captured redis traffic using MONITOR and we can see that pings are being keyed and populated: https://gist.github.com/jslusher/3b24f7676c93f90cc55e1330f6e595d8
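
To poke at the failure outside the CLI, here is a hedged sketch of the same check done from a Python shell with a longer reply window than the CLI default. The import is hypothetical (the report's `CELERY_IMPORTS` mentions a `tasks` module; adjust to your project).

```python
# Hedged sketch: broadcast the same pidbox ping programmatically.
from tasks import app  # hypothetical import of the Celery application object

replies = app.control.ping(timeout=5)
print(replies)  # a healthy worker replies [{'celery@<hostname>': {'ok': 'pong'}}]

# The inspect API rides the same reply.celery.pidbox channel:
print(app.control.inspect(timeout=5).ping())
```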

auvipy commented 5 years ago

Did you check the comments on this issue? https://github.com/celery/celery/issues/4688

halfdan commented 5 years ago

I can confirm this issue. It seems to be a result of a recent kombu release - 4.6.3 is working while 4.6.4 is not. I'm going to keep digging.

auvipy commented 5 years ago

> I can confirm this issue. It seems to be a result of a recent kombu release - 4.6.3 is working while 4.6.4 is not. I'm going to keep digging.

should we move this issue to kombu repo then?

halfdan commented 5 years ago

For reference:

Working

software -> celery:4.3.0 (rhubarb) kombu:4.6.3 py:3.7.3
            billiard:3.6.1.0 redis:3.3.8
platform -> system:Darwin arch:64bit
            kernel version:18.6.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://localhost:6379/

Not working

software -> celery:4.3.0 (rhubarb) kombu:4.6.4 py:3.7.3
            billiard:3.6.1.0 redis:3.3.8
platform -> system:Darwin arch:64bit
            kernel version:18.6.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://localhost:6379/

halfdan commented 5 years ago

> should we move this issue to kombu repo then?

Sounds reasonable.

halfdan commented 5 years ago

https://github.com/celery/kombu/issues/1081 seems related

auvipy commented 5 years ago

With celery 4.3, don't use kombu 4.6. Could you try kombu 4.6.4 with celery 4.4.0rc3?

halfdan commented 5 years ago

@auvipy Happy to try, but if kombu 4.6 shouldn't be used with celery 4.3, I suggest updating https://github.com/celery/celery/blob/master/requirements/default.txt#L3, since it currently requires kombu>=4.6.4,<5.0 on master and kombu>=4.4.0,<5.0.0 in the Celery 4.3.0 release. Celery 4.3.0 is broken as a result. Can we do a patch release 4.3.1 where kombu is set to kombu>=4.4.0,<4.6.4?

halfdan commented 5 years ago

@auvipy This is also broken in 4.4.0rc3

software -> celery:4.4.0rc3 (cliffs) kombu:4.6.4 py:3.7.3
            billiard:3.6.1.0 redis:3.3.8
platform -> system:Darwin arch:64bit
            kernel version:18.6.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://localhost:6379/

auvipy commented 5 years ago

OK thanks for verifying

auvipy commented 5 years ago

Maybe https://github.com/celery/kombu/issues/1090 is a duplicate or a related issue.

yarinb commented 5 years ago

I see this issue is now closed. Is it resolved in master? What’s the PR that fixes the problem?

auvipy commented 5 years ago

try this https://github.com/celery/kombu/pull/1089

matteius commented 5 years ago

@yarinb Let me know how it goes. I provided proper unit test coverage, but I want feedback: I have extensive Celery/Kombu experience using AMQP and a decent Redis background, and I wanted to learn more about the kombu library by trying to fix this regression while keeping the original Redis API optimization intent from @auvipy.

auvipy commented 5 years ago

https://github.com/celery/kombu/issues/1091

jslusher commented 5 years ago

Thanks for looking into the issue!

I'm a little confused about how to proceed now that this issue is closed. Is there a patch version of kombu on the way? Should I wait for that, or should I lock my celery and kombu versions to something specific in my requirements.txt? I would rather not downgrade celery if I can help it. I'd also like to keep a version lock on kombu itself out of my requirements.txt if possible, especially if there's a patch on the horizon.

auvipy commented 5 years ago

Yes, kombu 4.6.5 is underway, along with celery 4.4.0rc4 / final.

jacobbridges commented 5 years ago

Looking forward to the update! This issue was driving me crazy.

matteius commented 5 years ago

@jslusher I would definitely recommend pinning all of the celery requirements, for example:

(note: these versions are old)

celery==4.2.1
kombu==4.2.2.post1
amqp==2.3.2
billiard==3.5.0.3

The reason to pin is so that you decide when to upgrade from stable versions, once you are ready to put time into monitoring and potentially troubleshooting any new issues. In past releases, versions of Celery, kombu, and py-amqp were often paired in ways that weren't always compatible, especially if you pin something like Celery but let kombu update freely.

I still have not heard 100% confirmation that the patch I did resolves these issues, but I would encourage you to pin kombu to GitHub master until a release including this patch is made on PyPI. If you can do this now, you can help verify that no more work is required.

Pip will let you specify this master branch in your requirements by replacing your kombu requirement with: git+https://github.com/celery/kombu.git
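
For illustration, a hypothetical requirements.txt fragment combining the pins discussed in this thread with the temporary master checkout. The versions are the ones reported above, not a recommendation:

```
# Hypothetical pins based on this thread: stay below the broken kombu 4.6.4
celery==4.3.0
billiard==3.6.1.0
amqp==2.5.1
kombu>=4.4.0,<4.6.4
# ...or, to help verify the unreleased fix, point pip at kombu master instead:
# git+https://github.com/celery/kombu.git
```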

AbrahamLopez10 commented 5 years ago

I was having this issue as well with kombu 4.6.4; I downgraded it to version 4.6.3 and it now works.

joshlsullivan commented 5 years ago

I had the same issue. It seems that when I installed celery, it also installed the development version of kombu, which is currently 4.6.5. I uninstalled kombu and downgraded to the stable version, which is 4.5.0. It's working now.

bgmiles commented 3 years ago

Not sure if I am having the same issue with celery==4.4.6 and kombu==4.6.11, or if I need to make a new ticket. I have a chord that randomly stops, and when I check with inspect I get this error. I'm using RabbitMQ as the broker and Redis as the backend.

erewok commented 3 years ago

Also seeing this issue with these versions:

I was thinking that the comment about it being fixed in latest kombu meant that the v5 versions would work?

lemig commented 3 years ago

Prematurely closed IMO. The issue is not solved by 4.6.5.

erewok commented 3 years ago

I am actually thinking there's a celery change that is involved in this. I don't think it's a kombu issue (at least in my case). For instance, I am looking at two applications with the following versions:

App1: Works

* celery 4.4.6
* kombu 4.6.11
* redis 6.0.9

App2: Does not work

* celery 4.4.7
* kombu 4.6.11
* redis 6.0.9

Notably, the major difference above is the minor version change in celery, but they're using the same kombu version. In addition, App2 used to work before we bumped our redis version. In fact, it still works with redis 5.0.7.

I've been trying to tweak settings and see if I can figure out what will make it ping again in 4.4.7:

WORKING APP1 (celery 4.4.6):

>>> my_celery_app.control.ping()
[{'celery@my-celery-app-6d8c66b688-p5tf2': {'ok': 'pong'}}]

Non-WORKING APP2 (celery 4.4.7):

>>> my_celery_app.control.ping()
[]

auvipy commented 3 years ago

It was closed by a PR. That PR was reverted later, but we didn't reopen this issue.

auvipy commented 3 years ago

> I am actually thinking there's a celery change that is involved in this. I don't think it's a kombu issue (at least in my case). For instance, I am looking at two applications with the following versions:
>
> App1: Works
>
> * celery 4.4.6
> * kombu 4.6.11
> * redis 6.0.9
>
> App2: Does not work
>
> * celery 4.4.7
> * kombu 4.6.11
> * redis 6.0.9
>
> Notably, the major difference above is the minor version change in celery, but they're using the same kombu version. In addition, App2 used to work before we bumped our redis version. In fact, it still works with redis 5.0.7.
>
> I've been trying to tweak settings and see if I can figure out what will make it ping again in 4.4.7:
>
> WORKING APP1 (celery 4.4.6):
>
> >>> my_celery_app.control.ping()
> [{'celery@my-celery-app-6d8c66b688-p5tf2': {'ok': 'pong'}}]
>
> Non-WORKING APP2 (celery 4.4.7):
>
> >>> my_celery_app.control.ping()
> []

You should try celery 5.0.5 or master with kombu. Did you find out which commit is the root cause of it not working in 4.4.7?

erewok commented 3 years ago

I see the same behavior with celery 5.0.5. It seems odd that the (4.4.7) stack I mentioned above works on redis 5.0.7 but not on redis 6.0.9, all other things being equal. Should I open a ticket for celery proper? I'm happy to keep investigating: I looked at the diff between 4.4.6 and 4.4.7 yesterday and didn't see any telltale sign.

erewok commented 3 years ago

I need to investigate more. I just tried a clean environment with git checkout v4.4.7 && pip install -e '.[redis]' and it worked with redis-server 6.0.10, so there's something else going on.

nanijnv1 commented 3 years ago

This happens if there are pending tasks in the queue from before the update; in that case you need to figure out how to migrate the queued tasks to the newer version.

If those tasks are not important, just purge the queue and start celery.
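
If the queued tasks really are disposable, here is a hedged sketch of that purge step, assuming the Celery application object is importable as `app` from a hypothetical `tasks` module:

```python
# Hedged sketch: discard everything currently sitting in the configured queues.
# Only do this if the pending tasks are safe to lose.
from tasks import app  # hypothetical import

discarded = app.control.purge()  # returns the number of messages removed
print("purged %d messages" % discarded)
```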

Bi0max commented 2 years ago

Hello, is this issue going to be solved in the next version of Celery? I'm still using Celery==4.3.0 and kombu == 4.6.3 to avoid this error.

auvipy commented 2 years ago

Can you try kombu 5.2 and check which change creates this issue?

lyf2000 commented 2 years ago

The commands sometimes give Error: No nodes replied within time constraint. / [], and sometimes return proper replies such as * {'id': '...', ...} / [{'celery@b9a47f2b2668': {'ok': 'pong'}}, ...].

darkhipo commented 2 years ago

We are getting the same issue.

* kombu==4.6.11

* celery==4.4.6

* redis_version:5.0.6

We get Error: No nodes replied within time constraint. every time. We deploy the same packages on some different envs and do not get this problem; it pops up for us only in our prod environment, where we have many scheduled tasks.

We were able to resolve this issue by deleting the key _kombu.binding.reply.celery.pidbox from our Redis. It had been left over from a previous deployment using celery 5, and its presence was causing many of the celery remote inspect diagnostic commands (as well as other commands) to fail.

With the above version configuration, the PING and other inspect commands were functioning as expected and desired once we kept ONLY the necessary kombu-related keys in our Redis backend.
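
A hedged sketch of the cleanup described above, assuming a plain (non-sentinel) Redis broker on localhost and that the stale key really is safe to delete in your deployment:

```python
# Inspect and remove the leftover pidbox binding key described above.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# See which kombu binding keys exist before touching anything.
for key in r.keys("_kombu.binding.*"):
    print(key, r.type(key))

# Drop the stale reply binding left behind by the earlier celery 5 deployment.
r.delete("_kombu.binding.reply.celery.pidbox")
```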

jokiefer commented 10 months ago

We are getting an equivalent issue with the following Python package versions:

celery==5.3.6
redis==5.0.1
kombu==5.3.4

The Redis broker backend is a 7.2.3-alpine Docker container.

If I start my worker it works fine, but after a while the worker simply stops working and no longer responds to celery -A MrMap status. strace also does not print any errors. The process is simply stuck and no longer does anything. No error message, nothing.

In my case there are roughly 400,000 scheduled tasks.

How can I debug this? I tried downgrading all the Python packages and the Redis server version, and also migrating from Redis to RabbitMQ, but nothing worked for me. Deleting the _kombu.binding.reply.celery.pidbox key also does not fix my problem.

rbehal commented 5 months ago

Same issue:

celery[redis]==5.2.3
redis==5.0.4
kombu==5.3.7

Locally everything works perfectly fine. When deploying with k8s and viewing the celery logs, all looks good:

[2024-05-01 04:28:18,881: INFO/MainProcess] Connected to redis://redis:6379/0
[2024-05-01 04:28:18,890: INFO/MainProcess] mingle: searching for neighbors
[2024-05-01 04:28:19,904: INFO/MainProcess] mingle: all alone
[2024-05-01 04:28:19,923: INFO/MainProcess] celery@pl-execution-service-676748ff94-bfnwl ready.

However, the readiness/liveness probes using the inspect command are failing, and when exec'ing into the container and running the inspect commands manually, they fail as well.

celery inspect ping -d celery@$(hostname)
celery -A app inspect active

both return

Error: No nodes replied within time constraint

Adding a timeout doesn't help.

I've checked that the hostname resolves correctly.

Edit: I found the issue, and it’s unrelated to this bug, but still commenting here in case others find this post like I did. The problem was in my production environment I had 3 replicas of Redis, however I had not configured Redis properly to act as a cluster. As such, whenever I sent the inspect commands, there was no consistency whether it was relayed through the correct instance of Redis for any given Celery node. Likewise, even if it was received and the worker sent a reply back, there was no consistency that this reply would be sent to the right place. However, task execution worked pretty much every time since it didn’t matter which worker picked up or which redis instance it was pushed to, just that it was. I scaled down to 1 redis replica and removed the HPA and that fixed the issue. For the future, I will properly configure Redis to act as a cluster when I begin to add more pods.