Closed: rthille closed this issue 5 years ago.
Can you verify on top of master and send a PR with proper tests?
I can try. Not sure the best way to mock out the connection and get it to fail at the right part of the loop. I tried testing with a real celery app and SQS and using tcpkill to kill the connection and it didn't result in the issue. If you've got any pointers to get started (other than the Contributing.rst doc I've found), that'd be helpful as this is my first time looking at the internals of Celery.
I think the actual problem here is that HTTP status codes 4XX and 5XX are unhandled in kombu. In this case 599 can be translated to a malformed response or a request timeout, but it's quite common to get 500 or 503 from the AWS API, or 403 in case of a non-existent resource or missing IAM permissions.
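For illustration, a rough sketch (not kombu's actual code; the groupings below are assumptions) of how these statuses could be split into retryable and fatal classes:

```python
# Rough sketch, not kombu code: split AWS/SQS HTTP statuses into retryable and
# fatal groups. The exact groupings here are assumptions for illustration.
RETRYABLE = {500, 503, 599}   # server errors, throttling, timeouts / malformed responses
FATAL = {403, 404}            # missing queue or insufficient IAM permissions

def classify(status):
    """Return how a response with this HTTP status should be treated."""
    if 200 <= status < 300:
        return 'ok'
    if status in FATAL:
        return 'fatal'
    if status in RETRYABLE or status >= 500:
        return 'retry'
    return 'fatal'
```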
I applied this patch and ran a load test (the easiest way to trigger 5XX responses from the AWS API), but it didn't help much. Instead of a generic exception I now get ConnectionError, but the final outcome is still "Unrecoverable error".
EDIT:
Well, setting broker_connection_retry to True (my bad) definitely helped with the "Unrecoverable error", but now the worker returns amqp.exceptions.ConnectionError: Request Empty body HTTP 599 Unknown SSL protocol error in connection to eu-west-1.queue.amazonaws.com:443 (None) and hangs.
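For reference, the relevant settings in a Celery 4.x configuration module (lowercase setting names; the max-retries value here is an assumption) look roughly like this:

```python
# Celery config sketch (assumed values): let the worker retry the broker
# connection instead of bailing out with "Unrecoverable error".
broker_connection_retry = True        # retry establishing the broker connection
broker_connection_max_retries = None  # None means keep retrying without a limit
```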
EDIT2: Alright, so all non-200 HTTP responses land here: https://github.com/celery/kombu/blob/110dc10cbce82eb5f2402ae717d45b2a2100e634/kombu/asynchronous/aws/connection.py#L255 @auvipy, could you please suggest how this should be handled?
Any news on this issue?
This seems to be fixed in 4.2.
I am still facing the issue using kombu 4.2.0.
The error is still there (4.2.0).
Any idea which stable release will include this fix?
4.2.1 "fixed" it in that now the worker dies when the connection is lost, so it may be restarted more easily.
I'm using celery==4.3.0 and kombu==4.5.0 but it died with the following error and didn't restart on its own.
[2019-09-07 17:11:25,492: CRITICAL/MainProcess] Unrecoverable error: Exception('Request HTTP Error HTTP 503 Service Unavailable (b\'<?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Receiver</Type><Code>ServiceUnavailable</Code><Detail/></Error><RequestId>1e3c6057-164e-5afd-9e5c-35baf0b31d67</RequestId></ErrorResponse>\')',)
It was raised from kombu:
File "/home/....../lib/python3.6/site-packages/kombu/asynchronous/aws/connection.py", line 245, in _on_list_ready
    raise self._for_status(response, response.read())
Does this still happen with the latest kombu release?
Currently I have kombu==4.5.0 installed. Should I try 4.6.4?
Please try it and let us know.
It's been 6 days using version 4.6.4 and it's still working. Will report if this happens again :+1:
For anyone hitting this with Django & Celery, make sure you declare kombu and pycurl as dependencies. More info here.
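For example, a hypothetical setup.py sketch (the project name and version pins are placeholders) that declares them explicitly:

```python
# Hypothetical sketch: declare the SQS transport's dependencies explicitly so
# they are present in the deployed environment, not just pulled in implicitly.
from setuptools import setup

setup(
    name="myproject",          # placeholder project name
    install_requires=[
        "celery[sqs]",         # the sqs extra pulls in the SQS transport pieces
        "kombu>=4.6.4",        # version where reporters above saw the issue resolved
        "pycurl",              # used by kombu's async SQS HTTP client
    ],
)
```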
Filing this in Kombu, because I think the fix should be here, not in Celery, but I started creating the issue in Celery, so I've attached the info requested for issues there.
Checklist
- I have included the output of celery -A proj report in the issue (if you are not able to do this, then at least specify the Celery version affected).
- I have verified that the issue exists against the master branch of Celery.
Steps to reproduce
Have a long-running Celery app with an SQS connection, using Celery 4.2.1 & Kombu 4.2.1. Have AWS SQS hit an issue and prematurely close the connection at the wrong time. Similar to these issues: https://github.com/celery/kombu/issues/796 and https://github.com/celery/celery/issues/3990 (which I commented on); however, the process did not appear to attempt to re-establish the connection.
Expected behavior
The Celery app handles the exception and either reconnects to SQS or exits (and is restarted by external systems).
Actual behavior
The exception was logged, but no further SQS messages were processed by the Celery app, and no further logs were produced until another developer kill -HUP'd the main worker process. Here's the traceback that was logged:
Attempting to reproduce the error with 'tcpkill' results in a different traceback, which makes it all the way up to the top level and results in a process exit:
It seems that Kombu's asynchronous/aws/connection.py:_for_status function should raise ConnectionError when response.status == 599, rather than a generic Exception, and hub.py should handle ConnectionError by raising. I've got a patch against the 4.2.1 tag, but I haven't been able to reliably reproduce the problem, so I'm not sure whether this is a good fix or not.
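Something roughly along these lines (a sketch of the idea only, not the actual kombu source or the attached patch; treating all 5xx as recoverable is my own assumption, and the hub.py half is omitted):

```python
# Sketch (not kombu's real code): map 599 / 5xx responses to a connection-level
# error so the transport's retry logic can kick in instead of a bare Exception.
from amqp.exceptions import ConnectionError as AMQPConnectionError

def _for_status(self, response, body):
    # In this sketch, a method on the async AWS connection class; the caller
    # raises whatever exception object is returned here.
    context = 'Empty body' if not body else 'HTTP Error'
    message = 'Request {0} HTTP {1} ({2})'.format(context, response.status, body)
    if response.status == 599 or response.status >= 500:
        return AMQPConnectionError(message)   # recoverable: let the worker reconnect
    return Exception(message)                 # everything else stays fatal
```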
Celery report output (company name and a few other things replaced with REDACTED):