Closed: rmoesbergen closed this issue 6 years ago
The broker.heartbeat option is set to 120 secs by default; does that not work? Please try setting it to 60 in /etc/baruwa/production.ini and restarting the baruwa service.
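For reference, a minimal sketch of what that change would look like; the [app:main] section name is an assumption based on a typical Pyramid-style production.ini, only the broker.heartbeat key, the value 60 and the file path come from the suggestion above.

```ini
; /etc/baruwa/production.ini (sketch; section name is assumed)
[app:main]
broker.heartbeat = 60
```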
I tried that at first, but it did not help. I could not find a way to check if that setting was actually picked up. Also: the setting only seems to work with the "pyamqp" transport, and celery is now configured to use the (deprecated) 'amqp' backend.
amqp is just an alias which uses either pyamqp or librabbitmq, so it should work: https://github.com/celery/kombu/issues/189
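If it helps, here is a rough way to check which transport the alias actually resolves to with the installed kombu; the broker URL is just a placeholder and kombu does not open a connection merely by constructing the object.

```python
from kombu import Connection

# Placeholder URL; nothing connects until the connection is actually used.
conn = Connection('amqp://guest:guest@localhost//')

# Prints the transport class the 'amqp' alias resolved to:
# librabbitmq if it is installed, otherwise py-amqp.
print(type(conn.transport))
```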
We will investigate this further but it may be hard to replicate on our side.
I think the issue is caused by our amqp package being too old.
Supporting option 2 would require a whole celery stack upgrade, which we would rather avoid. Please let us know how the TCP keepalive option works for you and we will consider implementing that instead.
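For completeness, a sketch of what option 2 could look like on a newer celery stack; BROKER_HEARTBEAT and BROKER_HEARTBEAT_CHECKRATE are standard celery 3.x setting names, and the URL and values below are placeholders rather than anything Baruwa-specific.

```python
# celeryconfig.py -- hypothetical sketch of AMQP heartbeats (option 2)
BROKER_URL = 'pyamqp://user:password@broker-host//'  # placeholder credentials/host
BROKER_HEARTBEAT = 60            # negotiate a 60 second AMQP heartbeat
BROKER_HEARTBEAT_CHECKRATE = 2   # check for missed heartbeats twice per interval
```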
I can confirm that the keepalive adjustments have worked. I tested logins on all clusters this morning, and they were working just fine and gave a nice green 'OK' in the top status bar.
Thanks, implemented.
Please rename /etc/sysctl.d/keepalive.conf to /etc/sysctl.d/rabbitmq.conf to avoid duplication when you upgrade.
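Something like this should do it; the rename itself does not change the running values, the sysctl -p just confirms the file still applies cleanly.

```sh
mv /etc/sysctl.d/keepalive.conf /etc/sysctl.d/rabbitmq.conf
sysctl -p /etc/sysctl.d/rabbitmq.conf
```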
Will do, thanks!
One more finding: we had one cluster which still didn't work after the keepalive adjustments. These were behind Palo Alto firewalls. Palo Altos offload SSL traffic to hardware, and when that happens the statistics for the connection/session (such as the idle timeout) are only updated after every 16th packet sent/received. Furthermore, the ssl application has an 1800 second session timeout (instead of the default 3600 seconds for other TCP traffic). With the keepalive kernel settings proposed above, the connection would still be killed. So I now use the settings below, and that seems to work even behind Palo Alto firewalls:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
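If I read the Palo Alto behaviour correctly, the arithmetic is: with the earlier 300 second keepalive time, the firewall's per-16-packet accounting would only refresh the session about every 16 × 300 = 4800 seconds, well past the 1800 second SSL session timeout, whereas with 60 second probes it refreshes roughly every 16 × 60 = 960 seconds, so the session stays alive.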
Thanks for the update, we have updated our manifests.
We have several clusters that have a stateful firewall between the frontend and the backend nodes. These firewalls keep a TCP session for about an hour after it goes idle. After that time, the firewall 'forgets' the session, and traffic for it is then dropped.

The celery daemon establishes a TCP connection to rabbitmq, and when users are logged in to the web interface it receives a 'get-system-status' message every now and then, keeping the connection alive. However, when no one uses the web interface for more than an hour, the celery connection is idle and the firewall times out the session. Subsequent MQ messages are queued but never received by celery, and celery doesn't seem to notice that the connection is dead. It even hangs: a normal kill -15 doesn't work, only kill -9 can get rid of the stale process. Once it is restarted, all queued messages are received and processed.
This results in the GUI displaying a red 'ERROR' in the status bar at the top, and in delays and errors when logging in. Also, configuration changes are not pushed to the frontend nodes.
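A rough way to observe this; these are standard rabbitmqctl and iproute2 commands rather than anything Baruwa-specific, and 5672 is the default AMQP port.

```sh
# Messages piling up in queues while celery is hung:
rabbitmqctl list_queues name messages consumers

# Connections rabbitmq still believes are open:
rabbitmqctl list_connections name state

# Timer/keepalive state of the AMQP sockets on the celery host:
ss -tno state established '( dport = :5672 or sport = :5672 )'
```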
I see two possible solutions (but there may be more, of course):

1. Enable TCP keepalives on the rabbitmq connections, so the firewall keeps seeing traffic and dead connections are torn down at the TCP level.
2. Enable AMQP heartbeats between celery and rabbitmq, so celery itself notices when the connection dies.
I think the second option is the nicest, but I was unable to get it working through configuration. So for now I have implemented option 1 to see if the problem goes away; I will report the results.
I've added this to /etc/rabbitmq/rabbitmq.config:
{tcp_listen_options, [{keepalive, true}]},
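In context the file ends up looking roughly like this; only the tcp_listen_options line is the actual change, the surrounding brackets are just the standard Erlang-terms layout of rabbitmq.config.

```erlang
[
  {rabbit, [
    {tcp_listen_options, [{keepalive, true}]}
  ]}
].
```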
Then created /etc/sysctl.d/keepalive.conf with:
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
and ran sysctl -p /etc/sysctl.d/keepalive.conf
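If I understand the knobs correctly, this makes the kernel probe an idle connection every 300 seconds, which is well within the firewall's roughly one hour idle timeout, and a peer that silently disappears is declared dead after at most about 300 + 3 × 60 = 480 seconds instead of hanging indefinitely.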