Closed: rmoesbergen closed this issue 6 years ago
The broker.heartbeat option is set to 120 secs by default; does that not work? Please try setting it to 60 in /etc/baruwa/production.ini and restarting the baruwa service.
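For reference, a minimal sketch of what that change would look like; the [app:main] section name is an assumption based on a typical Pyramid-style production.ini, only the broker.heartbeat key, the value 60 and the file path come from the suggestion above.

```ini
; /etc/baruwa/production.ini (sketch; section name is assumed)
[app:main]
broker.heartbeat = 60
```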
I tried that at first, but it did not help. I could not find a way to check if that setting was actually picked up. Also: the setting only seems to work with the "pyamqp" transport, and celery is now configured to use the (deprecated) 'amqp' backend.
amqp is just an alias which uses either pyamqp or librabbitmq, so it should work: https://github.com/celery/kombu/issues/189
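If it helps, here is a rough way to check which transport the alias actually resolves to with the installed kombu; the broker URL is just a placeholder and kombu does not open a connection merely by constructing the object.

```python
from kombu import Connection

# Placeholder URL; nothing connects until the connection is actually used.
conn = Connection('amqp://guest:guest@localhost//')

# Prints the transport class the 'amqp' alias resolved to:
# librabbitmq if it is installed, otherwise py-amqp.
print(type(conn.transport))
```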
We will investigate this further but it may be hard to replicate on our side.
I think the issue is caused by our amqp package being too old.
Supporting option 2 would require a whole celery stack upgrade, which we would rather avoid. Please let us know how the TCP keepalive option works for you and we will consider implementing that instead.
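For completeness, a sketch of what option 2 could look like on a newer celery stack; BROKER_HEARTBEAT and BROKER_HEARTBEAT_CHECKRATE are standard celery 3.x setting names, and the URL and values below are placeholders rather than anything Baruwa-specific.

```python
# celeryconfig.py -- hypothetical sketch of AMQP heartbeats (option 2)
BROKER_URL = 'pyamqp://user:password@broker-host//'  # placeholder credentials/host
BROKER_HEARTBEAT = 60            # negotiate a 60 second AMQP heartbeat
BROKER_HEARTBEAT_CHECKRATE = 2   # check for missed heartbeats twice per interval
```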
I can confirm that the keepalive adjustments have worked. I tested logins on all clusters this morning, and they were working just fine and gave a nice green 'OK' in the top status bar.
Thanks, implemented.
Please rename /etc/sysctl.d/keepalive.conf to /etc/sysctl.d/rabbitmq.conf to avoid duplication when you upgrade.
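Something like this should do it; the rename itself does not change the running values, the sysctl -p just confirms the file still applies cleanly.

```sh
mv /etc/sysctl.d/keepalive.conf /etc/sysctl.d/rabbitmq.conf
sysctl -p /etc/sysctl.d/rabbitmq.conf
```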
Will do, thanks!
One more finding: we had one cluster which still didn't work after the keepalive adjustments. These were behind Palo Alto firewalls. Palo Altos offload SSL traffic to hardware, and when that happens the statistics for the connection/session (such as the idle timeout) are only updated after every 16th packet sent/received. Furthermore, the ssl application has an 1800 second session timeout (instead of the default 3600 seconds for other TCP traffic). With the keepalive kernel settings proposed above, the connection would still be killed. So I now use the settings below, and that seems to work even behind Palo Alto firewalls:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
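If I read the Palo Alto behaviour correctly, the arithmetic is: with the earlier 300 second keepalive time, the firewall's per-16-packet accounting would only refresh the session about every 16 × 300 = 4800 seconds, well past the 1800 second SSL session timeout, whereas with 60 second probes it refreshes roughly every 16 × 60 = 960 seconds, so the session stays alive.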
Thanks for the update, we have updated our manifests.
We have several clusters that have a stateful firewall between the frontend and the backend nodes. These firewalls keep a TCP session for about an hour after it goes idle. After that time, the firewall 'forgets' the session, and traffic for it is then dropped.

The celery daemon establishes a TCP connection to rabbitmq, and when users are logged in to the web interface it receives a 'get-system-status' message every now and then, keeping the connection alive. However, when no one uses the web interface for more than an hour, the celery connection is idle and the firewall times out the session. Subsequent MQ messages are queued but never received by celery, and celery doesn't seem to notice that the connection is dead. It even hangs: a normal kill -15 doesn't work, only kill -9 can get rid of the stale process. Once it is restarted, all queued messages are received and processed.
This results in the GUI displaying a red 'ERROR' in the status bar at the top, and in delays and errors when logging in. Also, configuration changes are not pushed to the frontend nodes.
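A rough way to observe this; these are standard rabbitmqctl and iproute2 commands rather than anything Baruwa-specific, and 5672 is the default AMQP port.

```sh
# Messages piling up in queues while celery is hung:
rabbitmqctl list_queues name messages consumers

# Connections rabbitmq still believes are open:
rabbitmqctl list_connections name state

# Timer/keepalive state of the AMQP sockets on the celery host:
ss -tno state established '( dport = :5672 or sport = :5672 )'
```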
I see two possible solutions (but there may be more, of course):

1. Enable TCP keepalives on the rabbitmq connections, so the firewall keeps seeing traffic and dead connections are torn down at the TCP level.
2. Enable AMQP heartbeats between celery and rabbitmq, so celery itself notices when the connection dies.
I think the second option is the nicest, but I was unable to get it working through configuration. So for now I have implemented option 1 to see if the problem goes away; I will report the results.
I've added this to /etc/rabbitmq/rabbitmq.config:
{tcp_listen_options, [{keepalive, true}]},
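In context the file ends up looking roughly like this; only the tcp_listen_options line is the actual change, the surrounding brackets are just the standard Erlang-terms layout of rabbitmq.config.

```erlang
[
  {rabbit, [
    {tcp_listen_options, [{keepalive, true}]}
  ]}
].
```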
Then created /etc/sysctl.d/keepalive.conf with:
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
and ran sysctl -p /etc/sysctl.d/keepalive.conf
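If I understand the knobs correctly, this makes the kernel probe an idle connection every 300 seconds, which is well within the firewall's roughly one hour idle timeout, and a peer that silently disappears is declared dead after at most about 300 + 3 × 60 = 480 seconds instead of hanging indefinitely.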