Closed: nabadger closed this issue 4 years ago
Another option here would be to introduce a health check, though I'm not sure exactly what that would look like. If we had such a check, Kubernetes or an external system could use it to decide whether it needs to restart the pod/container.
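For illustration, something like the following might work: the exporter records the timestamp of the last event it processed and exposes a tiny liveness endpoint that reports unhealthy once that timestamp goes stale. The names, port and threshold below are made up; this is just a sketch, not the exporter's actual code:

```python
# Hypothetical liveness check: report unhealthy if no Celery event has been
# observed recently, so Kubernetes (or anything else) can restart the container.
# LAST_EVENT_TS, the port and the staleness threshold are invented for this sketch.
import http.server
import threading
import time

LAST_EVENT_TS = time.time()   # the event-capture thread would update this
MAX_EVENT_AGE_SECONDS = 300   # arbitrary staleness threshold


class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        healthy = (time.time() - LAST_EVENT_TS) < MAX_EVENT_AGE_SECONDS
        self.send_response(200 if healthy else 500)
        self.end_headers()
        self.wfile.write(b"ok" if healthy else b"stale")


def serve_health(port=8081):
    # Serve the health endpoint from a background thread next to the exporter.
    server = http.server.HTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```

A Kubernetes livenessProbe pointing at that port would then restart the container whenever the exporter stops seeing events.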
We have the same problem. Both of our celery-exporter instances stopped capturing Celery metrics after a connection error. The celery-exporter process (inside a Docker container in our case) keeps running, so we have to restart the container manually to get metrics again.
Last log lines:
[2020-08-10 13:14:13,288] task-thread:ERROR: Connection failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 61, in _monitor
    recv.capture(limit=None, timeout=None, wakeup=True)
  File "/usr/local/lib/python3.6/site-packages/celery/events/receiver.py", line 93, in capture
    for _ in self.consume(limit=limit, timeout=timeout, wakeup=wakeup):
  File "/usr/local/lib/python3.6/site-packages/kombu/mixins.py", line 197, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 323, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
    return connection.drain_events(**kwargs)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
    (class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
After restarting the celery-exporter container (docker restart), the following additional log lines are visible via docker logs:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 61, in _monitor
    recv.capture(limit=None, timeout=None, wakeup=True)
  File "/usr/local/lib/python3.6/site-packages/celery/events/receiver.py", line 93, in capture
    for _ in self.consume(limit=limit, timeout=timeout, wakeup=wakeup):
  File "/usr/local/lib/python3.6/site-packages/kombu/mixins.py", line 197, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 323, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
    return connection.drain_events(**kwargs)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
    (class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 32, in run
    self._monitor()
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 64, in _monitor
    setup_metrics(self._app, self._namespace)
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 122, in setup_metrics
    for name, labels, cnt in metric.samples:
ValueError: too many values to unpack (expected 3)
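As an aside, the ValueError at the end of that traceback looks like a separate problem from the connection error: setup_metrics in celery_exporter unpacks each sample into exactly three values, but newer prometheus_client releases appear to return Sample namedtuples with additional timestamp/exemplar fields, so the unpacking fails. A sketch of an iteration that tolerates both shapes (just an illustration, not a patch against the actual code):

```python
# Version-tolerant iteration over Prometheus metric samples: older
# prometheus_client returned plain (name, labels, value) tuples, while newer
# releases return Sample namedtuples with extra timestamp/exemplar fields.
from prometheus_client import REGISTRY

for metric in REGISTRY.collect():
    for sample in metric.samples:
        name, labels, value = sample[:3]  # slicing works for both shapes
        print(name, labels, value)
```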
Last log lines from a second celery-exporter instance (in this case the container has not been restarted yet):
[2020-08-10 13:09:08,900] workers-thread:ERROR: Error while pinging workers
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 86, in update_workers_count
    len(self._app.control.ping(timeout=self.celery_ping_timeout_seconds))
  File "/usr/local/lib/python3.6/site-packages/celery/app/control.py", line 267, in ping
    timeout=timeout, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/celery/app/control.py", line 480, in broadcast
    limit, callback, channel=channel,
  File "/usr/local/lib/python3.6/site-packages/kombu/pidbox.py", line 346, in _broadcast
    matcher=matcher)
  File "/usr/local/lib/python3.6/site-packages/kombu/pidbox.py", line 302, in _publish
    maybe_declare(self.reply_queue(channel))
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 121, in maybe_declare
    return _maybe_declare(entity, channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 161, in _maybe_declare
    entity.declare(channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 608, in declare
    self._create_exchange(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 615, in _create_exchange
    self.exchange.declare(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 186, in declare
    nowait=nowait, passive=passive,
  File "/usr/local/lib/python3.6/site-packages/amqp/channel.py", line 614, in exchange_declare
    wait=None if nowait else spec.Exchange.DeclareOk,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 68, in send_method
    return self.wait(wait, returns_tuple=returns_tuple)
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 88, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
    (class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
To me it seems like some sort of deadlock.
Hi,
We run celery-exporter inside Kubernetes and communicate with a Redis broker.
We sometimes see failures like so:
It looks like the exporter hits some kind of issue and then enters the recovery loop:
https://github.com/OvalMoney/celery-exporter/blob/master/celery_exporter/monitor.py#L63
At this point the container remains up (and we can still communicate with it), but the metrics returned are always zero, so it no longer captures events after this issue.
The underlying issue is not that important and could be anything (e.g. a network issue, a pod restart, new rollouts, etc.).
I think there is an expectation of self-recovery, either within this app or at the Kubernetes level.
If this issue propagated up to the parent process and the main process died, that would actually be a good thing (at least for Kubernetes), since the pod would get restarted.
If I restart the pod manually (which is my only workaround at the moment), things work as expected.
My thinking is that https://github.com/OvalMoney/celery-exporter/blob/master/celery_exporter/monitor.py#L64 is not working as expected...
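For what it's worth, one way to get that fail-fast behaviour would be for the monitoring thread to take the whole process down when it hits an error it cannot recover from, so Kubernetes restarts the pod. A minimal sketch of the idea (the function names and error handling here are invented, not the exporter's actual code):

```python
# Sketch: exit the whole process when the event-monitoring thread dies,
# instead of looping forever, so the orchestrator can restart the pod.
import logging
import os
import threading


def monitor_events():
    # Placeholder for the real capture loop (recv.capture(...) in monitor.py).
    raise ConnectionError("broker forced connection closure")


def run_monitor():
    try:
        monitor_events()
    except Exception:
        logging.exception("monitor thread failed; exiting so the pod restarts")
        os._exit(1)  # exit immediately with non-zero status, bypassing other threads


threading.Thread(target=run_monitor, daemon=True).start()
```

The alternative is to keep the recovery loop and rely on a liveness probe, like the one sketched earlier in this thread, to trigger the restart.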