OvalMoney / celery-exporter

Prometheus metrics exporter for Celery

Task thread can fail and not recover resulting in zero value metrics #29

Closed: nabadger closed this issue 4 years ago

nabadger commented 4 years ago

Hi,

We run celery-exporter inside Kubernetes and communicate with a Redis broker.

We sometimes see failures like so:

[2020-05-21 13:21:01,747] root:INFO: Starting HTTPD on 0.0.0.0:9540                                       
[2020-05-21 13:21:02,842] task-thread:ERROR: Connection failed

It looks like the exporter hits some kind of issue and then enters the recovery loop:

https://github.com/OvalMoney/celery-exporter/blob/master/celery_exporter/monitor.py#L63

At this point the container remains up (and we can still communicate with it), but the metrics returned are always zero, so no events are captured after this point.

The underlying issue is not that important and could be anything (e.g. a network issue, pod restart, new rollouts, etc.).

I think there is an expectation of self-recovery here, either within this app or at the Kubernetes level.

If this issue were to propagate up to the parent process and kill the main process, that would actually be a good thing (at least for K8s), since the pod would get restarted.

If I restart the pod manually (which is my only workaround at the moment), things work as expected.

My thinking is that https://github.com/OvalMoney/celery-exporter/blob/master/celery_exporter/monitor.py#L64 is not working as expected...
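For illustration, here is a minimal sketch of a fail-fast variant of that loop, assuming the monitor thread keeps retrying the capture call the way _monitor() does today; MAX_RETRIES, run_capture_once and the os._exit call are all hypothetical, not part of the current code:

import logging
import os
import time

MAX_RETRIES = 5  # illustrative limit, not an existing setting

def monitor_fail_fast(run_capture_once):
    # run_capture_once() is assumed to block on recv.capture(...) and to
    # raise on broker connection errors, the way _monitor() does today.
    failures = 0
    while failures < MAX_RETRIES:
        try:
            run_capture_once()
            failures = 0  # a healthy capture session resets the counter
        except Exception:
            failures += 1
            logging.exception("Connection failed (%d/%d)", failures, MAX_RETRIES)
            time.sleep(min(2 ** failures, 60))
    # Propagate the failure to the whole process so Kubernetes restarts the
    # pod; sys.exit() from a non-main thread would only kill that thread.
    logging.error("Giving up after %d reconnect attempts", MAX_RETRIES)
    os._exit(1)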

nabadger commented 4 years ago

Another option here would be to introduce a health check, but I'm not sure what that would look like. If we had such a check, k8s or an external system could use it to determine whether the pod/container needs to be restarted or not.
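As a rough sketch (everything below is hypothetical, including the /health path, HealthHandler and the module-level monitor_thread reference): an endpoint that returns 503 once the event-capturing thread has died, which a Kubernetes livenessProbe (httpGet on /health) could poll. Thread liveness alone would not catch the case above where the thread stays alive but is stuck in the recovery loop, so a "time since last successful capture" check might also be needed.

import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

monitor_thread = None  # would be set to the task-event thread at startup

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        if monitor_thread is not None and monitor_thread.is_alive():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)
            self.end_headers()
            self.wfile.write(b"monitor thread dead")

def serve_health(port=9541):
    # Illustrative port, separate from the metrics HTTPD on 9540
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()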

itsx commented 4 years ago

We have the same problem. Both our celery-exporter instances stopped capturing Celery metrics after a connection error. The celery-exporter process (inside a Docker container in our case) keeps running, so we have to restart the container manually to get metrics again.

Last log lines:

[2020-08-10 13:14:13,288] task-thread:ERROR: Connection failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 61, in _monitor
    recv.capture(limit=None, timeout=None, wakeup=True)
  File "/usr/local/lib/python3.6/site-packages/celery/events/receiver.py", line 93, in capture
    for _ in self.consume(limit=limit, timeout=timeout, wakeup=wakeup):
  File "/usr/local/lib/python3.6/site-packages/kombu/mixins.py", line 197, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 323, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
    return connection.drain_events(**kwargs)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
    (class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'

After restarting the celery-exporter container (docker restart), the following additional log lines are visible via docker logs:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 61, in _monitor
    recv.capture(limit=None, timeout=None, wakeup=True)
  File "/usr/local/lib/python3.6/site-packages/celery/events/receiver.py", line 93, in capture
    for _ in self.consume(limit=limit, timeout=timeout, wakeup=wakeup):
  File "/usr/local/lib/python3.6/site-packages/kombu/mixins.py", line 197, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 323, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
    return connection.drain_events(**kwargs)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
    (class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 32, in run
    self._monitor()
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 64, in _monitor
    setup_metrics(self._app, self._namespace)
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 122, in setup_metrics
    for name, labels, cnt in metric.samples:
ValueError: too many values to unpack (expected 3)
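As an aside, that final ValueError looks like a prometheus_client version mismatch: newer versions yield Sample namedtuples with more than three fields, so the fixed three-way unpack in setup_metrics fails. A hedged sketch of a version-tolerant loop (purely illustrative, not a patch to the actual code):

from prometheus_client import REGISTRY

for metric in REGISTRY.collect():
    for sample in metric.samples:
        # Slicing works for both the old 3-tuple samples and the newer,
        # wider Sample namedtuples (which also carry a timestamp and, in
        # later versions, an exemplar).
        name, labels, value = sample[:3]
        print(name, labels, value)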

Last log lines from the second celery-exporter instance (in this case the container has not been restarted yet):

[2020-08-10 13:09:08,900] workers-thread:ERROR: Error while pinging workers
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery_exporter/monitor.py", line 86, in update_workers_count
    len(self._app.control.ping(timeout=self.celery_ping_timeout_seconds))
  File "/usr/local/lib/python3.6/site-packages/celery/app/control.py", line 267, in ping
    timeout=timeout, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/celery/app/control.py", line 480, in broadcast
    limit, callback, channel=channel,
  File "/usr/local/lib/python3.6/site-packages/kombu/pidbox.py", line 346, in _broadcast
    matcher=matcher)
  File "/usr/local/lib/python3.6/site-packages/kombu/pidbox.py", line 302, in _publish
    maybe_declare(self.reply_queue(channel))
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 121, in maybe_declare
    return _maybe_declare(entity, channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 161, in _maybe_declare
    entity.declare(channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 608, in declare
    self._create_exchange(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 615, in _create_exchange
    self.exchange.declare(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 186, in declare
    nowait=nowait, passive=passive,
  File "/usr/local/lib/python3.6/site-packages/amqp/channel.py", line 614, in exchange_declare
    wait=None if nowait else spec.Exchange.DeclareOk,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 68, in send_method
    return self.wait(wait, returns_tuple=returns_tuple)
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 88, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.6/site-packages/amqp/connection.py", line 648, in _on_close
    (class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'

To me it seems like some sort of deadlock.