canonical / prometheus-juju-exporter

GNU General Public License v3.0
2 stars 8 forks source link

Exporter disconnected from the controller will serve stale metrics forever and won't reconnect to the controller #42

Closed przemeklal closed 1 year ago

przemeklal commented 1 year ago

I have had an instance of prometheus-juju-exporter hanging since a few days ago after losing connection to the controller:

snroot@juju-f20ba5-1-lxd-10:~# snap logs -f prometheus-juju-exporter 
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Task exception was never retrieved
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: future: <Task finished name='Task-64395' coro=<Connection.reconnect() done, defined at /snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py:732> exception=JujuConnectionError('Unable to connect to any endpoint: <REDACTED_IP>:17070')>
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Traceback (most recent call last):
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 741, in reconnect
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     res = await connector(
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 819, in _connect_with_login
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     await self._connect(endpoints)
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 784, in _connect
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     raise errors.JujuConnectionError(
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: juju.errors.JujuConnectionError: Unable to connect to any endpoint: <REDACTED_IP>:17070

However, the exporter still happily serves stale metrics on :5000/metrics:

root@juju-f20ba5-1-lxd-10:~# curl -s localhost:5000/metrics | head -n4
# HELP juju_machine_state Running status of juju machines
# TYPE juju_machine_state gauge
juju_machine_state{cloud_name="REDACTED",customer="REDACTED",hostname="juju-controller1",job="prometheus-juju-exporter",juju_model="controller",type="kvm"} 1.0
juju_machine_state{cloud_name="REDACTED",customer="REDACTED",hostname="REDACTED",job="prometheus-juju-exporter",juju_model="openstack",type="metal"} 1.0
root@juju-f20ba5-1-lxd-10:~# curl -s localhost:5000/metrics | wc -l
198

Steps to reproduce:

  1. Start p-j-e.
  2. Stop juju controller. (Or do any other action that impacts the connection - in our case it was a failed model migration)
  3. Wait for the connection error in p-j-e.
  4. Curl p-j-e endpoint. It still shows data.
  5. Start the controller, add or remove machines in juju.
  6. Curl p-j-e endpoint. It still returns gauges collected before losing the connection.

Expected behaviour: Prometheus-juju-exporter should try to reconnect or stop showing any data when it's not connected to the controller. It should never serve stale data. The cache should be invalidated at some point.

przemeklal commented 1 year ago

More logs spanning across 1) when it worked 2) when it stopped working 3) when I manually restarted it today:

2023-03-29T10:35:32Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: 2023-03-29 10:35:32,253 INFO - Gauges collected and ready for exporting.
2023-03-29T10:50:32Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: 2023-03-29 10:50:32,321 INFO - Collecting gauges...
2023-03-29T10:50:34Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: unknown facade EnvironUpgrader
2023-03-29T10:50:34Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: unexpected facade EnvironUpgrader found, unable to decipher version to use
2023-03-29T10:50:50Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: unknown facade EnvironUpgrader
2023-03-29T10:50:50Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: unexpected facade EnvironUpgrader found, unable to decipher version to use
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Task exception was never retrieved
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: future: <Task finished name='Task-55147' coro=<Model._watch.<locals>._all_watcher() done, defined at /snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/model.py:1058> exception=JujuAPIError('model migration in progress')>
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Traceback (most recent call last):
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/model.py", line 1061, in _all_watcher
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     model_config = await self.get_config()
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/model.py", line 2017, in get_config
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     result = await config_facade.ModelGet()
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/facade.py", line 483, in wrapper
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     reply = await f(*args, **kwargs)
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/_client2.py", line 10431, in ModelGet
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     reply = await self.rpc(msg)
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/facade.py", line 656, in rpc
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     result = await self.connection.rpc(msg, encoder=TypeEncoder)
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 627, in rpc
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     raise errors.JujuAPIError(result)
2023-03-29T10:50:51Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: juju.errors.JujuAPIError: model migration in progress
2023-03-29T14:03:41Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Receiver: Connection closed, reconnecting
2023-03-29T14:03:41Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Receiver: Connection closed, reconnecting
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Task exception was never retrieved
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: future: <Task finished name='Task-64394' coro=<Connection.reconnect() done, defined at /snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py:732> exception=JujuConnectionError('Unable to connect to any endpoint: redacted:17070')>
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Traceback (most recent call last):
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 741, in reconnect
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     res = await connector(
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 819, in _connect_with_login
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     await self._connect(endpoints)
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 784, in _connect
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     raise errors.JujuConnectionError(
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: juju.errors.JujuConnectionError: Unable to connect to any endpoint: redacted:17070
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Task exception was never retrieved
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: future: <Task finished name='Task-64395' coro=<Connection.reconnect() done, defined at /snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py:732> exception=JujuConnectionError('Unable to connect to any endpoint: redacted:17070')>
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: Traceback (most recent call last):
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 741, in reconnect
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     res = await connector(
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 819, in _connect_with_login
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     await self._connect(endpoints)
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:   File "/snap/prometheus-juju-exporter/31/lib/python3.10/site-packages/juju/client/connection.py", line 784, in _connect
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]:     raise errors.JujuConnectionError(
2023-03-29T14:04:42Z prometheus-juju-exporter.prometheus-juju-exporter[12726]: juju.errors.JujuConnectionError: Unable to connect to any endpoint: redacted:17070
2023-04-03T08:49:48Z systemd[1]: Stopping Service for snap application prometheus-juju-exporter.prometheus-juju-exporter...
2023-04-03T08:49:48Z systemd[1]: Stopped Service for snap application prometheus-juju-exporter.prometheus-juju-exporter.

Also, this type of silent failure makes it tricky to monitor since the only way to find this out is by checking the logs.

agileshaw commented 1 year ago

After further discussion and investigation, the error doesn't not occur when controller is stopped as stated in the original description. Rather, it occurs when prometheus-juju-exporter tries to access a model that is in the middle of a migration (one notable scenario is migration failure). I was able to reproduce the issue with the following steps:

  1. Start prometheus-juju-exporter with the debug mode turned on
  2. Produce a failed migration model with bug LP#1983506 after the prometheus-juju-exporter starts
  3. Check prometheus-juju-exporter logs. It would show the error messages below and the snap hangs
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: 2023-04-17 19:08:32,734 DEBUG - Checking model '<broken_model>'...
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: unknown facade EnvironUpgrader
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: unexpected facade EnvironUpgrader found, unable to decipher version to use
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: Task exception was never retrieved
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: future: <Task finished name='Task-4273' coro=<Model._watch.<locals>._all_watcher() done, defined at /snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/model.py:1058> exception=JujuAPIError('model migration in progress')>
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: Traceback (most recent call last):
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:   File "/snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/model.py", line 1061, in _all_watcher
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:     model_config = await self.get_config()
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:   File "/snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/model.py", line 2017, in get_config
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:     result = await config_facade.ModelGet()
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:   File "/snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/client/facade.py", line 483, in wrapper
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:     reply = await f(*args, **kwargs)
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:   File "/snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/client/_client2.py", line 10431, in ModelGet
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:     reply = await self.rpc(msg)
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:   File "/snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/client/facade.py", line 656, in rpc
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:     result = await self.connection.rpc(msg, encoder=TypeEncoder)
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:   File "/snap/prometheus-juju-exporter/37/lib/python3.10/site-packages/juju/client/connection.py", line 627, in rpc
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]:     raise errors.JujuAPIError(result)
    2023-04-17T19:08:32Z prometheus-juju-exporter.prometheus-juju-exporter[7405]: juju.errors.JujuAPIError: model migration in progress
  4. Add or remove machines in juju.
  5. Curl prometheus-juju-exporter endpoint. It still returns gauges collected before losing the connection.

Edit: I'm also able to reproduce the issue by running a simple libjuju script connecting to a model in the middle of a successful migration.

agileshaw commented 1 year ago

Had a discussion with Juju team and it seems that this is a bug in python-libjuju: it is not correctly communicating the exception back to the main thread. Opened a bug report https://github.com/juju/python-libjuju/issues/829 in libjuju repo.

No changes needed at p-j-e side. But since we may need to update libjuju dependency version in requirements.txt once the fix gets released, I'm keeping the issue open for the time being.

agileshaw commented 1 year ago

The new libjuju version containing the fix doesn't work with juju controller <2.9 (as reported in #51), so we had to revert the change. We will include the fix in a future release after figuring out the best way to deal with the juju version compatibility. Reopening the issue.