canonical / prometheus-juju-exporter

GNU General Public License v3.0

Huge increase in memory usage on juju controller 2.9.38.1 that eventually leads to an OOM kill of jujud on the controller #27

Closed: przemeklal closed this issue 1 year ago

przemeklal commented 1 year ago

We see a memory leak (huge memory usage spikes) when p-j-e talks to juju 2.9.38.1 controller.

Logs:

2023-02-23T15:43:39Z systemd[1]: Started Service for snap application prometheus-juju-exporter.prometheus-juju-exporter.
2023-02-23T15:43:40Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: 2023-02-23 15:43:40,223 INFO - Configuration parsed successfully
2023-02-23T15:43:40Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: 2023-02-23 15:43:40,223 INFO - Parsed config: /var/snap/prometheus-juju-exporter/3
2023-02-23T15:43:40Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: 2023-02-23 15:43:40,223 INFO - Configuration parsed successfully
2023-02-23T15:43:40Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: 2023-02-23 15:43:40,227 INFO - Collecting gauges...
2023-02-23T15:43:40Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: unknown facade EnvironUpgrader
2023-02-23T15:43:40Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: unexpected facade EnvironUpgrader found, unable to decipher version to use
2023-02-23T15:43:43Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: unknown facade EnvironUpgrader
2023-02-23T15:43:43Z prometheus-juju-exporter.prometheus-juju-exporter[13779]: unexpected facade EnvironUpgrader found, unable to decipher version to use
2023-02-23T15:43:44Z systemd[1]: Stopping Service for snap application prometheus-juju-exporter.prometheus-juju-exporter...
2023-02-23T15:43:44Z systemd[1]: snap.prometheus-juju-exporter.prometheus-juju-exporter.service: Succeeded.
2023-02-23T15:43:44Z systemd[1]: Stopped Service for snap application prometheus-juju-exporter.prometheus-juju-exporter.

After starting the exporter, memory usage grows rapidly on both jujud and the exporter itself and eventually consumes all available memory.

Please note that it never reaches this message:

2023-02-23T15:43:44Z prometheus-juju-exporter.prometheus-juju-exporter[6103]: 2023-02-23 15:43:44,077 INFO - Gauges collected and ready for exporting.

With the --debug flag I noticed that it reaches the 3rd model and then memory starts leaking. There are 5 models in total on this controller.

przemeklal commented 1 year ago

I just revoked access to this "bad" model and the exporter was able to collect metrics from the remaining 4 models without any issues.

przemeklal commented 1 year ago

Please find logs and pmap output for jujud and prometheus-juju-exporter: https://private-fileshare.canonical.com/~przemeklal/p-j-e-mem-issues-2.9.38.1.tar.gz

przemeklal commented 1 year ago

Just a note: I rebuilt the snap with juju==2.9.38.1 and the memory leak (excessive memory consumption) issue is still there. It did resolve the unknown facade EnvironUpgrader warnings, though.

agileshaw commented 1 year ago

The log suggests that the process hangs when establishing a connection to the "bad" model (Controller.get_model()). The root cause of this bug still needs to be determined by analyzing the characteristics of the "bad" model.
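
For reference, a minimal sketch of that connection path (not the exporter's actual code), assuming the python-libjuju 2.9 API; the per-model loop here is illustrative only:

```python
# Sketch of the per-model connection path where the hang is observed.
# Assumes python-libjuju 2.9 and existing juju CLI credentials on the host.
import asyncio

from juju.controller import Controller


async def walk_models():
    controller = Controller()
    await controller.connect()  # connect to the current controller
    try:
        for model_name in await controller.list_models():
            print(f"connecting to {model_name} ...")
            # On the "bad" model this call never returns in our testing.
            model = await controller.get_model(model_name)
            print(f"{model_name}: {len(model.machines)} machines")
            await model.disconnect()
    finally:
        await controller.disconnect()


asyncio.run(walk_models())
```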

There's no obvious fix that can be implemented on the snap side. However, a workaround idea that came out of the discussion is to run the collector in a sub-process with a timeout and a memory cap; the collector process would be killed when it hits either limit, while the main process remains intact.
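
To make the idea concrete, here is a rough sketch of such a guard. The collect_gauges callable, the 2 GiB cap and the 300 s timeout are illustrative assumptions, not values from the snap:

```python
# Rough sketch of the workaround: run collection in a child process with an
# address-space cap and a wall-clock timeout, so a runaway collection is
# killed without taking down the exporter's main process.
import multiprocessing
import resource


def _capped_collect(collect_gauges, mem_limit_bytes, result_queue):
    # Limit the child's virtual address space; allocations beyond the cap
    # fail inside the child instead of exhausting the host.
    resource.setrlimit(resource.RLIMIT_AS, (mem_limit_bytes, mem_limit_bytes))
    result_queue.put(collect_gauges())


def collect_with_limits(collect_gauges, mem_limit_bytes=2 * 1024**3, timeout=300):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=_capped_collect, args=(collect_gauges, mem_limit_bytes, queue)
    )
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # collector hit the timeout; main process survives
        proc.join()
        return None
    return queue.get() if not queue.empty() else None
```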

przemeklal commented 1 year ago

@agileshaw Just one more thing: I was able to nearly OOM-kill the controller by running the libjuju script from another machine, so prometheus-juju-exporter and the juju controller don't even need to run on the same machine. I'll update the bug title accordingly.

agileshaw commented 1 year ago

After troubleshooting with the Juju team, the root cause of this issue has been identified: the action-pruner engine on the “problematic” model stopped working a long time ago (possibly a side effect of a juju upgrade), which left millions of action result records in juju’s database. When libjuju tries to establish a connection to this model, it calls jujud on the controller to retrieve all action results, which get loaded into jujud’s memory (hence the huge memory increase). The data is too big to send over to the client, so libjuju never receives it and the connection fails.

The juju team will further investigate the cause of the action-pruner failure and get back to us with their findings.

Currently no action can be taken from our side. We need to wait for https://github.com/juju/python-libjuju/pull/806 to land in the libjuju 2.9 series, which would reduce memory usage when establishing model connections; we could then bump the libjuju version in the snap to pick up the changes. (That PR has since been closed because it would break the current interaction with libjuju 2.9.) We also need to wait for the juju team’s investigation result on the action-pruner failure.

aieri commented 1 year ago

https://bugs.launchpad.net/juju/+bug/2009879 has been filed to improve monitoring in case the action-pruner becomes inactive. As no change is required on the prometheus-juju-exporter side, I'll be closing this bug.