canonical / cos-proxy-operator

https://charmhub.io/cos-proxy
Apache License 2.0
2 stars 12 forks source link

Prepared update for prometheus #139

Open pengwyn opened 5 months ago

pengwyn commented 5 months ago

Issue

For large clouds, the hooks run for monitors-relation-changed can take a very long time. In some cases, it takes around 3 hours for one hook to finish, and several days for the entire set to complete. While this is most common during the initial charm deployment, it can also be triggered by a NRPE charm refresh.

This change should fix #56.

Solution

After investigating the source of the slowness on one cloud, I found it was due to the lines of code where the relation data is set, for example:

relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)

Executing each of these lines takes 3-5s. This happens hundreds or thousands of times during a single hook, so that time quickly adds up. My solution is to allow a "prepared update" context manager to be used:

with self.metrics_aggregator.as_prepared_update():

which caches the changes to the relation data and only applies at the end of the context manager.

I have made these changes to the prometheus_scrap.py file in the library. I realise this is a dependency, and probably should go into the upstream repo for this change. However, I'd like to see if this change makes sense in the context of cos-proxy-operator first.

Testing Instructions

The changes pass the basic tox tests. In addition, I have deployed a small cluster locally and tested the output of:

juju exec --unit cos-proxy/leader -- relation-get -r $(juju exec --unit cos-proxy/leader -- relation-ids monitors) - cos-proxy --app

between the output before and after the change. In the two cases, I did a completely new deployment of cos-proxy, so there would be no contamination of the relation data. I found that there were no differences in the relation data.

sed-i commented 4 months ago

Thanks @pengwyn. Per our discussion on matrix,

  1. As you correctly pointed out, we do need to cache relation writes, because ops doesn't. Let's see if we need to roll our own or if this may end up on ops roadmap.
  2. Prometheus owns the scrape lib, so if we roll our own, the change would need to be done there.