Issue
For large clouds, the hooks run for monitors-relation-changed can take a very long time. In some cases it takes around 3 hours for one hook to finish, and several days for the entire set to complete. While this is most common during the initial charm deployment, it can also be triggered by an NRPE charm refresh.
This change should fix #56.
Solution
After investigating the source of the slowness on one cloud, I found it was due to the lines of code where the relation data is set, for example:
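The original snippet is not reproduced here, but the pattern in question can be sketched as follows. This is a toy stand-in, not the charm's actual code: in ops, each assignment to a relation data bag goes out to the Juju backend, so every individual write carries a multi-second round-trip cost.

```python
import json


class RelationData(dict):
    """Toy stand-in for an ops relation data bag: every assignment is a
    separate (slow) call out to the Juju backend."""

    writes = 0  # count backend round-trips across all instances

    def __setitem__(self, key, value):
        RelationData.writes += 1  # in a real hook this write costs ~3-5 s
        super().__setitem__(key, value)


data = RelationData()
# One write per monitored target: with thousands of targets, the
# per-write round-trip dominates the hook's runtime.
for i in range(1000):
    data[f"target-{i}"] = json.dumps({"job": i})
print(RelationData.writes)
```

At 3-5 s per write, a thousand such writes alone account for roughly an hour of hook time, which matches the observed slowness.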
Executing each of these lines takes 3-5 s. This happens hundreds or thousands of times during a single hook, so the time quickly adds up. My solution is to allow a "prepared update" context manager to be used:

```python
with self.metrics_aggregator.as_prepared_update():
```

which caches the changes to the relation data and only applies them at the end of the context manager.
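The mechanics of the prepared update can be sketched like this. The class and method names mirror the PR, but the internals are assumed for illustration: writes made inside the context go to an in-memory cache and are flushed to the (slow) backing store in one pass on exit.

```python
import json
from contextlib import contextmanager


class MetricsAggregator:
    """Minimal sketch of the proposed 'prepared update' pattern (internals
    assumed, not taken from the actual prometheus_scrape library)."""

    def __init__(self, relation_data):
        self._relation_data = relation_data  # the slow, backend-backed bag
        self._pending = None  # in-memory cache while a prepared update is open

    def _set(self, key, value):
        if self._pending is not None:
            self._pending[key] = value  # cheap in-memory write
        else:
            self._relation_data[key] = value  # immediate (slow) write

    @contextmanager
    def as_prepared_update(self):
        self._pending = {}
        try:
            yield self
        finally:
            # Flush all cached changes in a single pass at the end of the block.
            self._relation_data.update(self._pending)
            self._pending = None


backend = {}
agg = MetricsAggregator(backend)
with agg.as_prepared_update():
    for i in range(3):
        agg._set(f"target-{i}", json.dumps({"job": i}))
print(backend)  # the keys land in the backing store only after the context exits
```

The key design point is that the number of slow backend operations no longer scales with the number of keys written, only with the number of prepared-update blocks.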
I have made these changes to the prometheus_scrape.py file in the library. I realise this is a dependency, and this change probably should go into the upstream repo. However, I'd like to see whether it makes sense in the context of cos-proxy-operator first.
Testing Instructions
The changes pass the basic tox tests. In addition, I have deployed a small cluster locally and compared the output of:
before and after the change. In both cases I did a completely new deployment of cos-proxy, so there would be no contamination of the relation data. I found no differences in the relation data.
As you correctly pointed out, we do need to cache relation writes, because ops doesn't. Let's see whether we need to roll our own or whether this may end up on the ops roadmap.
Prometheus owns the scrape lib, so if we roll our own, the change would need to be made there.