canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Smarter metrics caching #9881

Closed simondeziel closed 2 years ago

simondeziel commented 2 years ago

LXD servers keep a cached copy of the metrics for 15s to handle multiple scrapers. Prometheus scrapes every ~15s by default, and the scrape interval has some jitter added to spread the load, so a scrape that lands just before the cache expires risks getting the same results twice.

To mitigate that issue, our doc recommends overriding the job's scrape interval to be 30s. The problem with that is the resulting LXD metrics are then harder to correlate with metrics from other jobs using the default 15s. Because of that, I think it would be desirable to slightly modify the caching behavior of LXD's metrics. I'd propose:

1) when the cache is (re-)populated, note who's the "initial requestor" (either the IP, Unix UID or TLS fingerprint)
2) cache the resulting metrics as normal
3) when a cached copy is about to be used, check if:
   3a) "requestor" == "initial requestor": build a new cache, overwrite the old one and serve the fresh copy
   3b) "requestor" != "initial requestor": serve the cached entry

This way, a returning scraper will always have fresh data but multiple scrapers won't cause new metrics to be collected unless the data is actually stale (>15s).

Put another way, any Prometheus scrape interval will work, and in the multiple-scrapers scenario the additional scrapers will get metrics that are at most 15s old.
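To make the proposal concrete, here is a minimal Go sketch of the three steps. It is not taken from the LXD code base: `metricsCache`, `Get`, `newDemoCache` and the `sample-N` payloads are all hypothetical names, and the requestor is reduced to a plain string standing in for the IP, Unix UID or TLS fingerprint.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metricsCache is a hypothetical stand-in for LXD's metrics cache.
type metricsCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	now       func() time.Time // injectable clock, so the demo needs no sleeping
	collect   func() string    // the expensive metrics collection
	data      string
	expiry    time.Time
	initiator string // "initial requestor": IP, Unix UID or TLS fingerprint
}

// Get applies steps 1-3 of the proposal for the given requestor.
func (c *metricsCache) Get(requestor string) string {
	c.mu.Lock()
	defer c.mu.Unlock()

	fresh := c.now().Before(c.expiry)
	if fresh && requestor != c.initiator {
		// 3b) a different scraper: serve the cached entry.
		return c.data
	}

	// Either the cache is empty or stale, or 3a) the initial requestor
	// came back: rebuild, note the requestor (step 1) and cache (step 2).
	c.data = c.collect()
	c.expiry = c.now().Add(c.ttl)
	c.initiator = requestor
	return c.data
}

// newDemoCache returns a cache driven by a fake clock, plus a function
// that advances that clock by a number of seconds.
func newDemoCache(ttlSeconds int) (*metricsCache, func(seconds int)) {
	clock := time.Unix(0, 0)
	runs := 0
	c := &metricsCache{
		ttl: time.Duration(ttlSeconds) * time.Second,
		now: func() time.Time { return clock },
		collect: func() string {
			runs++
			return fmt.Sprintf("sample-%d", runs)
		},
	}
	advance := func(seconds int) {
		clock = clock.Add(time.Duration(seconds) * time.Second)
	}
	return c, advance
}

func main() {
	c, advance := newDemoCache(15)
	fmt.Println(c.Get("prom-a")) // first scrape populates the cache: sample-1
	advance(5)
	fmt.Println(c.Get("prom-b")) // second scraper gets the cached copy: sample-1
	advance(9)
	fmt.Println(c.Get("prom-a")) // returning scraper at t=14s gets fresh data: sample-2
}
```

Note that in this sketch a rebuild triggered by the returning scraper also resets the expiry, so other scrapers keep being served from the newest copy.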

stgraber commented 2 years ago

I'm not sure I like that much: a Prometheus gone wild could hit LXD in a loop and create a massive amount of load, which doesn't seem desirable.

If our concern is compatibility with the 15s scrape interval, then reducing our cache to 12s or so is likely the better option here.
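A rough way to see why a cache lifetime shorter than the scrape interval helps is to replay a few scrape arrival times against a plain TTL cache. The Go sketch below does that. The function name `duplicateScrapes` and the arrival times (a nominal 15s interval with about 200ms of jitter) are made up for illustration and are not measurements from a real Prometheus.

```go
package main

import "fmt"

// duplicateScrapes counts the scrapes that are answered from a cache entry
// populated by an earlier scrape, for a plain TTL cache with one scraper.
// All times are in milliseconds.
func duplicateScrapes(ttlMs int, scrapeTimesMs []int) int {
	duplicates := 0
	expiry := -1 // no cache entry yet
	for _, t := range scrapeTimesMs {
		if expiry >= 0 && t < expiry {
			// Cache still valid: the scraper sees the previous sample again.
			duplicates++
			continue
		}
		// Cache empty or expired: collect fresh metrics and restart the TTL.
		expiry = t + ttlMs
	}
	return duplicates
}

func main() {
	// Hypothetical arrival times for a 15s interval with some jitter.
	scrapes := []int{0, 14800, 30000, 44800, 60000, 74900}

	fmt.Println(duplicateScrapes(15000, scrapes)) // 15s cache: prints 3
	fmt.Println(duplicateScrapes(12000, scrapes)) // 12s cache: prints 0
}
```

With these assumed timings, every scrape that arrives slightly early hits the 15s cache, while a 12s cache has always expired by the time the next scrape comes in.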

simondeziel commented 2 years ago

I've seen scrape intervals of 10-60s recommended (and advice to avoid going above a maximum of 2m), so I think that if we are to lower the cache duration, we should go with ~8s. What do you think?