ringerc opened 5 months ago
I think Retry is a great idea. I've noticed that some key metric-collection queries sporadically fail in large-scale production environments, causing monitoring graphs to show short blips.
The rough idea is to add a retry field to the Collector, allowing 1-3 retries. For system-critical metrics, we could set a default of 1-2 retries with a fixed interval (say 100ms?) to avoid the blips caused by sporadic errors. Of course, retries should only happen when the query fails with specific retryable errors.
As for exponential backoff: given that our current metric scraping runs queries serially over a single connection, it might not be particularly beneficial. Maybe a fixed 100ms is a good starting point?
These are just my rough thoughts. I'm currently adapting the monitoring metrics for PostgreSQL 17, so if you're interested in implementing this, I'd be more than happy to collaborate.
Interesting - you've described a different problem, where a query sporadically fails but you want to retry it within one scrape.
I'm not concerned with metric gaps here. Rather, I'm concerned about a failing query spamming the log with verbose errors, especially if a deployed configuration includes a query that happens to break on some specific combination of Postgres version, configuration, etc.
That's why I propose an exponential backoff where the query is temporarily disabled (skipped) in future scrapes for some time period. If it's a transient failure, the query will resume scraping later. If it's a persistent failure such as a query bug, the query will eventually run only very occasionally and log an error, so it's still possible to see why it's not being collected, but it won't spam the log.
To do this I'm thinking of adding another non-persistent Query field holding a back-off duration. On the first scrape error the field is set to an initial back-off value from the query config, falling back to a global env-var/CLI-arg value if unset. On subsequent scrape errors, the back-off is multiplied by (2 ± a small random jitter). On each scrape, the back-off is checked against the timestamp of the last execution; if we're still within a back-off period, the metric scrape is skipped, and a new query metric indicates the failure-backoff skip. Once the back-off period has passed, the query is re-run: on failure the back-off is increased as described, on success it is reset to 0.
Do you have any opinion on the idea of having exponential back-off on re-trying failed metric scrapes to reduce log-spam in case of problems?
If it's an idea you're open to I can look at cooking up a patch to support it if my initial PoC of this scraper works out.
CloudNative-PG's built-in scraper, which I currently use, doesn't do this either. But log-spam is a real problem with it if there's a mistake in a query. So it's something I'd like to see if I can implement here if I adopt this scraper.