Open danrademacher opened 1 year ago
The OSM-Seed Minute Replication Files query looks like this:
SELECT count(*) FROM Log WHERE `message` LIKE '%s3://osmseed-staging/replication/minute/state.txt%' AND `cluster_name`='osmseed-production'
I think this means that if that returns nothing for 5 minutes, then we should get an alert:
I wonder if this is the issue:
I followed the docs to allow for a new "lost signal" alert:
That might resolve the specific issue with lost replications not alerting us.
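To illustrate why a count-based threshold alone may never fire when the log lines stop entirely, here is a minimal Python sketch. It is not New Relic's implementation. It assumes that an aggregation window with no matching data is skipped rather than evaluated as zero, and the function names and the 5-window expiration are hypothetical, chosen to mirror the 5 minutes mentioned above.

```python
from typing import List, Optional

# Each entry is the query result for a one-minute window.
# None means "no data arrived in that window" (assumption: such windows
# are skipped by a plain threshold condition rather than treated as 0).
Window = Optional[int]


def threshold_fires(counts: List[Window], minimum: int = 1) -> bool:
    """Threshold-only condition: looks only at windows that contain data."""
    return any(c is not None and c < minimum for c in counts)


def lost_signal_fires(counts: List[Window], expiration_windows: int = 5) -> bool:
    """Lost-signal condition: fires after N consecutive windows with no data."""
    gap = 0
    for c in counts:
        gap = gap + 1 if c is None else 0
        if gap >= expiration_windows:
            return True
    return False


healthy = [1, 1, 1, 1, 1, 1, 1]
stalled = [1, 1, None, None, None, None, None]  # replication stops after 2 minutes

print("threshold on stalled:  ", threshold_fires(stalled))
print("lost signal on stalled:", lost_signal_fires(stalled))
print("lost signal on healthy:", lost_signal_fires(healthy))
```

Under these assumptions, the threshold condition stays silent on the stalled series because there is nothing left to evaluate, while the lost-signal condition catches the gap.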
New Relic is being used to evaluate the URL services, not resource usage. Separately, we are running Prometheus in both the production and staging clusters. Prometheus is useful for monitoring usage and limits within the cluster, which helps evaluate resource usage for nodes, pods, and databases. What is missing is the alerting component, which can be reviewed at: Prometheus Alerting.
The main issue has been addressed by https://github.com/OpenHistoricalMap/issues/issues/573. Currently, we are not facing many issues related to this, but it is something to discuss with Sanjay to implement the alerting section.
We have had some recurring issues with the tiles and missing replications, as noted over in #501.
We have these alert policies on New Relic:
Since the tiler issues in #501 have been traced to missing replication files, it seems like our "OSM-Seed Minute Replication-files" alerting in New Relic is not working as expected.
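As a complement to log-based alerting, one option is to check the age of the replication `state.txt` directly. The sketch below is only an illustration, not something already deployed: it assumes the file follows the usual Osmosis-style layout (`sequenceNumber=` and `timestamp=` lines, with colons escaped as `\:`), the function names are hypothetical, and the 5-minute limit is borrowed from the alert discussion. Fetching the file from S3 is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def parse_state(text: str) -> Dict[str, str]:
    """Parse key=value lines from a state.txt body, skipping comments."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Osmosis writes colons in the timestamp as "\:".
        fields[key] = value.replace("\\:", ":")
    return fields


def replication_age(text: str, now: datetime) -> timedelta:
    """Return how far the state.txt timestamp lags behind `now` (UTC)."""
    stamp = datetime.strptime(
        parse_state(text)["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - stamp


def is_stale(text: str, now: datetime,
             max_age: timedelta = timedelta(minutes=5)) -> bool:
    """True if the last replication is older than max_age."""
    return replication_age(text, now) > max_age


# Made-up sample content for illustration only.
SAMPLE = (
    "#Mon Jan 01 00:01:05 UTC 2024\n"
    "sequenceNumber=12345\n"
    "timestamp=2024-01-01T00\\:01\\:00Z\n"
)

if __name__ == "__main__":
    print(is_stale(SAMPLE, datetime.now(timezone.utc)))
```

A check like this could run on a schedule and report to whichever alerting channel is chosen, so a stalled replication is detected even if no log line is emitted.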
In addition to New Relic, we have simple Uptime Robot service status tests over at https://stats.uptimerobot.com/0BBDoIkXKJ