Infrastructure alert needs

OpenHistoricalMap / issues

File your issues here, regardless of repo until we get all our repos squared away; we don't want to miss anything.

Creative Commons Zero v1.0 Universal


Infrastructure alert needs #507

Open danrademacher opened 1 year ago

danrademacher commented 1 year ago

We have had some recurring issues with the tiles and missing replications, as noted over in #501.

We have these alert policies on New Relic:

Policy	Conditions	Open issues
Kubernetes default alert policy	7	0
OHM-Deploy Tiler-imposm	1	0
OSM-Seed Minute Replication-files	1	0
Site verification - openhistoricalmap.org	1	0
Site verification - tasks.openhistoricalmap.org	1	0

Since the tiler issues in #501 have been traced to missing replication files, it seems like our "OSM-Seed Minute Replication-files" alerting in New Relic is not working as expected.

In addition to New Relic, we have simple Uptime Robot service status tests over at https://stats.uptimerobot.com/0BBDoIkXKJ

danrademacher commented 1 year ago

The OSM-Seed Minute Replication-files query looks like this:

SELECT count(*) FROM Log WHERE `message` LIKE '%s3://osmseed-staging/replication/minute/state.txt%' AND `cluster_name`='osmseed-production'

I think this means that if that returns nothing for 5 minutes, then we should get an alert:

I wonder if this is the issue:

I followed the docs to allow for a new "lost signal" alert:

That might resolve the specific issue with lost replications not alerting us.
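For illustration, here is a minimal sketch of a check that does not depend on log lines at all: it reads the `timestamp=` entry from a replication `state.txt` and flags it as stale when it is older than a threshold. This is not what is deployed; it assumes the file follows the usual OSM replication layout (Java-properties style, with colons escaped as `\:`), the sample values are made up, and the 5-minute threshold just mirrors the window mentioned above. Fetching the file from S3 is left out.

```python
from datetime import datetime, timedelta, timezone


def parse_state_timestamp(state_text: str) -> datetime:
    """Extract the `timestamp=` value from a replication state.txt body."""
    for line in state_text.splitlines():
        if line.startswith("timestamp="):
            # Properties-style files escape colons as "\:", so undo that first.
            raw = line.split("=", 1)[1].strip().replace("\\:", ":")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found in state.txt")


def replication_is_stale(state_text: str, now: datetime,
                         max_age: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when the last replication is older than max_age."""
    return now - parse_state_timestamp(state_text) > max_age


# Hypothetical sample content, for illustration only.
SAMPLE_STATE = (
    "#Mon Jan 02 00:10:05 UTC 2023\n"
    "sequenceNumber=123456\n"
    "timestamp=2023-01-02T00\\:10\\:02Z\n"
)

if __name__ == "__main__":
    now = datetime(2023, 1, 2, 0, 20, 0, tzinfo=timezone.utc)
    print("stale:", replication_is_stale(SAMPLE_STATE, now))
```

A check like this would catch the case where replication silently stops, since it looks at the age of the last state file rather than waiting for a log message that never arrives.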

Rub21 commented 6 months ago

New Relic is being used to check the URL services, not to monitor resource usage. Separately, we are running Prometheus in both the production and staging clusters. Prometheus is useful for monitoring usage and limits within the cluster, which helps in evaluating resource usage for nodes, pods, and databases. What is missing is the alerting component, which can be reviewed at: Prometheus Alerting.

The main issue has been addressed by https://github.com/OpenHistoricalMap/issues/issues/573. We are not currently seeing many issues related to this, but implementing the alerting component is something to discuss with Sanjay.
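As a rough idea of what the missing alerting piece has to decide, the sketch below takes the JSON body returned by a Prometheus instant query (`/api/v1/query`) and lists the series whose value exceeds a threshold. In Prometheus itself this would normally be expressed as alerting rules routed through Alertmanager; the Python here only illustrates the threshold check. The pod names, values, and the 0.9 threshold are hypothetical.

```python
def series_over_threshold(response: dict, threshold: float) -> list:
    """Return (labels, value) pairs from an instant-vector response above threshold."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    firing = []
    for sample in data["result"]:
        # Each sample value is a pair: [unix_timestamp, "value as string"].
        _timestamp, raw_value = sample["value"]
        value = float(raw_value)
        if value > threshold:
            firing.append((sample["metric"], value))
    return firing


# Hypothetical response, shaped like the Prometheus HTTP API instant-query output.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "tiler-db-0"}, "value": [1700000000, "0.95"]},
            {"metric": {"pod": "web-0"}, "value": [1700000000, "0.40"]},
        ],
    },
}

if __name__ == "__main__":
    for labels, value in series_over_threshold(SAMPLE_RESPONSE, 0.9):
        print(labels["pod"], value)
```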