UWIT-UE / am2alertapi

Prometheus alertmanager to UW alertAPI
GNU General Public License v3.0

am2alertapi fix metrics reporting for multi-worker configuration #22

Closed by EricHorst 2 years ago

EricHorst commented 2 years ago

While diagnosing a recent outage, we noticed that the am2alertapi counters were not increasing as expected. After some thought it became clear that the cause is the multi-worker configuration: the workers run as separate processes, so each worker keeps its own metrics independently. (The worker count was increased in October 2021 in https://github.com/UWIT-UE/am2alertapi/commit/4539a9ced987c19789263bf9b84ef7d76143738d)
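
To illustrate the failure mode, here is a small standard-library simulation. It is only a sketch: the `Worker` class, the worker count of 4, and the round-robin dispatch are assumptions made for the demo, not how am2alertapi or its server actually distributes requests.

```python
import itertools


class Worker:
    """Stand-in for one worker process holding its own private counter."""

    def __init__(self, name):
        self.name = name
        self.alerts_received = 0

    def handle_alert(self):
        self.alerts_received += 1

    def scrape(self):
        # A scrape only sees the counter of the worker that answers it.
        return self.alerts_received


# Hypothetical setup: 4 workers, requests handed out round-robin.
workers = [Worker(f"worker-{i}") for i in range(4)]
dispatch = itertools.cycle(workers)

# Ten alerts arrive and are spread across the workers.
for _ in range(10):
    next(dispatch).handle_alert()

# Four consecutive scrapes, each answered by a different worker.
scrapes = [next(dispatch).scrape() for _ in range(4)]
true_total = sum(w.alerts_received for w in workers)

print("scraped values:", scrapes)   # [2, 2, 3, 3]
print("true total:", true_total)    # 10
```

No single scrape ever reports the true total, and the scraped value can even go down between scrapes, which Prometheus treats as a counter reset.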

Research suggests one solution is to use prometheus_client multiprocess mode; there is an example here: https://github.com/amitsaha/python-prometheus-demo/tree/master/flask_app_prometheus_multiprocessing
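
A minimal sketch of what multiprocess mode looks like with prometheus_client follows. The metric name `am2alertapi_alerts_received` and the helper `metrics_body` are hypothetical, and this is not necessarily how am2alertapi.py implements it. The temporary directory is for a standalone demo only; in a real deployment `PROMETHEUS_MULTIPROC_DIR` would be set once in the service environment so that every worker shares the same directory, and that directory should be emptied between restarts.

```python
import os
import tempfile

# The variable must be set before prometheus_client is imported, because the
# library decides at import time whether to use file-backed metric values.
# Demo-only fallback: real deployments set this outside the process.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp(prefix="prom_mp_"))

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# Hypothetical metric name, used for illustration only.
ALERTS_RECEIVED = Counter(
    "am2alertapi_alerts_received",
    "Alerts received from Alertmanager (hypothetical metric)",
)


def metrics_body() -> bytes:
    """Build the /metrics response by merging the files of all workers."""
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)


if __name__ == "__main__":
    ALERTS_RECEIVED.inc()
    print(metrics_body().decode())
```

In a web app, the `/metrics` route would return `metrics_body()` instead of the default registry output, so whichever worker answers the scrape reports the combined counters.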

Alternatively, add the worker number as a metric label and aggregate in Prometheus.
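
The label-based alternative could be sketched as below. The metric name, the helper functions, and the use of the process ID as the "worker number" are all assumptions for illustration; a stable worker index would avoid creating new series on every restart.

```python
import os

from prometheus_client import CollectorRegistry, Counter, generate_latest

# A dedicated registry keeps the demo isolated from the default one.
registry = CollectorRegistry()

# Assumption: the process ID stands in for the worker number.
WORKER_ID = str(os.getpid())

# Hypothetical metric name with a "worker" label.
ALERTS_RECEIVED = Counter(
    "am2alertapi_alerts_received",
    "Alerts received from Alertmanager (hypothetical metric)",
    ["worker"],
    registry=registry,
)


def record_alert() -> None:
    """Increment this worker's own labeled series."""
    ALERTS_RECEIVED.labels(worker=WORKER_ID).inc()


def metrics_body() -> bytes:
    """Expose only this worker's series; Prometheus does the aggregation."""
    return generate_latest(registry)


if __name__ == "__main__":
    record_alert()
    print(metrics_body().decode())
```

On the Prometheus side the per-worker series would then be combined with an aggregation such as `sum without (worker) (am2alertapi_alerts_received_total)`. One caveat of this approach, assuming all workers share one port: each scrape still reaches only one worker, so a given worker's series is refreshed only when that worker happens to answer.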

Here's a reference: https://echorand.me/posts/python-prometheus-monitoring-options/

EricHorst commented 2 years ago
  1. Implemented multi-process metrics in am2alertapi.py
  2. Rolled out the new version to all clusters and to prom01/prom02: https://github.com/UWIT-UE/am2alertapi/releases/tag/v1.0.7
  3. Changed alert rules to properly sum counters: https://github.com/UWIT-UE/mci-ops/pull/391
  4. Added am2alertapi scraping and alert rules to prom01/prom02, which did not have them.