Infrastructure monitoring

rodecker commented 4 years ago

Some kind of monitoring system that sends email when ring infrastructure servers or services are down. Monitoring of hosts and services should be configured automatically when they are added to Ansible.

rodecker commented 4 years ago

Icinga, another Nagios fork, or something else entirely?

leoluk commented 4 years ago

Prometheus with Alertmanager :)

isodude commented 3 years ago

Telegraf + VictoriaMetrics was really nice to set up. Either have Telegraf push Influx line protocol to VictoriaMetrics, or let VictoriaMetrics scrape Prometheus-format metrics from Telegraf.

I also added MTR support to my Telegraf fork, which made it easy to get nice stats in Grafana on how hops evolve over time. This could be especially useful for the Ring.

Let me know if it's of interest.
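To make the "push Influx" option above a bit more concrete, here is a minimal sketch of what one Influx line protocol record looks like when built by hand. The measurement, tag and field names (`mtr`, `hop`, `avg_ms`) are made up for illustration and are not what the fork actually emits; only float fields are handled.

```python
def _esc(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one record as: measurement,tag=value field=value [timestamp].

    Simplified: every field value is written as a float, and tags and
    fields are sorted by key so the output is deterministic.
    """
    if not fields:
        raise ValueError("line protocol requires at least one field")
    # Measurement names escape commas and spaces.
    head = _esc(measurement, ", ")
    # Tag keys and values escape commas, equals signs and spaces.
    for key in sorted(tags):
        head += "," + _esc(key, ",= ") + "=" + _esc(str(tags[key]), ",= ")
    body = ",".join(
        f"{_esc(k, ',= ')}={float(v)!r}" for k, v in sorted(fields.items())
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {int(timestamp_ns)}"
    return line


if __name__ == "__main__":
    # Hypothetical per-hop latency sample.
    print(to_line_protocol("mtr", {"host": "node01", "hop": "3"}, {"avg_ms": 12.5}))
```

Records like this are what would be posted to the time series database in the push variant; in the pull variant the same values would instead be exposed for scraping.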

leoluk commented 3 years ago

For monitoring (vs. telemetry), Prometheus, node_exporter and Alertmanager are hard to beat.
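On the original requirement that hosts get monitored automatically once they are in Ansible: one possible approach, sketched here under assumptions, is to render the host list into a Prometheus file-based service discovery file. The hostnames and the `role` label below are hypothetical, and how the host list is pulled out of the Ansible inventory is left out; port 9100 is node_exporter's default.

```python
import json

NODE_EXPORTER_PORT = 9100  # node_exporter's default listen port


def build_file_sd(hosts, port=NODE_EXPORTER_PORT, labels=None):
    """Return a file_sd structure (a list of target groups) for the hosts.

    Duplicate hostnames are dropped and targets are sorted, so the
    generated file only changes when the host list changes.
    """
    targets = sorted({f"{host}:{port}" for host in hosts})
    return [{"targets": targets, "labels": dict(labels or {})}]


def render_file_sd(hosts, **kwargs):
    """Serialize the target groups as JSON, ready to write to disk."""
    return json.dumps(build_file_sd(hosts, **kwargs), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical hosts; in practice these would come from the inventory.
    print(render_file_sd(
        ["node01.example.net", "node02.example.net"],
        labels={"role": "ring-node"},
    ))
```

Prometheus re-reads file_sd files when they change, so a playbook that regenerates this file would be enough to add or remove hosts without touching the main configuration.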

isodude commented 3 years ago

I tried node_exporter first, but the 'everything shall be run on a different port' theme did not sit well with me.

So how it works: Telegraf exports data via an output plugin that emits the Prometheus format, and VictoriaMetrics pulls that data. Telegraf, btw, has excellent out-of-the-box support for most things and can execute custom binaries that export in different formats (influx, json, simple etc). You can still run Alertmanager as you normally would, or use VictoriaMetrics' own vmalert: https://docs.victoriametrics.com/vmalert.html.
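To illustrate the pull side of this, here is a rough sketch of reading the Prometheus text format that a scrape returns. It is a simplified parser, not what VictoriaMetrics does internally: escape sequences inside label values are left untouched and unparseable lines are skipped. The sample metric in the usage example is made up.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair inside the braces: key="value".
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    scraped = (
        "# TYPE cpu_usage_idle gauge\n"
        'cpu_usage_idle{cpu="cpu-total",host="node01"} 97.5\n'
    )
    for name, labels, value in parse_exposition(scraped):
        print(name, labels, value)
```

Anything that can produce this text over HTTP can be scraped the same way, which is why the exporter and the storage backend are largely interchangeable in this setup.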

At the same time you get the same kind of features as Thanos, such as long-term storage.

I did have a look and there's a fairly recent VictoriaMetrics available straight in the repo. If we want MTR support, however, I would need to build Telegraf from my own fork. I also improved the TLS client certificate support a bit, which means you could use client certificates between all nodes for transporting data.

So in short, node_exporter + Alertmanager is technically the same as Telegraf + VictoriaMetrics.

isodude commented 3 years ago

If people like running Prometheus, maybe this is interesting? https://opensourcelibs.com/lib/network_exporter

NLNOG / ring-ansible

Infrastructure monitoring #87