centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

Monitoring ansible playbook #185

Open ghost opened 1 year ago

ghost commented 1 year ago

When new nodes join the cluster(s), we need to be able to add them to the monitoring stack. Write an Ansible role/playbook to do that.

The ansible code should:

1) Install prometheus, node-exporter, dcgm (if it's a compute node), and promtail.
2) Configure each service to run as a systemd service under a non-root user.
3) Update /etc/prometheus/files.d/cerberus-cluster.json (or file(s) like it) on the Prometheus instances to add or remove nodes as they come and go from the cerberus cluster.
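For the third item, here is a rough sketch of what the file update could look like, e.g. as a helper script the playbook calls. It assumes the JSON file uses Prometheus' standard file-based service discovery layout (a list of objects with `targets` and `labels`) and that node-exporter listens on its default port 9100. The function name `update_targets` and the `cluster` label are made up for illustration; nothing here is taken from the repo.

```python
import json
import os
import tempfile


def update_targets(path, add=(), remove=(), labels=None):
    """Add and/or remove "host:port" targets in a file_sd JSON file.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    try:
        with open(path) as f:
            groups = json.load(f)
    except FileNotFoundError:
        groups = []
    if not groups:
        # Start with a single target group if the file is missing or empty.
        groups = [{"targets": [], "labels": labels or {}}]

    # Drop removed targets from every group.
    removed = set(remove)
    for group in groups:
        group["targets"] = [t for t in group["targets"] if t not in removed]

    # Append targets that are not already listed anywhere to the first group.
    existing = {t for g in groups for t in g["targets"]}
    new = set(add) - existing
    groups[0]["targets"] = sorted(set(groups[0]["targets"]) | new)

    # Write to a temp file and rename, so Prometheus never reads a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
        f.write("\n")
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the non-root prometheus user must be able to read it
    os.replace(tmp, path)
    return groups
```

Usage would be something like `update_targets("/etc/prometheus/files.d/cerberus-cluster.json", add=["node-3:9100"], remove=["node-1:9100"])`. Prometheus re-reads file_sd files on change, so no restart should be needed. In the playbook itself the same thing could probably be done with a template task instead.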

ghost commented 1 year ago

The playbook also needs to be smart enough to detect when nodes drop off the cluster and remove them from Prometheus.
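One way to handle drop-offs is to treat it as a reconcile step: compare the targets currently in the Prometheus file against the list of nodes that are actually alive, and compute what to add and what to remove. A minimal sketch, assuming the list of live hostnames comes from somewhere else (inventory, scheduler, etc.) and that targets are written as `host:9100`. The function name `plan_changes` is hypothetical.

```python
def plan_changes(current_targets, live_hosts, port=9100):
    """Return (to_add, to_remove) so the target list matches the live hosts.

    current_targets: "host:port" strings currently in the file_sd file.
    live_hosts: hostnames that are in the cluster right now (source is an assumption).
    """
    desired = {f"{host}:{port}" for host in live_hosts}
    current = set(current_targets)
    to_add = sorted(desired - current)      # new nodes not yet monitored
    to_remove = sorted(current - desired)   # nodes that dropped off
    return to_add, to_remove


to_add, to_remove = plan_changes(["a:9100", "b:9100"], ["b", "c"])
print(to_add, to_remove)  # ['c:9100'] ['a:9100']
```

Running this on a schedule (or on every playbook run) would cover both directions, so node removal doesn't need a separate code path.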