RENCI / ctmd

MIT License

Nagios monitoring #155

Open dcarsey opened 4 years ago

dcarsey commented 4 years ago

Ticket #2319 with ACIS

Nagios monitoring will send alerts if something specific that is monitored goes down or out of established bounds.

Alerts should go to txscience@renci.org

krobasky commented 4 years ago

To close this ticket, find out:

a. how long it will be until nagios is monitoring heal-ctmd
b. which alerts are sent to txscience@lists
c. test that the monitoring works once it is deployed (solicit help from the tech team)

dcarsey commented 4 years ago

Sent list of questions to Marcin.

dcarsey commented 4 years ago

Per Marcin 11-20-19

Nagios monitoring is now enabled on both heal-ctmd and ctmd

All of the requested alerts are going to txscience@lists.renci.org. This includes:

• What “%-age full” triggers an alert for /var, /opt, /home, and /? We need to be CAREFUL testing this!
• What % for CPU usage (easy to test)?
• What % for memory?
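For reference while the actual thresholds are still unknown, here is a rough sketch of how a "%-age full" check could be written as a Nagios-style plugin. It relies only on the standard plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The 80/90 thresholds and the function names are assumptions for illustration; they are not the values ACIS configured.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn, crit):
    """Map a usage percentage to a Nagios state."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def disk_percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_mounts(mounts, warn=80.0, crit=90.0):
    """Check each mount point; return (highest state code seen, messages).

    warn/crit are hypothetical thresholds, not the real ACIS settings.
    A path that cannot be read is reported as UNKNOWN.
    """
    worst = OK
    messages = []
    for path in mounts:
        try:
            pct = disk_percent_used(path)
        except OSError as exc:
            state = UNKNOWN
            messages.append(f"{path}: UNKNOWN ({exc})")
        else:
            state = classify(pct, warn, crit)
            messages.append(f"{path}: {STATE_NAMES[state]} ({pct:.1f}% used)")
        worst = max(worst, state)
    return worst, messages


def main():
    """Print one status line and return the plugin exit code."""
    state, messages = check_mounts(["/var", "/opt", "/home", "/"])
    print(f"DISK {STATE_NAMES[state]} - " + "; ".join(messages))
    return state
```

To run this as a plugin, end the file with `sys.exit(main())` (after `import sys`). Testing `classify` with made-up percentages avoids having to actually fill a disk, which is the part we need to be careful about.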

krobasky commented 4 years ago

We'll need to get Marcin's responses to @xu-hao because I've asked him if he could do some benchmarking to "acceptance test" that the nagios rules are set up as above.

dcarsey commented 4 years ago

In progress. Trying to create alerts and seeing what the system does. Making adjustments as we go.

dcarsey commented 1 year ago

In anticipation of putting CTMD in a cluster, determine what monitoring we want ACIS to put in place. Reach out to Mac Chaffee through help@renci.org.

from Mac:

Did you still want me to set up some kind of Nagios alerting for the overnight jobs on the ctmd VMs? If so, I'll need the command to execute to check that. But cron's emailing feature or some other kind of alerting inside your app might be better since you'll have more control over it.
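One possible shape for "the command to execute" is a small freshness check: the overnight job touches a marker file when it succeeds, and the check goes CRITICAL if that file is missing or too old. This is only a sketch under assumptions. The marker-file approach, the path, and the 26-hour window are all hypothetical and not something the ctmd jobs are known to do today.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical path; the overnight job would need to touch this on success.
MARKER_PATH = "/var/run/ctmd/overnight_job.success"


def check_job_freshness(marker_path, max_age_hours=26.0, now=None):
    """Return (state, message) based on the marker file's modification time.

    max_age_hours is an assumed window (a daily job plus some slack).
    `now` can be passed in to test the check without waiting.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(marker_path)
    except OSError:
        return CRITICAL, f"CRITICAL - marker {marker_path} not found"
    age_hours = (now - mtime) / 3600.0
    if age_hours > max_age_hours:
        return CRITICAL, f"CRITICAL - last success {age_hours:.1f}h ago"
    return OK, f"OK - last success {age_hours:.1f}h ago"


def main():
    """Print one status line and return the plugin exit code."""
    state, message = check_job_freshness(MARKER_PATH)
    print(message)
    return state
```

To run it as a check command, end the file with `sys.exit(main())` (after `import sys`). The same function could also be called from a cron wrapper that emails on failure, if we go with Mac's suggestion instead.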