Open dcarsey opened 4 years ago
To close this ticket, find out:
a. how long it will be until nagios is monitoring heal-ctmd
b. what are the alerts sent to txscience@lists
c. test the monitoring works once it is deployed (solicit help from the tech team)
Sent list of questions to Marcin.
Per Marcin 11-20-19
Nagios monitoring is now enabled on both heal-ctmd and ctmd
All of the alerts requested are going to txscience@lists.renci.org that includes
Follow-up queries from Kimberly:
• What “%-age full” triggers an alert for /var /opt /home and /? – we need to be CAREFUL testing this!
• What % for cpu usage (easy to test)?
• What % for memory?
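One low-risk way to see where each mount stands before anyone tries to fill a disk on purpose is a read-only check like the sketch below. It only reports usage and changes nothing on the host. The 80/90 thresholds are placeholders, not the real Nagios values (those are exactly what we are asking Marcin for), and the function names are made up for illustration. The percentage may also differ slightly from what `df` or Nagios reports, since they can account for reserved blocks differently.

```python
import shutil

# Placeholder thresholds (assumption): replace with the values Marcin confirms.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(pct, warn=WARN_PCT, crit=CRIT_PCT):
    """Map a usage percentage to a Nagios-style state name."""
    if pct >= crit:
        return "CRITICAL"
    if pct >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # The mounts named in the questions above.
    for mount in ("/var", "/opt", "/home", "/"):
        try:
            pct = percent_used(mount)
        except OSError:
            print(f"{mount}: not present on this host")
            continue
        print(f"{mount}: {pct:.1f}% used -> {classify(pct)}")
```

Running this on heal-ctmd and ctmd would show how far each mount is from the thresholds, which helps size any test file before it is created.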
We'll need to get Marcin's responses to @xu-hao because I've asked him if he could do some benchmarking to "acceptance test" that the nagios rules are set up as above.
In progress. Trying to create alerts and seeing what the system does. Making adjustments as we go.
In anticipation of putting CTMD in a cluster, determine what monitoring we want ACIS to put in place. Reach out to Mac Chaffee through help@renci.org
from Mac:
Did you still want me to set up some kind of Nagios alerting for the overnight jobs on the ctmd VMs? If so, I'll need the command to execute to check that. But cron's emailing feature or some other kind of alerting inside your app might be better since you'll have more control over it.
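If we do go the Nagios route, the "command to execute" could be a small freshness check along the lines of the sketch below. It assumes the overnight job touches a marker file when it finishes successfully. That marker file, its path, the script name, and the 26/50 hour limits are all hypothetical, since nothing like this exists in CTMD yet as far as this ticket shows. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_freshness(marker_path, warn_hours=26.0, crit_hours=50.0, now=None):
    """Return (exit_code, message) based on the age of the marker file.

    warn_hours and crit_hours are illustrative defaults, not agreed values.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(marker_path)
    except OSError:
        return CRITICAL, f"CRITICAL - marker file {marker_path} not found"
    age_hours = (now - mtime) / 3600.0
    if age_hours >= crit_hours:
        return CRITICAL, f"CRITICAL - last successful run {age_hours:.1f}h ago"
    if age_hours >= warn_hours:
        return WARNING, f"WARNING - last successful run {age_hours:.1f}h ago"
    return OK, f"OK - last successful run {age_hours:.1f}h ago"


def main(argv):
    """Print the one-line status Nagios expects and return the exit code."""
    if len(argv) != 1:
        print("UNKNOWN - usage: check_overnight_job.py MARKER_PATH")
        return UNKNOWN
    code, message = check_freshness(argv[0])
    print(message)
    return code
```

To use it as a plugin, finish the script with `sys.exit(main(sys.argv[1:]))` under an `if __name__ == "__main__":` guard (after `import sys`) and give Mac the script plus the marker path. As Mac notes, cron's email feature or in-app alerting may still be the better fit, since the job then reports its own failures directly.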
Ticket #2319 with ACIS
Nagios monitoring will send alerts if something specific that is monitored goes down or out of established bounds.
Alerts should go to txscience@renci.org