It appears that mail is not set up on most compute nodes, so the MAILTO in crontab won't work (it manifestly does not work on Fox). It's also not clear to me what will happen, other than logging, if something goes wrong when sonar is run by systemd.
This is sort of a big deal: we need some type of auto-monitoring for the system. There are too many things that can go wrong; witness what happened when the nvidia-smi format changed.
I think we need two things:
[ ] When on-node infra reports failures, these failures should be embedded in the infra output somehow: as a field in the sonar data, a field in the sysinfo data, a field in the sacctd data. On ingest, these fields can be factored out and reported, or (probably better) a periodic job can scrub the recent records and perform the reporting.
[ ] When a node stops reporting (we should see this as missing heartbeats from sonar, but there's also the issue of e.g. a master node not reporting sacctd data), there should be a report that is a little more than just a colored line on a dashboard. Probably somebody should get mail. How to do this for missing heartbeats from a node is not completely obvious, since we don't want any kind of mail storm and nodes can be down for a long time.
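To make the two items above a bit more concrete, here is a minimal Python sketch of what the periodic job could look like. Everything in it is an assumption for illustration: the record shape (`host`, `time`, and a hypothetical `errors` field), the function names, and the "notify once per outage" policy are not taken from the existing sonar, sysinfo, or sacctd formats.

```python
from datetime import datetime, timedelta

# Hypothetical ingested record: {"host": str, "time": datetime, "errors": [str, ...]}
# The "errors" field is an assumed name, not an existing sonar/sysinfo/sacctd field.


def collect_failures(records, since):
    """Return {host: [error, ...]} for records at or after `since` that carry errors."""
    failures = {}
    for rec in records:
        if rec["time"] >= since and rec.get("errors"):
            failures.setdefault(rec["host"], []).extend(rec["errors"])
    return failures


def silent_hosts(last_seen, now, max_silence):
    """Return the sorted hosts whose latest heartbeat is older than `max_silence`."""
    return sorted(h for h, t in last_seen.items() if now - t > max_silence)


def hosts_to_notify(silent, already_notified):
    """Mail-storm guard: return only hosts not yet reported for the current outage.

    `already_notified` is a set that the caller persists between runs. Hosts
    that have started reporting again are dropped from it, so a later outage
    on the same host triggers a fresh notification.
    """
    new = [h for h in silent if h not in already_notified]
    already_notified.intersection_update(silent)  # forget hosts that recovered
    already_notified.update(new)                  # remember the newly reported ones
    return new
```

The job would run on a host where mail actually works, call `collect_failures` and `silent_hosts` on the recent records, and send one summary message per run. With this guard a node that stays down for weeks produces a single mail, at the cost of no reminders; whether that is the right trade-off is an open question.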