NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Error reporting backchannel + auto-monitoring #585

Open lars-t-hansen opened 3 weeks ago

lars-t-hansen commented 3 weeks ago

It appears that mail is not set up on most compute nodes, so the MAILTO setting in crontab won't work (it manifestly does not work on Fox). It's also not clear to me what will happen, other than logging, if something goes wrong when sonar is run by systemd.

This is sort of a big deal: we need some type of auto-monitoring for the system, because there are too many things that can go wrong. Witness what happened when the nvidia-smi format changed.

I think we need two things: