belforte closed this issue 2 years ago
Maybe we should use a different e-group as SENDTO? Check other crontabs...
Note from Stefano's email, with reference to a recent thread in MatterMost:

Currently we have these e-groups (skipping the common cms-service prefix):
1. crab-sysadmin = people who can log on to VMs and gain root access; the VOC is there
as a backup in case of emergency need while we are unavailable
2. crab-operators = communications among CRAB operators, like this one
3. crab3-htcondor-monitor = an ad-hoc e-group for "messages from crontabs"
4. I think we have already set up places where messages are sent to crab-operators,
though I can't recall them now.
So I propose:
a) keep crab-sysadmin as is and use it mainly to control access, not for communications;
limit/avoid sending mail to it
b) get rid of crab3-htcondor-monitor
c) send all alerts/alarms, including those from crontabs, to crab-operators
d) point c) above means redirecting mail from root@localhost to crab-operators
e) leave it up to the developers of each crontab to decide whether to use sendmail directly
(so they can tune the mail subject) or to print to stdout and rely on d) (simpler)
f) it would be useful if we could move the line
`sendmail::root_email: cms-service-crab-operators@cern.ch`
to a common puppet profile, instead of having it in the node-specific YAML
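Option e) boils down to the two patterns sketched below. This is only an illustration: the helper name, subject, and body are made up, and the message is composed but not actually delivered (piping it into `sendmail -t` would do that).

```shell
#!/bin/bash
# Pattern 1: call sendmail explicitly, so the crontab controls its own subject.
# This function only composes the RFC-822-style message; `sendmail -t` would
# read the recipient from the To: header and deliver it.
compose_alert() {
    local subject="$1" body="$2"
    printf 'To: root@localhost\nSubject: %s\n\n%s\n' "$subject" "$body"
}

# Pattern 2 (simpler): just echo the alert to stdout and let cron mail it to
# root@localhost, which d) then redirects to crab-operators; the subject is
# cron's default ("Cron <user@host> command") in that case.
compose_alert "CRAB alert: example" "something went wrong on $(hostname)"
```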
For d) and f), I opened a new MR here: https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/merge_requests/56
Previously, the alert scripts we have in puppet code sent mail directly to cms-service-crab-operators@cern.ch, but the e-group only accepted incoming emails from valid CERN accounts. That is why we never got emails from alerts.
After Stefano changed the e-group's posting policy to Everyone, email alerts started working. We got alerts from this issue: https://github.com/dmwm/CRABServer/issues/7159.
In https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/merge_requests/56, I changed all alert scripts to send email to root@localhost and set the address where we want to receive alerts in the hiera key `sendmail::root_email`.
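After the MR, the node-specific YAML no longer needs to carry the address; the common profile holds it instead. Roughly (a sketch; the exact hiera file layout depends on the hostgroup structure):

```yaml
# common hiera data for the hostgroup (sketch, not the literal MR diff)
sendmail::root_email: cms-service-crab-operators@cern.ch
```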
MR merged and alerts are working fine.
It looks like the current
/root/diskTest.sh
does not send mail when disk usage is over the threshold. To be investigated.
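For reference while investigating: a typical disk-threshold alarm looks like the sketch below. This is hypothetical; the actual contents, threshold, and partition used by /root/diskTest.sh are not known from this thread.

```shell
#!/bin/bash
# Hypothetical sketch of a disk-usage alarm; NOT the actual /root/diskTest.sh.
THRESHOLD=90   # percent used that triggers the alert
PARTITION=/

check_disk() {
    local used
    # df --output=pcent prints e.g. "Use%\n 42%"; keep only the digits
    used=$(df --output=pcent "$PARTITION" | tail -n 1 | tr -dc '0-9')
    if [ "$used" -ge "$THRESHOLD" ]; then
        # printing to stdout is enough: cron mails any output to
        # root@localhost, which the MR redirects to crab-operators
        echo "WARNING: $PARTITION is ${used}% full on $(hostname)"
    fi
}

check_disk
```

One thing worth checking is whether the script produces any output at all when the threshold is crossed, since cron only sends mail when there is output (and, if it calls sendmail itself, whether it still targets the old address).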