dmwm / CRABServer

diskTest.sh puppet crontab not working #7165

Closed belforte closed 2 years ago

belforte commented 2 years ago

It looks like the current /root/diskTest.sh does not send the mail when a disk is over threshold. To be investigated.

[root@crab-prod-tw01 ~]# bash -x ./diskTest.sh
+ SEND_TO=cms-service-crab-operators@cern.ch
+ THREASHOLD=10
+ CRITICAL_PATH=
+ isCritical=0
+ TMPFILE=/tmp/diskTest.tmp
+ LOGFILE=/var/log/diskTest.log
+ /bin/df -h
+ grep -vE 'Filesystem|tmpfs|cvmfs2|VolGroup'
++ date
+ echo 'Tue Mar 22 14:30:57 CET 2022 : DiskTest : START'
+ IFS=
+ read -r path
++ echo '/dev/vda1       160G   67G   94G  42% /'
++ awk '{print $(NF-1)}'
++ sed s/%//g
+ size=42
+ '[' 42 -ge 10 ']'
+ CRITICAL_PATH='\n/dev/vda1       160G   67G   94G  42% /'
+ isCritical=1
+ IFS=
+ read -r path
++ echo '/dev/vdb        504G  133G  346G  28% /data'
++ awk '{print $(NF-1)}'
++ sed s/%//g
+ size=28
+ '[' 28 -ge 10 ']'
+ CRITICAL_PATH='\n/dev/vda1       160G   67G   94G  42% /\n/dev/vdb        504G  133G  346G  28% /data'
+ isCritical=1
+ IFS=
+ read -r path
++ echo 'AFS             2.0T     0  2.0T   0% /afs'
++ awk '{print $(NF-1)}'
++ sed s/%//g
+ size=0
+ '[' 0 -ge 10 ']'
+ IFS=
+ read -r path
++ echo 'overlay         160G   67G   94G  42% /var/lib/docker/overlay2/b7e7e38565dc0cfd17f207ee2f4c5bd46f29581ba34d230eb42db6b941c2ecf0/merged'
++ awk '{print $(NF-1)}'
++ sed s/%//g
+ size=42
+ '[' 42 -ge 10 ']'
+ CRITICAL_PATH='\n/dev/vda1       160G   67G   94G  42% /\n/dev/vdb        504G  133G  346G  28% /data\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/b7e7e38565dc0cfd17f207ee2f4c5bd46f29581ba34d230eb42db6b941c2ecf0/merged'
+ isCritical=1
+ IFS=
+ read -r path
++ echo 'overlay         160G   67G   94G  42% /var/lib/docker/overlay2/9150f74047871b4abb0ea46f0643d5ab08d18fdf8402a8d9fcbc22f09e1cd100/merged'
++ awk '{print $(NF-1)}'
++ sed s/%//g
+ size=42
+ '[' 42 -ge 10 ']'
+ CRITICAL_PATH='\n/dev/vda1       160G   67G   94G  42% /\n/dev/vdb        504G  133G  346G  28% /data\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/b7e7e38565dc0cfd17f207ee2f4c5bd46f29581ba34d230eb42db6b941c2ecf0/merged\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/9150f74047871b4abb0ea46f0643d5ab08d18fdf8402a8d9fcbc22f09e1cd100/merged'
+ isCritical=1
+ IFS=
+ read -r path
+ '[' 1 -eq 1 ']'
+ echo -e 'CRITICAL_PATH@crab-prod-tw01.cern.ch:\n/dev/vda1       160G   67G   94G  42% /\n/dev/vdb        504G  133G  346G  28% /data\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/b7e7e38565dc0cfd17f207ee2f4c5bd46f29581ba34d230eb42db6b941c2ecf0/merged\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/9150f74047871b4abb0ea46f0643d5ab08d18fdf8402a8d9fcbc22f09e1cd100/merged'
+ SUBJECT='[Alert] DISK: Disk size critical at crab-prod-tw01.cern.ch from  diskTest.sh@crab-prod-tw01.cern.ch'
+ mail -s '[Alert] DISK: Disk size critical at crab-prod-tw01.cern.ch from  diskTest.sh@crab-prod-tw01.cern.ch' cms-service-crab-operators@cern.ch
++ date
+ echo -e 'Tue Mar 22 14:30:57 CET 2022 : CRITICAL_PATH@crab-prod-tw01.cern.ch:\n/dev/vda1       160G   67G   94G  42% /\n/dev/vdb        504G  133G  346G  28% /data\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/b7e7e38565dc0cfd17f207ee2f4c5bd46f29581ba34d230eb42db6b941c2ecf0/merged\noverlay         160G   67G   94G  42% /var/lib/docker/overlay2/9150f74047871b4abb0ea46f0643d5ab08d18fdf8402a8d9fcbc22f09e1cd100/merged'
++ date
+ echo 'Tue Mar 22 14:30:57 CET 2022 : DiskTest : END'
+ rm -f /tmp/diskTest.tmp
[root@crab-prod-tw01 ~]# 
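
The loop visible in the trace above can be sketched roughly as follows. This is a reconstruction from the `bash -x` output, not the actual /root/diskTest.sh: the function name `find_critical` is made up for illustration, and the logging and mail steps are left out. Note that the trace does reach the `mail -s ...` call, so the threshold check itself runs as intended.

```shell
#!/usr/bin/env bash
# Sketch of the threshold loop seen in the trace; NOT the real /root/diskTest.sh.

# Hypothetical helper: reads `df -h`-style lines on stdin and prints every
# line whose Use% column (second-to-last field) is >= the threshold in $1.
find_critical() {
    local threshold="$1" line pct
    while IFS= read -r line; do
        # Skip empty lines so awk never sees a line with zero fields.
        [ -n "$line" ] || continue
        pct=$(echo "$line" | awk '{print $(NF-1)}' | sed 's/%//g')
        # Skip lines where that field is not a plain number (e.g. "-").
        case "$pct" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$pct" -ge "$threshold" ]; then
            echo "$line"
        fi
    done
}

# Value copied from the trace (the script spells the variable THREASHOLD).
THREASHOLD=10

# Same grep filter as in the trace; guarded so the sketch also runs where df is missing.
if command -v df >/dev/null 2>&1; then
    critical=$(df -h | grep -vE 'Filesystem|tmpfs|cvmfs2|VolGroup' | find_critical "$THREASHOLD")
    if [ -n "$critical" ]; then
        printf 'CRITICAL_PATH@%s:\n%s\n' "$(uname -n)" "$critical"
    fi
fi
```
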
belforte commented 2 years ago

Maybe we should use a different e-group as SEND_TO? Check the other crontabs...

novicecpp commented 2 years ago

Note from Stefano email:

With reference to the recent thread in Mattermost:
currently we have these e-groups (skipping the common cms-service prefix):
1. crab-sysadmin = people who can log on to VMs and gain root access; the VOC is there
   as backup in case of emergency need when we are unavailable
2. crab-operators = communications among CRAB operators, like this one
3. crab3-htcondor-monitor = ad-hoc e-group for "messages from crontabs"
4. I think we already have setup places where messages are sent to crab-operators,
   though I can't recall now.

So I propose:

a) keep crab-sysadmin as is and use it mainly to control access, but not for communications,
   and limit/avoid sending mails to it
b) get rid of crab3-htcondor-monitor
c) send all alerts/alarms, including those from crontabs to crab-operators
d) the c) above means to redirect mails from root@localhost to crab-operators
e) up to "developers" to decide in each crontab whether it is better to use sendmail
   (so the mail subject can be tuned) or to print to stdout and rely on d) (simpler)
f) would be useful if we could move the line
  sendmail::root_email: cms-service-crab-operators@cern.ch
   to a common puppet profile, instead of having it in the node-specific YAML

For d) and f), I opened a new MR here: https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/merge_requests/56

novicecpp commented 2 years ago

Previously, the alert scripts in our puppet code sent mail to cms-service-crab-operators@cern.ch, but that e-group only accepted incoming email from valid CERN accounts. That is why we never got emails from the alerts.

After Stefano changed the e-group to accept mail from Everyone, email alerts started working. We got alerts from this issue: https://github.com/dmwm/CRABServer/issues/7159.

In https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/merge_requests/56, I changed all alert scripts to send email to root@localhost and set the address where we want to receive alerts in hiera via sendmail::root_email.

The MR is merged and alerts are working fine.
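
For anyone writing a new alert script, the pattern described above might look roughly like this. It is only a sketch: `send_alert` is a hypothetical helper, not code from the MR, and it assumes that root's mail on the node is forwarded to the address set in hiera, so the script never names an e-group itself. The `mail -s SUBJECT RECIPIENT` form is the one already visible in the diskTest.sh trace.

```shell
#!/usr/bin/env bash
# Sketch only: assumes mail for root is forwarded via the sendmail::root_email
# hiera setting, so the script needs no e-group address of its own.
SEND_TO=root@localhost

# Hypothetical helper: $1 is the mail subject, $2 the message body.
send_alert() {
    local subject="$1" body="$2"
    if command -v mail >/dev/null 2>&1; then
        printf '%s\n' "$body" | mail -s "$subject" "$SEND_TO"
    else
        # No mail client available: report on stderr and signal failure.
        printf 'mail not found, alert not sent: %s\n' "$subject" >&2
        return 1
    fi
}

# Example call (commented out so that running the sketch sends nothing):
# send_alert "[Alert] DISK: Disk size critical at $(uname -n)" "$critical"
```

Printing to stdout from the crontab and relying on the root redirect, as in option e) of the proposal, would avoid the `mail` call entirely, at the cost of not being able to tune the subject.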