CLIP-HPC / goslmailer

GoSlurmMailer - drop in replacement for default slurm MailProg. Delivers slurm job messages to various destinations.

Show used CPU walltime and other usage metrics as well when job fails #15

Open hh0rva1h opened 2 years ago

hh0rva1h commented 2 years ago

Sometimes my jobs run into a timeout as in the following Telegram notification:

image

To resume the job on resubmission, it would be nice to see the usage statistics, the same way as for successful jobs, so that I can adapt the resource requirements accordingly:

image

Just a comment regarding the last screenshot: the tip about lowering the number of CPU cores seems wrong here, as indicated by the 94.80% CPU usage (I also checked while the job was running, and htop sometimes showed over 3000% usage).

pja237 commented 2 years ago

Hey @hh0rva1h,

thanks for the feedback. The first one is a result of the slurm database not yet being updated with all the job statistics at the moment of the timeout, when gosl is invoked, so we get 0s for those (the db update happens sometime after we've finished sending the notification). We have no straightforward solution for that; we'll experiment to see if we can work around it. In case of TIMEOUTs, it's quite obvious: the timelimit needs to be adapted :smiley_cat:

The cpu hint comes into play if the (number of cpus requested)*runtime/2 > (the sum of the SystemCPU and UserCPU time used by the job).

In case of the job above:

sacct -j 41094400 -o cputime%30,ncpus,elapsed%30,totalcpu%30
                       CPUTime      NCPUS                        Elapsed                       TotalCPU 
------------------------------ ---------- ------------------------------ ------------------------------ 
                  128-01:19:00         30                     4-06:26:38                    34-13:46:51 
                  128-01:19:00         30                     4-06:26:38                    34-13:46:51 
                  128-01:19:00         30                     4-06:26:38                      00:00.001 

cputime = NCPUS * Elapsed. If cputime/2 > totalcpu (the sum of sys and user cpu), then the hint fires.

pja237 commented 1 year ago

One "radical" way of solving this "race" between goslmailer querying slurmdb for job statistics and slurmdb being updated on job completion might be to push this code to gobler: https://github.com/CLIP-HPC/goslmailer/blob/f8118e6997edf6f5da11a7dc554890e15501f642/cmd/goslmailer/goslmailer.go#L67-L74 and to change the architecture of the app so that all connectors do mandatory spooling and gobler is always in the message path, filling out gobs with missing data. Just something to think about...