CLIP-HPC / goslmailer

GoSlurmMailer - drop in replacement for default slurm MailProg. Delivers slurm job messages to various destinations.

Add support for SLURM < 21.08.x and improve error handling #12

Closed timeu closed 2 years ago

timeu commented 2 years ago

In older SLURM versions (< 21.08.x) the mail program is executed without any SLURM job environment variables set (#4). We fall back to parsing the subject line that is passed to the mail program to retrieve the jobid and other information such as job state and mail type. Additionally, the functions for retrieving job-related information via sacct and sstat now properly return error messages if the call fails. This fixes #7.

Additional end2end tests are added to test the above fixes.
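
For readers following along, the subject-line fallback described above could look roughly like the sketch below. It assumes a subject of the form `Slurm Job_id=<id> Name=<name> <mail type>, Run time <t>, <STATE>, ExitCode <n>`; the `SubjectInfo` type, the `ParseSubject` function and the regular expression are hypothetical names made up for illustration and are not taken from the goslmailer code. Job names containing spaces and array-job subjects would need extra handling.

```go
package main

import (
	"fmt"
	"regexp"
)

// SubjectInfo holds the fields recovered from the mail subject.
// Hypothetical type for illustration; not goslmailer's actual struct.
type SubjectInfo struct {
	JobID    string
	JobName  string
	MailType string // e.g. "Began", "Ended", "Failed"
	State    string // e.g. "COMPLETED"; empty when the subject carries no state
}

// subjectRE is an assumed pattern for the subject format described above.
// The trailing ", Run time ..., STATE" part is optional because not every
// mail type includes it.
var subjectRE = regexp.MustCompile(
	`Job_id=(\d+).* Name=(\S+) (\w+)(?:, Run time [^,]+, (\w+))?`)

// ParseSubject extracts job information from the subject line and returns
// an error instead of silently continuing when the line is not recognized.
func ParseSubject(subject string) (SubjectInfo, error) {
	m := subjectRE.FindStringSubmatch(subject)
	if m == nil {
		return SubjectInfo{}, fmt.Errorf("unrecognized subject line: %q", subject)
	}
	return SubjectInfo{JobID: m[1], JobName: m[2], MailType: m[3], State: m[4]}, nil
}

func main() {
	// Illustrative subject line, not captured from a real cluster.
	subject := "Slurm Job_id=1052 Name=test Ended, Run time 00:00:30, COMPLETED, ExitCode 0"
	info, err := ParseSubject(subject)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", info)
}
```

Returning an error for an unrecognized subject keeps the fallback in line with the improved error handling for sacct and sstat mentioned in the description.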

pja237 commented 2 years ago

Let's do cosmetics and move metrics to job_data https://github.com/CLIP-HPC/goslmailer/blob/96a78280e8d93896a25572cc609c07fff9252c30/internal/slurmjob/sacct.go#L13 https://github.com/CLIP-HPC/goslmailer/blob/96a78280e8d93896a25572cc609c07fff9252c30/internal/slurmjob/sacct.go#L41

here, so it's more readable https://github.com/CLIP-HPC/goslmailer/blob/more_e2e/internal/slurmjob/job_data.go

Also perhaps a line here with updated template guide https://github.com/CLIP-HPC/goslmailer/blob/more_e2e/templates/README.md

pja237 commented 2 years ago

@tdido Hey, we're wrapping things up in this PR. It's more or less ready to go; some cosmetics will still happen, but it's functional now. Would you be interested in helping us test-drive it before merge? If you find time, just check out this branch and run make build (skipping tests until we write up the requirements in the README).

tdido commented 2 years ago

For sure, I'll try it out and let you know.

tdido commented 2 years ago

OK, it's working!

Here's the output I'm getting:

<b>Job 219922 Ended</b>
<i>Created Tue, 31 May 2022 12:40:59 UTC</i>

<pre>------------------------------
Job Name         : wrap
Job ID           : 219922
User             : 
Partition        : 
Nodes Used       : 
Cores            : 2
Job state        : COMPLETED
Exit Code        : 
Submit           : 2022-05-31T12:40:56
Start            : 2022-05-31T12:40:56
End              : 2022-05-31T12:40:57
Res. Walltime    : 02:00:00
Used Walltime    : 
Used CPU time    : 00:00.003
% User (Comp)    : 33.33%
% System (I/O)   : 33.33%
Memory Requested : 4.2 GB
Max Memory Used  : 1.2 MB
Max Disk Write   : 0 B
Max Disk Read    : 0 B
------------------------------</pre>
<b>- TIP: Please consider lowering the ammount of requested memory in the future, your job has consumed less then half of the requested memory.</b>
<b>- TIP: Please consider lowering the amount of requested CPU cores in the future, your job has consumed less than half of requested CPU cores</b>
<b>- TIP: Your job was submitted with a walltime of 02:00:00 and finished in less half of the time, consider reducing the walltime and submit it to LONG QOS</b>

The only thing of note is that I can't get the "User", "Partition", "Nodes Used", and "Used Walltime" fields to populate (even when using the -p and -w arguments to sbatch).

timeu commented 2 years ago

@tdido : For User, Partition and Nodes, can you try this template instead of the default one: https://github.com/CLIP-HPC/goslmailer/blob/more_e2e/test_e2e/cases/test_05/conf/adaptive_card_template.json The used walltime should work though. Need to check why it doesn't.

timeu commented 2 years ago

@tdido : I forgot to replace the Used Walltime in the template with the one from the sacctmetrics struct: https://github.com/CLIP-HPC/goslmailer/blob/more_e2e/test_e2e/cases/test_05/conf/adaptive_card_template.json#L141

timeu commented 2 years ago

@tdido : Also I see that your mail isn't rendered as HTML. For mutt you need to drop the following config into /etc/Muttrc.local:

# Local configuration for Mutt.
set content_type="text/html"

pja237 commented 2 years ago

Instead of .Job.SlurmEnvironment.SLURM_JOB_USER to get the user, change the template to use the .Job.JobStats.User variable from SacctMetrics. In older versions SlurmEnvironment will contain only the jobid/arrayid vars; the rest will come from JobStats, which works the same in all versions:

https://github.com/CLIP-HPC/goslmailer/blob/96a78280e8d93896a25572cc609c07fff9252c30/internal/slurmjob/sacct.go#L15
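
As a rough illustration of the template change suggested above, here is a minimal sketch assuming the templates are rendered with Go's text/template. The struct types below are simplified, hypothetical stand-ins that only mirror the field paths named in this thread (`.Job.SlurmEnvironment` and `.Job.JobStats`); the real goslmailer types carry many more fields, and the values are made up.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// Hypothetical, simplified stand-ins for the data handed to a template.
type SlurmEnvironment struct {
	SLURM_JOB_ID   string
	SLURM_JOB_USER string // stays empty when SLURM sets no job environment
}

type JobStats struct {
	User string // filled from sacct, independent of the SLURM version
}

type Job struct {
	SlurmEnvironment SlurmEnvironment
	JobStats         JobStats
}

type TemplateData struct {
	Job Job
}

// The template reads the user from JobStats rather than SlurmEnvironment.
const tmplText = "Job ID : {{ .Job.SlurmEnvironment.SLURM_JOB_ID }}\n" +
	"User   : {{ .Job.JobStats.User }}\n"

// Render executes the template against the given data.
func Render(data TemplateData) (string, error) {
	t, err := template.New("mail").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Simulates an older SLURM: only the job id is known from the
	// environment side, the user comes from the sacct-based stats.
	data := TemplateData{Job: Job{
		SlurmEnvironment: SlurmEnvironment{SLURM_JOB_ID: "219922"},
		JobStats:         JobStats{User: "alice"},
	}}
	out, err := Render(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

With `.Job.SlurmEnvironment.SLURM_JOB_USER` in the template instead, the same data would render an empty User field, which matches the blank fields in the output reported earlier in the thread.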

tdido commented 2 years ago

Cheers lads, I had forgotten about the templating concept :P All looking great now. Thanks!

pja237 commented 2 years ago

Great, then we wrap this PR up, merge and publish a new release tomorrow. Thanks for the help :+1: :1st_place_medal: