Closed tdido closed 2 years ago
OK, this one is tricky. Let's try with these:
goslmailer:2022/05/16 16:13:12.850509 getjobcontext.go:127: slurmjob.SlurmEnvironment{SLURM_ARRAY_JOB_ID:"1052446", SLURM_ARRAY_TASK_COUNT:"2", SLURM_ARRAY_TASK_ID:"2", SLURM_ARRAY_TASK_MAX:"2", SLURM_ARRAY_TASK_MIN:"1", SLURM_ARRAY_TASK_STEP:"1", SLURM_CLUSTER_NAME:"clip", SLURM_JOB_ACCOUNT:"hpc", SLURM_JOB_DERIVED_EC:"0", SLURM_JOB_EXIT_CODE:"0", SLURM_JOB_EXIT_CODE2:"0:0", SLURM_JOB_EXIT_CODE_MAX:"0", SLURM_JOB_EXIT_CODE_MIN:"0", SLURM_JOB_GID:"1999", SLURM_JOB_GROUP:"is.grp", SLURM_JOBID:"1052446", SLURM_JOB_ID:"1052446", SLURM_JOB_MAIL_TYPE:"Ended", SLURM_JOB_NAME:"wrap", SLURM_JOB_NODELIST:"stg-c2-0", SLURM_JOB_PARTITION:"c", SLURM_JOB_QUEUED_TIME:"", SLURM_JOB_RUN_TIME:"00:00:20", SLURM_JOB_STATE:"COMPLETED", SLURM_JOB_STDIN:"/dev/null", SLURM_JOB_UID:"10303", SLURM_JOB_USER:"uemit.seren", SLURM_JOB_WORK_DIR:"/users/uemit.seren"}
You could try pointing MailProg in the Slurm config at a shell script that just does env >> /tmp/jobenv.txt, so we can see whether the vars exist in the context of the goslmailer execution.
Which Slurm version are you running?
Could you please describe your slurm-gcp setup? Is it perhaps a federation/multi-cluster setup?
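If a shell script is awkward to deploy as MailProg, the same check can be done with a small Go binary. This is only a sketch: the output path is taken from the suggestion above, the helper name filterEnv is made up for illustration, and logging the arguments assumes that Slurm passes the mail subject and recipient on the command line.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
	"time"
)

// filterEnv returns the "KEY=value" entries of environ whose key starts
// with prefix, sorted for stable output. An empty prefix keeps everything.
func filterEnv(environ []string, prefix string) []string {
	var out []string
	for _, kv := range environ {
		if strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Output path as suggested in the comment above; adjust as needed.
	f, err := os.OpenFile("/tmp/jobenv.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Record when we were called and with which arguments.
	fmt.Fprintf(f, "--- %s args=%q\n", time.Now().Format(time.RFC3339), os.Args[1:])

	// Dump the whole environment, like `env`; pass "SLURM_" to narrow it.
	for _, kv := range filterEnv(os.Environ(), "") {
		fmt.Fprintln(f, kv)
	}
}
```

Build it, point MailProg at the resulting binary, submit a job with mail notifications enabled, and then inspect /tmp/jobenv.txt on the node where slurmctld runs.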
1.
goslmailer:2022/05/19 08:43:28.307456 getjobcontext.go:127: slurmjob.SlurmEnvironment{SLURM_ARRAY_JOB_ID:"",
SLURM_ARRAY_TASK_COUNT:"",
SLURM_ARRAY_TASK_ID:"",
SLURM_ARRAY_TASK_MAX:"",
SLURM_ARRAY_TASK_MIN:"",
SLURM_ARRAY_TASK_STEP:"",
SLURM_CLUSTER_NAME:"tropicoso",
SLURM_JOB_ACCOUNT:"",
SLURM_JOB_DERIVED_EC:"",
SLURM_JOB_EXIT_CODE:"",
SLURM_JOB_EXIT_CODE2:"",
SLURM_JOB_EXIT_CODE_MAX:"",
SLURM_JOB_EXIT_CODE_MIN:"",
SLURM_JOB_GID:"",
SLURM_JOB_GROUP:"",
SLURM_JOBID:"",
SLURM_JOB_ID:"",
SLURM_JOB_MAIL_TYPE:"",
SLURM_JOB_NAME:"",
SLURM_JOB_NODELIST:"",
SLURM_JOB_PARTITION:"",
SLURM_JOB_QUEUED_TIME:"",
SLURM_JOB_RUN_TIME:"",
SLURM_JOB_STATE:"",
SLURM_JOB_STDIN:"",
SLURM_JOB_UID:"",
SLURM_JOB_USER:"",
SLURM_JOB_WORK_DIR:""}
I can't get a shell script to run for some reason, but I used go code similar to the one in the first post, in this case printing $SLURM_JOBID, $SLURM_JOB_ID, $USER, and $SLURM_CLUSTER_NAME. Only the cluster name seems to be defined (and not $USER to my surprise), just like you can see in the log output in point 1.
slurm 20.11.7
It's a pretty much default install, with a login and a controller that are permanent, and compute nodes that get spun up on demand. Homes mounted on Google Filestore.
So it would seem that most env variables are missing from the MailProg context, so it must be some issue with Slurm. I guess I'll start digging there. Any tips appreciated ;)
What I can tell you is pretty limited, since we don't have a gcp setup to troubleshoot this, but a couple of things come to mind:
If anything else comes to mind, I'll write it down. Otherwise, perhaps you could ask about this situation on the slurm mailing list or in slurm-gcp? If you do, please point me to the link, I'd like to follow that.
Thanks for the pointers. I'll start by spinning up a new cluster with the latest version of Slurm to test, so we remove some variables from the mix. I'll also try it with our old-fashioned local cluster.
I'll post my findings here.
@tdido: Looking through the SLURM documentation about MailProg, it seems that the environment variables are only available in the 21.x release version (compare: https://slurm.schedmd.com/slurm.conf.html#OPT_MailProg vs https://slurm.schedmd.com/archive/slurm-20.11.7/slurm.conf.html). We will try to implement a fallback of parsing the jobid from the subject line. However, it would be great if you could verify this by testing on a 21.x version of SLURM.
Ahhh, good catch, thanks for the links. It then makes sense that it also didn't work in our local cluster, also running v20.x.
I still have to get around to setting up a slurm-gcp instance with v21, will try to do it as soon as I can.
@tdido: Thanks! Digging through the SLURM release notes (https://github.com/SchedMD/slurm/blob/master/NEWS#L839), it looks like the additional SLURM job environment information was introduced in SLURM 21.08.0. Considering that we rely heavily on those environment variables in both the templates and the logic that retrieves the job metrics, we need to discuss how much effort it would be to implement a fallback for SLURM < 21.08.x. I will update the README.md to highlight that, in its current state, goslmailer requires SLURM >= 21.08.x.
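To make the version requirement concrete, here is a rough sketch of how a version string could be checked against the 21.08 cutoff. It assumes the input looks like "slurm 20.11.7" (the form printed by sinfo --version) or a bare "21.08.0"; the function names are hypothetical and not part of goslmailer.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from strings such
// as "slurm 20.11.7" or "21.08.0".
func parseMajorMinor(s string) (major, minor int, err error) {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "slurm ")
	parts := strings.Split(s, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unrecognized version %q", s)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// hasJobEnv reports whether the given version is at least 21.08, the
// release that, per the NEWS entry linked above, added the job
// environment variables for MailProg.
func hasJobEnv(version string) (bool, error) {
	major, minor, err := parseMajorMinor(version)
	if err != nil {
		return false, err
	}
	return major > 21 || (major == 21 && minor >= 8), nil
}

func main() {
	for _, v := range []string{"slurm 20.11.7", "slurm 21.08.0"} {
		ok, err := hasJobEnv(v)
		fmt.Println(v, ok, err)
	}
}
```

In practice it may be simpler to skip the version check and just test whether SLURM_JOB_ID is empty at runtime, since that is the symptom that matters.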
Yeah, I guess it makes sense. I may try my luck with a patch if I find the time, because it's unlikely that I'll be able to migrate all our clusters in the near future and I'm really looking forward to working on a Matrix plugin.
In any case, I'll close this for now since the issue itself is clarified. I'll create new issues with any further things of note.
Thanks!
@tdido: FYI: We decided to look into implementing a workaround for SLURM < 21.08.x, so I will keep the ticket open for now.
To summarize: it looks like Slurm < 21.08 doesn't populate the SLURM_* environment variables for MailProg, which makes gosl unusable with older versions. We'll implement a fallback code path (jobid from the subject line, then sacct) to support older versions of Slurm.
While we're at it, let's also cover #7.
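A minimal sketch of what such a fallback could look like, not the actual goslmailer implementation. It assumes the subject contains a token of the form Job_id=<number> and that the subject arrives in the command-line arguments. The function names and the chosen sacct fields are illustrative only.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

// jobIDRe matches the numeric job id in a subject such as
// "Slurm Job_id=1052446 Name=wrap Ended, ...". Anything following the
// digits (for example an array task suffix) is ignored.
var jobIDRe = regexp.MustCompile(`Job_id=(\d+)`)

// jobIDFromSubject extracts the job id from a mail subject line.
func jobIDFromSubject(subject string) (string, bool) {
	m := jobIDRe.FindStringSubmatch(subject)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// resolveJobID prefers the environment (Slurm >= 21.08) and falls back
// to the subject line (older versions).
func resolveJobID(env func(string) string, subject string) (string, bool) {
	for _, key := range []string{"SLURM_JOB_ID", "SLURM_JOBID"} {
		if v := env(key); v != "" {
			return v, true
		}
	}
	return jobIDFromSubject(subject)
}

// sacctArgs builds arguments for querying job details once the id is
// known. The field list is an assumption; pick whatever the templates need.
func sacctArgs(jobID string) []string {
	return []string{"-j", jobID, "-n", "-P", "-o", "JobID,JobName,State,Elapsed,ExitCode"}
}

func main() {
	// Search all arguments, so the position of the subject does not matter.
	subject := strings.Join(os.Args[1:], " ")
	id, ok := resolveJobID(os.Getenv, subject)
	if !ok {
		fmt.Fprintln(os.Stderr, "no job id in environment or subject")
		os.Exit(1)
	}
	fmt.Println("sacct", strings.Join(sacctArgs(id), " "))
}
```

Passing the environment lookup as a function keeps resolveJobID easy to test without touching the real environment.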
After temporarily fixing the paths as described in #3, I've been able to generate an email from a job submission. The data fields, however, are empty.
I've traced this to goslmailer not being able to fetch the jobid from $SLURM_JOBID.
I've first written some go code to test if SLURM is setting the variables:
Running this binary through sbatch shows the jobid twice, so both variables seem to be defined.
I've then added the same code to the main function in goslmailer.go, and none of the variables contain any data.
If you could give me any pointers as to how to continue debugging the problem I'd be extremely grateful.
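For anyone wanting to repeat the check described above, a sketch along these lines should do. The original test snippet is not reproduced in this thread, so the structure and the helper name describeVar are assumptions. Unlike a plain os.Getenv, it distinguishes a variable that is unset from one that is set but empty.

```go
package main

import (
	"fmt"
	"os"
)

// describeVar reports whether a variable is unset, set but empty, or set
// to a value. lookup has the same signature as os.LookupEnv.
func describeVar(name string, lookup func(string) (string, bool)) string {
	v, ok := lookup(name)
	switch {
	case !ok:
		return name + ": unset"
	case v == "":
		return name + ": set but empty"
	default:
		return name + "=" + v
	}
}

func main() {
	// The variables discussed in this issue.
	for _, name := range []string{"SLURM_JOBID", "SLURM_JOB_ID", "USER", "SLURM_CLUSTER_NAME"} {
		fmt.Println(describeVar(name, os.LookupEnv))
	}
}
```

Running the same binary once through sbatch and once as MailProg makes the difference between the two execution contexts visible.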