CLIP-HPC / goslmailer

GoSlurmMailer - drop in replacement for default slurm MailProg. Delivers slurm job messages to various destinations.

Failure to get jobid from environment (and therefore job info) #4

Closed tdido closed 2 years ago

tdido commented 2 years ago

After temporarily fixing the paths as described in #3, I've been able to generate an email from a job submission. The data fields, however, are empty.

I've traced this to goslmailer not being able to fetch the jobid from $SLURM_JOBID.

I first wrote some Go code to test whether SLURM is setting the variables:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Slurm may expose the job ID under either name, so print both.
    jid := os.Getenv("SLURM_JOBID")
    jid2 := os.Getenv("SLURM_JOB_ID")
    fmt.Println("jids", jid, jid2)
}

Running this binary through sbatch shows the jobid twice, so both variables seem to be defined.

I then added the same code to the main function in goslmailer.go, and there neither variable contains any data.

If you could give me any pointers as to how to continue debugging the problem I'd be extremely grateful.

pja237 commented 2 years ago

OK, this one is tricky, let's try these:

  1. Can you check the logs? There should be a line similar to this:
goslmailer:2022/05/16 16:13:12.850509 getjobcontext.go:127: slurmjob.SlurmEnvironment{SLURM_ARRAY_JOB_ID:"1052446", SLURM_ARRAY_TASK_COUNT:"2", SLURM_ARRAY_TASK_ID:"2", SLURM_ARRAY_TASK_MAX:"2", SLURM_ARRAY_TASK_MIN:"1", SLURM_ARRAY_TASK_STEP:"1", SLURM_CLUSTER_NAME:"clip", SLURM_JOB_ACCOUNT:"hpc", SLURM_JOB_DERIVED_EC:"0", SLURM_JOB_EXIT_CODE:"0", SLURM_JOB_EXIT_CODE2:"0:0", SLURM_JOB_EXIT_CODE_MAX:"0", SLURM_JOB_EXIT_CODE_MIN:"0", SLURM_JOB_GID:"1999", SLURM_JOB_GROUP:"is.grp", SLURM_JOBID:"1052446", SLURM_JOB_ID:"1052446", SLURM_JOB_MAIL_TYPE:"Ended", SLURM_JOB_NAME:"wrap", SLURM_JOB_NODELIST:"stg-c2-0", SLURM_JOB_PARTITION:"c", SLURM_JOB_QUEUED_TIME:"", SLURM_JOB_RUN_TIME:"00:00:20", SLURM_JOB_STATE:"COMPLETED", SLURM_JOB_STDIN:"/dev/null", SLURM_JOB_UID:"10303", SLURM_JOB_USER:"uemit.seren", SLURM_JOB_WORK_DIR:"/users/uemit.seren"}
  2. Could you try pointing MailProg in the Slurm config at a shell script that just runs env >> /tmp/jobenv.txt, so we can see whether the variables exist in the context in which goslmailer is executed?

  3. Which Slurm version are you running?

  4. Could you please describe your slurm-gcp setup? Is it perhaps a federation/multi-cluster setup?
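
If a shell script is awkward to wire in as MailProg, the same check can be sketched as a small Go binary. This is only an illustration: the dump path reuses the /tmp/jobenv.txt suggestion above, and the helper name slurmVars is made up for this example. It appends the arguments it was called with and every SLURM_* variable it can see.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// slurmVars returns the sorted KEY=VALUE entries from environ whose key
// starts with "SLURM_". (Hypothetical helper, not part of goslmailer.)
func slurmVars(environ []string) []string {
	var out []string
	for _, kv := range environ {
		if strings.HasPrefix(kv, "SLURM_") {
			out = append(out, kv)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Assumed dump location, mirroring the /tmp/jobenv.txt suggestion.
	f, err := os.OpenFile("/tmp/jobenv.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open dump file:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Record the command-line arguments MailProg was invoked with.
	fmt.Fprintf(f, "args: %q\n", os.Args[1:])

	// Record every SLURM_* variable visible to this process.
	for _, kv := range slurmVars(os.Environ()) {
		fmt.Fprintln(f, kv)
	}
}
```

Comparing the resulting file with the output of the same binary run through sbatch should show which variables are missing in the MailProg context.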

tdido commented 2 years ago

1.

goslmailer:2022/05/19 08:43:28.307456 getjobcontext.go:127: slurmjob.SlurmEnvironment{SLURM_ARRAY_JOB_ID:"",
SLURM_ARRAY_TASK_COUNT:"",
SLURM_ARRAY_TASK_ID:"",
SLURM_ARRAY_TASK_MAX:"",
SLURM_ARRAY_TASK_MIN:"",
SLURM_ARRAY_TASK_STEP:"",
SLURM_CLUSTER_NAME:"tropicoso",
SLURM_JOB_ACCOUNT:"",
SLURM_JOB_DERIVED_EC:"",
SLURM_JOB_EXIT_CODE:"",
SLURM_JOB_EXIT_CODE2:"",
SLURM_JOB_EXIT_CODE_MAX:"",
SLURM_JOB_EXIT_CODE_MIN:"",
SLURM_JOB_GID:"",
SLURM_JOB_GROUP:"",
SLURM_JOBID:"",
SLURM_JOB_ID:"",
SLURM_JOB_MAIL_TYPE:"",
SLURM_JOB_NAME:"",
SLURM_JOB_NODELIST:"",
SLURM_JOB_PARTITION:"",
SLURM_JOB_QUEUED_TIME:"",
SLURM_JOB_RUN_TIME:"",
SLURM_JOB_STATE:"",
SLURM_JOB_STDIN:"",
SLURM_JOB_UID:"",
SLURM_JOB_USER:"",
SLURM_JOB_WORK_DIR:""}
  2. I can't get a shell script to run for some reason, so I used Go code similar to the one in the first post, in this case printing $SLURM_JOBID, $SLURM_JOB_ID, $USER, and $SLURM_CLUSTER_NAME. Only the cluster name seems to be defined (and, to my surprise, not $USER), just as you can see in the log output in point 1.

  3. slurm 20.11.7

  4. It's a pretty much default install, with a login node and a controller that are permanent, and compute nodes that get spun up on demand. Homes are mounted on Google Filestore.

So it would seem that most env variables are missing from the MailProg context, and that this must therefore be an issue with Slurm. I guess I'll start digging there. Any tips appreciated ;)

pja237 commented 2 years ago

What I can tell you is pretty limited, since we don't have a GCP setup to troubleshoot this, but a couple of things come to mind:

If anything else comes to mind, I'll write it down. Otherwise, perhaps you could ask about this situation on the slurm mailing list or in slurm-gcp? If you do, please point me to the link, I'd like to follow that.

tdido commented 2 years ago

Thanks for the pointers. I'll start by spinning up a new cluster with the latest version of Slurm to test, so we remove some variables from the mix. I'll also try it with our old-fashioned local cluster.

I'll post my findings here.

timeu commented 2 years ago

@tdido: Looking through the SLURM documentation about MailProg, it seems that the environment variables are only available in the 21.x releases (compare: https://slurm.schedmd.com/slurm.conf.html#OPT_MailProg vs https://slurm.schedmd.com/archive/slurm-20.11.7/slurm.conf.html). We will try to implement a fallback that parses the jobid from the subject line; however, it would be great if you could verify this by testing on a 21.x version of SLURM.
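
A minimal sketch of what such a subject-line fallback could look like. It assumes MailProg receives a subject containing "Job_id=<digits>"; that format, the sample subject, and the function names are assumptions for illustration, not goslmailer's actual code.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// jobIDPattern assumes the mail subject contains "Job_id=<digits>".
// The exact subject format is an assumption and may differ between
// Slurm versions (array jobs in particular).
var jobIDPattern = regexp.MustCompile(`Job_id=(\d+)`)

// jobIDFromSubject returns the job ID found in subject, or false if
// the subject does not match the assumed pattern.
func jobIDFromSubject(subject string) (string, bool) {
	m := jobIDPattern.FindStringSubmatch(subject)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// resolveJobID prefers the environment variables and only falls back
// to the subject line when neither of them is set.
func resolveJobID(getenv func(string) string, subject string) (string, bool) {
	for _, name := range []string{"SLURM_JOB_ID", "SLURM_JOBID"} {
		if v := getenv(name); v != "" {
			return v, true
		}
	}
	return jobIDFromSubject(subject)
}

func main() {
	// Illustrative subject only; the values are borrowed from the log
	// line earlier in this thread.
	subject := "Slurm Job_id=1052446 Name=wrap Ended, Run time 00:00:20, COMPLETED, ExitCode 0"
	id, ok := resolveJobID(os.Getenv, subject)
	fmt.Println(id, ok)
}
```

Passing the environment lookup in as a function keeps the fallback order easy to exercise without touching the real environment.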

tdido commented 2 years ago

Ahhh, good catch, thanks for the links. It then makes sense that it also didn't work on our local cluster, which is also running v20.x.

I still have to get around to setting up a slurm-gcp instance with v21; I will try to do it as soon as I can.

timeu commented 2 years ago

@tdido: Thanks! Digging through the SLURM release notes, I found this line (https://github.com/SchedMD/slurm/blob/master/NEWS#L839): it looks like the additional SLURM job environment information was introduced in SLURM 21.08.0. Considering that we rely heavily on those environment variables, both in the templates and in the logic that retrieves the job metrics, we need to discuss how much effort it would be to implement a fallback for SLURM < 21.08.x. I will update the README.md to highlight that, in its current state, goslmailer requires SLURM >= 21.08.x.
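
For illustration, the 21.08 requirement can be expressed as a small version check. This is a sketch only: the function name hasMailProgEnv is hypothetical, and it assumes the version is available as a dotted string such as "20.11.7".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hasMailProgEnv reports whether a Slurm version string such as
// "20.11.7" is at least 21.08, the release this thread identifies as
// the first to pass job information to MailProg via the environment.
// (Hypothetical helper, not part of goslmailer.)
func hasMailProgEnv(version string) (bool, error) {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("bad major version in %q: %w", version, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, fmt.Errorf("bad minor version in %q: %w", version, err)
	}
	return major > 21 || (major == 21 && minor >= 8), nil
}

func main() {
	for _, v := range []string{"20.11.7", "21.08.0"} {
		ok, err := hasMailProgEnv(v)
		fmt.Println(v, ok, err)
	}
}
```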

tdido commented 2 years ago

Yeah, I guess it makes sense. I may try my luck with a patch if I find the time, because it's unlikely that I'll be able to migrate all our clusters in the near future and I'm really looking forward to working on a Matrix plugin.

In any case, I'll close this for now since the issue itself is clarified. I'll create new issues with any further things of note.

Thanks!

timeu commented 2 years ago

@tdido: FYI: We decided to look into implementing a workaround for SLURM < 21.08.x, so I will keep the ticket open for now.

pja237 commented 2 years ago

Just a summary of this topic: since it looks like slurm < 21.08 doesn't populate the SLURM_* environment variables, which makes gosl unusable with older versions, we'll implement a fallback code path (jobid from subject, then sacct) to support older versions of slurm.

While we're at it, let's also cover #7.
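
The sacct half of the fallback path summarized above could be sketched roughly as follows. The field selection, function names, and sample line are assumptions for illustration (the sample values are borrowed from the log line earlier in this thread); the actual goslmailer implementation may look quite different.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// sacctFields is an illustrative selection of sacct format fields.
var sacctFields = []string{"JobID", "JobName", "User", "Partition", "State", "Elapsed", "ExitCode"}

// parseSacctLine maps one pipe-delimited line of `sacct -P` output to
// the names in sacctFields. It assumes no value contains a '|'.
func parseSacctLine(line string) (map[string]string, error) {
	values := strings.Split(strings.TrimSpace(line), "|")
	if len(values) != len(sacctFields) {
		return nil, fmt.Errorf("expected %d fields, got %d", len(sacctFields), len(values))
	}
	info := make(map[string]string, len(values))
	for i, name := range sacctFields {
		info[name] = values[i]
	}
	return info, nil
}

// queryJob runs sacct for a single job: -n drops the header, -P selects
// pipe-delimited output, and -X restricts it to the job allocation.
func queryJob(jobID string) (map[string]string, error) {
	out, err := exec.Command("sacct", "-j", jobID, "-n", "-P", "-X",
		"--format="+strings.Join(sacctFields, ",")).Output()
	if err != nil {
		return nil, fmt.Errorf("sacct: %w", err)
	}
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	return parseSacctLine(lines[0])
}

func main() {
	// Parse a sample line so the sketch runs without a Slurm install;
	// on a real cluster, queryJob would be called with the job ID instead.
	info, err := parseSacctLine("1052446|wrap|uemit.seren|c|COMPLETED|00:00:20|0:0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(info["State"], info["Elapsed"])
}
```

Keeping the parsing separate from the sacct call means the mapping can be checked on a machine without Slurm installed.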