JuliaParallel / ClusterManagers.jl


Slurm Job creation crashes on master because "SLURM_JOB_ID" not set #127

Closed igor-krawczuk closed 4 years ago

igor-krawczuk commented 4 years ago

In order to get around the replace bug mentioned in #118, I installed directly from master, but this introduced another bug: the .out name change introduced in #123 causes job creation to crash, because my cluster does not set SLURM_JOB_ID (or any other Slurm variable) at the point in time where the code expects it.

Removing the jobID solved the problem.

grahamas commented 4 years ago

I'm having the same problem. That code runs on the calling machine, which isn't a slurm node, so SLURM_JOB_ID isn't set.

vchuravy commented 4 years ago

Probably that code needs to check if that environment variable is available and only then load it.

grahamas commented 4 years ago

> Probably that code needs to check if that environment variable is available and only then load it.

I don't think so. The next line creates the srun command and uses jobID to set the name of the output file. It seems the code wants to either a) use %j to put the job ID into the name of the output file, or b) give the output file a known name so that we can find it. However, I don't think we can do both. Possibly, after we run the command, we could figure out the ID and then know the name of the output file, but I'm not sure how to do that.
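A minimal sketch of the fallback being discussed, using a hypothetical helper (`output_name` is not part of ClusterManagers.jl): read SLURM_JOB_ID when it is set, and otherwise fall back to Slurm's `%j` placeholder so that srun itself substitutes the job ID into the filename.

```julia
# Hypothetical sketch, not the actual ClusterManagers.jl code.
# If we are already inside a Slurm job, use the real job ID;
# otherwise emit srun's "%j" placeholder, which srun expands per job.
function output_name(base::AbstractString)
    jobid = get(ENV, "SLURM_JOB_ID", "%j")
    return "$(base)-$(jobid).out"
end
```

Called on a machine outside any Slurm job, `output_name("job")` yields `"job-%j.out"`, so srun can still produce a uniquely named output file; the trade-off is exactly the one noted above, since the caller then does not know the final filename in advance.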

(I'm happy to do the coding and make a PR if someone can tell me what it's supposed to do; for now I'll get something that works and we can see what you think)

mkschleg commented 4 years ago

We can probably do a quick check for job files in the directory we want to save to (such a check already partly exists), instead of deleting all the files:

All of these should be pretty straightforward to add. You could also add a flag that turns the job files and the job_id functionality on and off.

jishnub commented 4 years ago

On my cluster I have noticed that SLURM_JOB_ID is set after launching a job using srun, as might be expected from the name of the variable. A workaround for the moment is to submit an interactive job, run Julia on the compute node, and add workers using ClusterManagers. In any case, we should not expect the variable to be set before an srun command is called.
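The interactive workaround above might look like the following; the salloc/srun commands are illustrative (shown as comments, since they require a Slurm cluster), and the final lines just demonstrate that SLURM_JOB_ID is absent outside any job:

```shell
# Hypothetical interactive workflow on a Slurm cluster:
#   salloc --nodes=1 --time=01:00:00   # request an interactive allocation
#   srun --pty julia                   # start Julia on a compute node
# Inside that Julia session SLURM_JOB_ID is set, so ClusterManagers can read it.
# Outside any Slurm job the variable is simply absent:
unset SLURM_JOB_ID
echo "${SLURM_JOB_ID:-unset}"
```

Running the last two lines on a login or local machine prints `unset`, which is exactly the situation that made the jobID lookup crash.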

KajWiik commented 4 years ago

The scripts in https://github.com/magerton/julia-slurm-example work OK; maybe this should be mentioned on the front page and in the documentation?