Closed: igor-krawczuk closed this 4 years ago
I'm having the same problem. That code runs on the calling machine, which isn't a slurm node, so SLURM_JOB_ID isn't set.
Probably that code needs to check if that environment variable is available and only then load it.
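As a rough illustration of that suggestion, here is a minimal sketch of reading the variable only when it is present. It is written in Python purely for illustration (in Julia the equivalent lookup would be `get(ENV, "SLURM_JOB_ID", nothing)`). The function names and the `job-<id>.out` naming scheme are hypothetical and not taken from the package.

```python
import os


def slurm_job_id(environ=os.environ):
    """Return the Slurm job ID as a string, or None when not inside a Slurm job."""
    value = environ.get("SLURM_JOB_ID", "").strip()
    return value or None


def output_file_name(environ=os.environ):
    """Pick an output file name that only uses the job ID when it is known."""
    job_id = slurm_job_id(environ)
    # Hypothetical naming scheme, for illustration only.
    return f"job-{job_id}.out" if job_id else "job.out"
```

On a login node, where the variable is unset, this falls back to a fixed name instead of raising an error.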
> Probably that code needs to check if that environment variable is available and only then load it.
I don't think so. The next line creates the `srun` command and uses `jobID` to set the name of the output file. What it seems like it wants to do is either a) use `%j` to put the job ID into the name of the output file, or b) give the output file a known name so that we can find it. However, I don't think we can do both. Possibly after we run the command, we can figure out the ID and then know the name of the output file, but I'm not sure how to do that.
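One possible way to get both, sketched here in Python under stated assumptions: let `srun` expand `%j` in the output path, then recover the job ID afterwards by scanning the directory for files that match the pattern and were not there before the launch. The `job-<id>.out` naming scheme, the function names, and the `worker.jl` script in the usage note are hypothetical. This only shows the idea and is not how the package does it.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: srun replaces %j with the job ID.
OUTPUT_TEMPLATE = "job-%j.out"
OUTPUT_REGEX = re.compile(r"^job-(\d+)\.out$")


def srun_command(program, ntasks, workdir):
    """Build an srun argument list whose output file name contains the job ID."""
    output = Path(workdir) / OUTPUT_TEMPLATE
    return ["srun", f"--ntasks={ntasks}", f"--output={output}", *program]


def new_job_outputs(workdir, already_seen):
    """Map job ID -> output file for matching files not seen before the launch."""
    found = {}
    for path in Path(workdir).iterdir():
        match = OUTPUT_REGEX.match(path.name)
        if match and path.name not in already_seen:
            found[int(match.group(1))] = path
    return found
```

A caller would record the file names in the directory before starting `srun_command(["julia", "worker.jl"], 4, workdir)`, then poll `new_job_outputs` until a new file appears. This assumes no other job writes matching files into the same directory at the same time.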
(I'm happy to do the coding and make a PR if someone can tell me what it's supposed to do; for now I'll get something that works and we can see what you think)
We can probably do a quick check for job files in the directory we want to save to (which already kinda exists) and, instead of deleting all the files:
All of these should be pretty straightforward to add. Another thing you could add is a flag which turns job files on/off and the job_id functionality on and off.
On my cluster I have noticed that `SLURM_JOB_ID` is set after launching a job using srun, as might be expected from the name of the variable. A workaround at the moment is to submit an interactive job, run Julia on the compute node, and add workers using ClusterManagers. However, we should not expect it to be set before an srun command is called.
Scripts in https://github.com/magerton/julia-slurm-example work OK; maybe this should be mentioned on the front page and in the documentation?
To get around the replace bug mentioned in #118, I installed directly from master, but this introduced another bug: the .out name change introduced in #123 causes job creation to crash, since my cluster does not seem to set `SLURM_JOB_ID` (nor any other Slurm variable) at the point in time at which the code expects it.
Removing the jobID solved the problem.