gvegayon closed this issue 5 years ago
Need to work on this ASAP
You would probably want the option to run them with a different seed when you restart them as well.
It would be good to be able to rerun these failed jobs with the same random seed and arguments, to reproduce the problem, or with a different seed and arguments for debugging.
How about a function that returns a list of argument sets corresponding to the failed jobs? That way they could be rerun as-is or modified according to the user's needs.
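A rough sketch of what such a helper could look like (hypothetical, not part of sluRm; the `02-output-*` naming follows the convention mentioned later in this thread, and the `args.rds` file is an assumption):

```r
# Hypothetical helper: recover the argument sets of failed jobs from a job folder.
# Assumes each finished job writes 02-output-<i>.rds, and that the submitted
# argument sets were saved as a list in args.rds at submission time.
failed_job_args <- function(job_path, njobs) {
  out_files <- file.path(job_path, sprintf("02-output-%i.rds", seq_len(njobs)))
  failed    <- which(!file.exists(out_files))
  all_args  <- readRDS(file.path(job_path, "args.rds"))
  all_args[failed]  # rerun as-is, or tweak (e.g., change the seed) before resubmitting
}
```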
Perhaps it should be an option of `sbatch` (the underlying function that submits the jobs). The function could gain a new argument, e.g. `what`, which specifies which parts of the job array should be resubmitted. In that case, there could be a function that returns the sequence of failed jobs. This implies somewhat reimagining the workflow. A few changes that I foresee:
1. Submit the job. The last job should be kept in memory so that the user can grab it by typing something like `last_job()`. This is important because users can use the `collect = TRUE` option and still be able to access the last job.
2. Have a function called `read_job`, or something like that, that allows recovering any job. For this, when saving the auxiliary files we should also save either a plain-text file or a binary rds file with the job call itself, so that users can recover a job setup by simply typing the path to the Slurm job folder.
3. For debugging, one thing I find myself doing all the time: `parallel::mclapply` has a `tryCatch` wrapper, so sometimes I cannot tell whether a job has failed or not, and what I usually do is load one of the datasets and look at the data directly. In such cases `mclapply` usually returns a warning. I could try to tag the errors automatically and have a function, as mentioned earlier, to resubmit only the subset of jobs that failed (see the sketch after this list). I need to check how to modify the `ARRAY` variable in the batch file, and what the limit on that string is for Slurm; that should be `MaxArraySize` in the `slurm.conf` file.
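As a rough illustration of the "tag the errors automatically" idea: with `mc.preschedule = FALSE` on a Unix-alike, elements of the `mclapply` result that failed inherit from `try-error`, so the indices to resubmit can be picked out like this (a sketch, not sluRm code):

```r
library(parallel)

# Toy run in which job 3 fails on purpose; with mc.preschedule = FALSE each
# element runs in its own fork, so only the failing element becomes a try-error
res <- mclapply(1:5, function(i) if (i == 3) stop("boom") else i^2,
                mc.cores = 2, mc.preschedule = FALSE)

# Tag which elements are errors; these are the array indices to resubmit
failed <- which(vapply(res, inherits, logical(1), what = "try-error"))
failed
#> [1] 3
```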
Maybe your #2 covers this? I just ran a bunch of jobs, some of which failed for unknown reasons; the 02-output-* files were not generated for the failed jobs. When I tried to rerun some of those jobs individually, according to the job #, I was not able to recreate the error/failure; that is, the jobs completed successfully the second time around even though I used the same random seed. The problem is that now I cannot debug, because I cannot regenerate the error.
OK, so part of this was implemented in 1dccce5dad51bf18324ce12547dfaae40d36b17f. Now jobs can be resubmitted very easily:
```r
library(sluRm)

# A simple expr evaluation, WhoAmI() gives some info about the node
x <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 20, plan = "wait")

# Suppose jobs 1, 2, and 15-20 failed, then we can do the following
sbatch(x, array = "1,2,15-20")

# And, if status OK, then collect safely the entire array
ans <- Slurm_collect(x)
```
To keep it simple, users can check the status of a given submission with the `state` function, or by simply calling `sacct`. `state` will return an integer scalar telling whether the job has been submitted, is done, has failed, or is still running, together with a set of attributes that enumerate the state of each job in the array. So I think this is done :).
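For instance, continuing the example above (a sketch; the exact return values and attribute names of `state` are assumptions based on the description, not a documented API):

```r
s <- state(x)     # integer scalar: submitted, running, done, or failed
attributes(s)     # assumed to enumerate the state of each job in the array

# Hypothetical pattern: resubmit only the failed array indices, then collect
# failed_ids <- which(attr(s, "states") == "FAILED")   # attribute name is an assumption
# sbatch(x, array = paste(failed_ids, collapse = ","))
ans <- Slurm_collect(x)
```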
In some cases, a subset of the submitted jobs may fail because of, e.g., a time limit, a memory limit, or other reasons. In such cases, it would be nice if there were a way to resubmit the jobs that failed.
This was also suggested by @millstei.
In particular, we would need to do the following: