USCbiostats / slurmR

slurmR: A Lightweight Wrapper for Slurm
https://uscbiostats.github.io/slurmR/

Smooth way to re-run failed jobs #2

Closed gvegayon closed 5 years ago

gvegayon commented 6 years ago

In some cases, a subset of the submitted jobs may fail, e.g., because of a time limit, a memory limit, or other reasons. In such cases, it would be nice if there were a way to resubmit the jobs that failed.

This was also suggested by @millstei.

In particular, we would need to do the following:

gvegayon commented 5 years ago

Need to work on this ASAP

pmarjora commented 5 years ago

You would probably want the option to run them with a different seed when you restart them as well.

millstei commented 5 years ago

It would be good to be able to rerun these failed jobs with the same random seed and arguments, so that the problem can be regenerated, or with a different seed and arguments for debugging.
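
A package-agnostic sketch of that idea (sim_one and seeds are illustrative names only, not anything in the package): if the seed is part of each job's argument set, a failed job can be rerun bit-for-bit with the same seed, or probed with a fresh one.

# Illustrative only: pair each replicate with an explicit seed so a single
# failed job can be rerun reproducibly or with a fresh seed.
seeds   <- 1000 + seq_len(20)
sim_one <- function(i, seed) {
  set.seed(seed)                  # same seed + same arguments => same draws on rerun
  mean(rnorm(1e4, mean = i))      # stand-in for the real per-job computation
}

res_same <- sim_one(15, seeds[15])       # reproduce failed job 15 exactly
res_new  <- sim_one(15, seeds[15] + 1L)  # or probe it with a different seed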

millstei commented 5 years ago

How about a function that returns a list of argument sets corresponding to the failed jobs? That way they could be rerun as-is or modified according to the user's needs.
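
In plain R that could look something like the following; failed_args and the argument-set layout are hypothetical and not part of the package.

# Hypothetical helper, not part of the package: return the argument sets of
# the failed jobs so they can be rerun as-is or modified first.
failed_args <- function(arg_sets, failed_ids) arg_sets[failed_ids]

arg_sets  <- lapply(1:20, function(i) list(replicate = i, n = 1e4, seed = 1000 + i))
to_rerun  <- failed_args(arg_sets, c(1, 2, 15:20))
# to_rerun can now be passed back to the submission step, possibly after
# bumping the seeds or shrinking n for debugging.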

gvegayon commented 5 years ago

Perhaps it should be an option of sbatch (the underlying function that submits the jobs). The function could gain a new argument, e.g. what, indicating which parts of the job array should be resubmitted. In that case, there could be a function that returns the sequence of failed jobs. This implies somewhat reimagining the workflow. A few changes that I foresee:

  1. Submit the job. The last job should be kept in memory so that the user can grab it by typing something like last_job(). This is important because users can use the collect = TRUE option and still be able to access the last job.

  2. Have a function, called read_job or something like that, which allows recovering any job. For this, when saving the auxiliary files we should also save either a plain text file or a binary rds file with the job call itself, so that users can recover a job setup by simply passing the path to the Slurm job folder (see the sketch after this list).

  3. For debugging, one thing I find myself doing all the time is:

    • Looking at the job directory and, in particular, reading the log files generated by Slurm.
    • parallel::mclapply has a tryCatch wrapper, so sometimes I cannot tell whether a job has failed or not; what I usually do is load one of the datasets and look at the data directly. In such cases mclapply usually returns a warning.

    I could try to tag the errors automatically and have a function, as mentioned earlier, that resubmits only a subset of the jobs (in this case, those that failed); a base-R version of that check is sketched below. I need to check how to modify the ARRAY variable in the batch file, and what the limit on that string is for Slurm. That should be MaxArraySize in the slurm.conf file.
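
Roughly, items 1-3 could look like this in plain R; last_job, read_job, and the job.rds file name are placeholders taken from the description above, not the package API.

# Placeholder sketch of items 1-3 above; none of these names are the package API.
.last <- new.env()

# After a submission, the job object would be stored with
# assign("job", job, envir = .last); last_job() just retrieves it (item 1).
last_job <- function() get("job", envir = .last)

# Item 2: recover a job from its folder, assuming the call was serialized
# there with saveRDS(job, file.path(path, "job.rds")) when the auxiliary
# files were written.
read_job <- function(path) readRDS(file.path(path, "job.rds"))

# Item 3: tag failures in the collected results. Errors captured by tryCatch
# can be spotted with inherits(), so only the failing subset gets resubmitted.
failed_ids <- function(results)
  which(vapply(results, inherits, logical(1), what = "error"))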

millstei commented 5 years ago

Maybe your #2 covers this? I just ran a bunch of jobs, some of which failed for unknown reasons; the 02-output-* files were not generated for the failed jobs. When I tried to rerun some of those jobs individually, according to the job number, I was not able to recreate the error/failure; that is, the jobs completed successfully the second time around even though I used the same random seed. The problem is that now I cannot debug, because I cannot regenerate the error.

gvegayon commented 5 years ago

OK, so part of this was implemented in 1dccce5dad51bf18324ce12547dfaae40d36b17f. Now jobs can be resubmitted very easily:

library(sluRm)

# A simple expr evaluation, WhoAmI() gives some info about the node
x <- Slurm_EvalQ(sluRm::WhoAmI(), njobs = 20, plan = "wait")

# Suppose jobs 1, 2, and 15-20 failed, then we can do the following
sbatch(x, array = "1,2,15-20")

# And, if the status is OK, safely collect the entire array
ans <- Slurm_collect(x)

To keep it simple, users can check the status of a given submission with the state function, or by simply calling sacct. state returns an integer scalar indicating whether the job has been submitted, is done, has failed, or is still running, together with a set of attributes that enumerate the state of each job in the array. So I think this is done :).
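
As a usage sketch, assuming the state() accessor and its attributes work as described above (the actual names may differ), the check-then-collect pattern would be:

# Sketch only: names follow the description above, not necessarily the final API.
s <- state(x)                    # integer code: submitted / running / failed / done
if (s == 0L) {                   # assuming 0 means the whole array finished OK
  ans <- Slurm_collect(x)
} else {
  failed <- attr(s, "failed")    # assumed attribute listing the failed array tasks
  sbatch(x, array = paste(failed, collapse = ","))
}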