ScottishCovidResponse / SCRCIssueTracking

Central issue tracking repository for all repos in the consortium

Experiment with DMTCP on CSD3 #99

Open ianhinder opened 4 years ago

ianhinder commented 4 years ago

BICI does not have built-in checkpointing. On CSD3, the walltime limit for a job in the normal partition is 1.5 days. The documentation, https://docs.hpc.cam.ac.uk/hpc/user-guide/long.html, says that DMTCP is available, and provides examples. Try it out, and see if we can use it to spread a BICI run over multiple jobs. Start with serial, then go on to MPI.
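For reference, the CSD3 long-job examples boil down to something like the following serial sketch. The `bici` binary name, its arguments, and the checkpoint interval are placeholders, not taken from the docs:

```shell
# Minimal serial DMTCP sketch, loosely following the CSD3 long-job
# examples. "./bici inputfile" is a placeholder command line.

# First job: run BICI under DMTCP control, taking a checkpoint
# every hour (-i / --interval is in seconds).
dmtcp_launch --interval 3600 ./bici inputfile

# If the job is killed by the walltime limit, a later job restarts
# from the checkpoint files DMTCP wrote in the working directory:
dmtcp_restart ckpt_*.dmtcp
```

`dmtcp_launch` starts a coordinator automatically if none is running, so for a single-node serial run no extra setup is needed.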

alahiff commented 4 years ago

My first tests running BICI based on the DMTCP examples in https://docs.hpc.cam.ac.uk/hpc/user-guide/long.html have been successful. Both serial and MPI versions work. If a BICI job is restarted after being killed once it continues where it left off and the output files generated are the same as when BICI is run to completion without DMTCP.

The examples in the CSD3 user guide and GitHub page require the user to manually resubmit a job if it was killed the first time. I've written an improved version of the SLURM submission script which will checkpoint the job shortly before the walltime limit is reached and automatically resubmit it until the run completes.

This means that from the user's point of view they just submit the job once and forget about it. The job will run to completion, no matter how far its total runtime exceeds the walltime limit.

I've tested this with a short walltime limit, such that BICI needed to run 4 times in total in order to complete. The output files were identical to those from a run without DMTCP.

I still want to do some more tests to make sure it works reliably, and still need to fix a few things.

ianhinder commented 4 years ago

Excellent! Thanks very much! Can you put it on a branch (even if it's work-in-progress)?

alahiff commented 4 years ago

I will once I get things working again - for some reason, as soon as I updated this ticket, the scripts I was running all stopped working and now give bus errors :-( Still trying to sort this out...

alahiff commented 4 years ago

I've sorted out the problem with bus errors I was having - after setting a unique TMPDIR per job the problem disappeared.
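The fix could look something like the fragment below. This is a hypothetical illustration, not the actual script from the branch; `SLURM_JOB_ID` is set by SLURM, and the base path is an assumption:

```shell
# Hypothetical fragment: give each job its own TMPDIR so that DMTCP
# temporary files from different jobs can never collide.
# SLURM_JOB_ID is provided by SLURM; the base path is an assumption.
export TMPDIR="${SCRATCH:-/tmp}/dmtcp-tmp-${SLURM_JOB_ID}"
mkdir -p "$TMPDIR"
```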

I've added a branch called "dmtcp" containing a new submission script submit-bici-dmtcp. If the job is still running 200s before the walltime is reached, it will create a checkpoint and resubmit itself and resume where it left off. This can happen multiple times if necessary.
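The checkpoint-and-resubmit pattern can be sketched roughly as follows. This is a hypothetical outline, not the actual submit-bici-dmtcp script on the branch, which differs in detail; `./bici inputfile` and the 36-hour walltime are placeholders:

```shell
#!/bin/bash
#SBATCH --time=36:00:00
# Hypothetical sketch of the checkpoint-and-resubmit pattern; the real
# submit-bici-dmtcp script on the "dmtcp" branch differs in detail.
# "./bici inputfile" is a placeholder command line.

# Launch BICI under DMTCP, or restart it from an existing checkpoint.
if ls ckpt_*.dmtcp >/dev/null 2>&1; then
    dmtcp_restart ckpt_*.dmtcp &
else
    dmtcp_launch ./bici inputfile &
fi
pid=$!

# Watchdog: 200 s before the walltime limit, take a blocking
# checkpoint, resubmit this script, and shut the job down.
(
    sleep $(( 36*3600 - 200 ))
    dmtcp_command --bcheckpoint
    sbatch "$0"
    dmtcp_command --quit
) &
watchdog=$!

# If BICI finishes within the walltime, cancel the watchdog so no
# further resubmission happens.
wait "$pid"
kill "$watchdog" 2>/dev/null
```

Because the resubmitted job sees the checkpoint files in the working directory, it takes the `dmtcp_restart` branch and resumes where the previous leg left off; the cycle repeats until BICI exits on its own.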

Note that when BICI runs MPI_Finalize at the end there is an error:

```
[cli_0]: write_line error; fd=7 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(435).....: MPI_Finalize failed
MPI_Finalize(346).....: fail failed
MPID_Finalize(216)....: fail failed
MPIDI_PG_Finalize(141): PMI_Finalize failed, error -1
[cli_0]: write_line error; fd=7 buf=:cmd=abort exitcode=806978831
:
system msg for write_line failure : Bad file descriptor
```

This doesn't appear to be causing any problems, but I'm not sure yet what's causing it.

BTW, I removed the -a option from cp in the submission script where it copies the BICI executable. When I run the original submit-bici on DiRAC it gives an "Operation not permitted" error.
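One plausible explanation, offered as a guess: `cp -a` tries to preserve ownership, timestamps and extended attributes, and some shared filesystems refuse that, whereas a plain copy followed by an explicit chmod keeps the executable bit without the preserve step. A sketch, with `workdir` standing in for the job's working directory:

```shell
# "workdir" is a placeholder for the job's working directory.
workdir=$(mktemp -d)

# "cp -a" preserves ownership/xattrs and can fail with
# "Operation not permitted" on some shared filesystems:
#   cp -a bici "$workdir/"
# A plain copy, then restoring the execute bit explicitly:
cp bici "$workdir/"
chmod +x "$workdir/bici"
```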

ianhinder commented 4 years ago

Thanks! I'll give it a try. The MPI_Finalize error is a bit odd; I haven't seen that before. I got the Operation Not Permitted as well when I ran simulations in my home directory. It's a bit weird that you can't set the permissions. It went away when I ran simulations on the RDS directory. I wasn't sure if the script needed to be executable, which is why I didn't remove the -a. If it works, then that's good 🙂