checkpointing / breaking up job

jo-mc commented 4 years ago

I am trying to run correction on a human genome on our university HPC, but run times are limited to a maximum of 3 days. Is there a way to configure Ratatosk to run in batches? Our HPC supports checkpointing, which saves the state of a job and resumes it later, and I am looking into whether that is possible here. (I have requested extra time but could be in for a long wait.) I have 150G of nanopore reads and 240G of Illumina reads. A resume feature for program interruptions (out of time, out of memory, etc.) would also be helpful.

GuillaumeHolley commented 4 years ago

Hi @jo-mc,

First of all, I apologize for the delay in answering. For some reason, I didn't see your question right away.

For now, it is not possible to run Ratatosk in batches. In the reference-guided SLURM script I currently provide, each small bin correction requests 2 days of wall-clock time, but in practice I expect them to take no more than half a day with 24 threads (for a human genome with 50x ONT coverage and 60x Illumina coverage). The real problem is the last "ambiguous/unknown" bin correction, which usually has a lot of reads to correct and hence takes a long time. For that one, I request 7 days and 48 threads but, as with the other bins, I expect it to finish well within the request, in no more than 2 days. So for now, given the 3-day limit on your cluster, I advise you to modify the script to request 3 days rather than 7 for the final correction, and it should work without problems.
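As a rough illustration of the advice above, here is a minimal Python sketch that caps the wall-clock request in a SLURM batch script at a cluster limit. It assumes the script sets its limit with one `#SBATCH --time=...` or `#SBATCH -t ...` directive per line; the actual directives in the Ratatosk script are not shown in this thread, and the function names are made up for the example.

```python
import re


def slurm_time_to_seconds(spec: str) -> int:
    """Convert a SLURM time limit to seconds.

    Handles the documented formats: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S.
    Raises ValueError for anything else (e.g. "UNLIMITED").
    """
    spec = spec.strip()
    days = 0
    if "-" in spec:
        # With a day prefix, the remainder is H, H:M or H:M:S.
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:      # M
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:    # M:S
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:                    # H:M:S
            hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Matches "#SBATCH --time=VALUE", "#SBATCH --time VALUE" and "#SBATCH -t VALUE".
# Assumption: the time option is the first option on its #SBATCH line.
_TIME_DIRECTIVE = re.compile(
    r"^([ \t]*#SBATCH[ \t]+(?:--time(?:=|[ \t]+)|-t[ \t]*))(\S+)",
    re.MULTILINE,
)


def cap_time_directives(script_text: str, max_limit: str = "3-00:00:00") -> str:
    """Lower any time request above max_limit down to max_limit."""
    cap = slurm_time_to_seconds(max_limit)

    def _cap(match: re.Match) -> str:
        prefix, value = match.group(1), match.group(2)
        try:
            too_long = slurm_time_to_seconds(value) > cap
        except ValueError:
            return match.group(0)  # leave values we cannot parse untouched
        return prefix + (max_limit if too_long else value)

    return _TIME_DIRECTIVE.sub(_cap, script_text)
```

Requests already under the limit are left alone, so only the 7-day request for the final bin would change. Editing that one line by hand works just as well.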

Ultimately, I think it should be possible to add a checkpoint between the 1st and 2nd correction passes in Ratatosk, so that you could stop after the 1st pass and start a new job for the 2nd. I don't think it would be too difficult to implement. Would that be of interest to you?

jo-mc commented 4 years ago

Thanks, looking into it.

I'm just not sure how to get SLURM to checkpoint after the first pass (I have not used checkpointing before; I thought it could only be triggered by time). I have asked our HPC team for advice.

Joe.

GuillaumeHolley commented 4 years ago

Hi @jo-mc,

Just to let you know that the upcoming version of Ratatosk will make it possible to run the 1st and 2nd correction passes separately. So if you run out of time during the 2nd correction pass, you can restart it from the output of the 1st correction pass.

GuillaumeHolley commented 3 years ago

Hi @jo-mc,

The new version of Ratatosk allows you to split the correction into 2 or more steps. It is all described in the README. Let me know if you have questions or run into trouble with this new version.
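For anyone on a cluster with a wall-clock limit, one way to run the split steps is as a chain of SLURM jobs, each starting only if the previous one succeeded. Below is a minimal Python sketch of that idea, using `sbatch --parsable` and `--dependency=afterok`. The script names `pass1.sh` and `pass2.sh` are hypothetical placeholders; the Ratatosk command for each step is the one described in the README and is not reproduced here.

```python
import subprocess
from typing import List, Optional


def sbatch_command(script: str, after_ok: Optional[str] = None) -> List[str]:
    """Build the sbatch command line for one step of the chain."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job id
    if after_ok is not None:
        # Start this job only if the given job finished with exit code 0.
        cmd.append(f"--dependency=afterok:{after_ok}")
    cmd.append(script)
    return cmd


def submit(script: str, after_ok: Optional[str] = None) -> str:
    """Submit one script and return its job id (requires sbatch on PATH)."""
    result = subprocess.run(
        sbatch_command(script, after_ok),
        check=True,
        capture_output=True,
        text=True,
    )
    # With --parsable the output is "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]


def submit_chain(scripts: List[str]) -> List[str]:
    """Submit scripts in order, each depending on the previous one."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        previous = submit(script, after_ok=previous)
        job_ids.append(previous)
    return job_ids
```

Calling `submit_chain(["pass1.sh", "pass2.sh"])` on a login node would queue both steps at once, each with its own time limit. If a step fails or times out, the later jobs do not start, and that step can be resubmitted from the output of the step before it.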

Guillaume

DecodeGenetics / Ratatosk

checkpointing / breaking up job #17