dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Feature Request: Restarting Partway Through a Step #468

Open alexkrohn opened 2 years ago

alexkrohn commented 2 years ago

It would be really nice to be able to restart ipyrad from the middle of a step rather than having to restart the entire step.

This is probably most important for Step 3. For example, I was running ipyrad on a large dataset. Steps 1 and 2 took a few hours, but Step 3 took more than 7 days on my machine. When clustering (the longest part of Step 3) was ~85% done, after about 6 days of computation, the power went out.

To my knowledge, in order to restart that ipyrad run, I would have to rerun with -s 34567 and the same params file as before. That would restart Step 3 from the beginning. Given that all of the 85%-complete tmp and cluster files are already there, it would be great if there were enough information in the JSON to restart Step 3 where it failed, rather than from the beginning.
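
For concreteness, my understanding is that the restart would look something like this (shown here via the Python API; the assembly/JSON name is hypothetical):

```python
import ipyrad as ip

# Reload the saved assembly and rerun from Step 3 onward. Even though
# most of Step 3's tmp/cluster files already exist on disk, clustering
# starts over from scratch. (Assembly/JSON name is hypothetical.)
data = ip.load_json("mydata.json")
data.run("34567")
```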

Thanks!

isaacovercast commented 2 years ago

Thanks for the suggestion, and yes, I agree that this would be nice. Unfortunately it would be a not insignificant amount of work, so we have not prioritized it.

alexkrohn commented 2 years ago

That's what I figured, but I wanted to at least make it an official suggestion :-)

alexkrohn commented 1 year ago

Adding my voice back in again for checkpointing at Step 3. This time I'm running a salamander 3RAD dataset of 125 individuals where most individuals have 10-20 million reads, but one individual has 230 million. Most individuals finished clustering in 7 days (with 500 GB RAM, 40 cores, and 3 TB of allocated hard drive space), but that largest individual is still clustering after 15 days. I'm coming up against the deadline for this analysis. It would be wonderful to be able to restart the run with all individuals minus the 230-million-read individual, without having to redo the 7 days of clustering.
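
As far as I know, the closest workaround today is branching to a new assembly without the huge sample (a sketch with hypothetical names below), but Step 3 would still restart from the beginning for every retained sample:

```python
import ipyrad as ip

# Reload the interrupted assembly (name hypothetical).
data = ip.load_json("salamanders.json")

# Branch to a new assembly that keeps every sample except the
# 230-million-read individual (sample name hypothetical).
keep = [name for name in data.samples if name != "big_individual"]
sub = data.branch("no_big_individual", subsamples=keep)

# Step 3 still starts over from scratch for all retained samples.
sub.run("34567")
```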

isaacovercast commented 1 year ago

Oh man, I feel your pain. This is still a good idea; I am with you on that. At the moment I still can't make any promises (it's faculty job season), but I will try to think about whether there's a quick and dirty way to hotwire it.

You have probably already thought of this, but an alternative is to carve off the first 20 million reads from the huge sample, and just go forward with that. The extreme amount of extra reads for that one sample isn't really going to do much for you except burn CPU time (10-20 million reads per sample is a LOT). I know it would be throwing away data, but I do not think it would meaningfully change any of the results. Just an idea.
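
Since a FASTQ record is four lines, grabbing the first 20 million reads takes only a few lines of Python (filenames here are made up; do the same for the R2 file if paired-end):

```python
import gzip
from itertools import islice

N_READS = 20_000_000      # reads to keep
N_LINES = N_READS * 4     # each FASTQ record spans 4 lines

# Stream the first 20 million records into a new gzipped FASTQ.
# (Filenames are hypothetical.)
with gzip.open("big_sample_R1.fastq.gz", "rt") as src, \
     gzip.open("big_sample_sub_R1.fastq.gz", "wt") as dst:
    dst.writelines(islice(src, N_LINES))
```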

alexkrohn commented 1 year ago

I totally understand that this would be a significant lift. I figured I'd just put it out there that this would still be a useful feature 😃

With a genome likely bigger than 50 Gb, 10-20 million reads per individual rarely gets us more than 1-2k orthologous loci 😅 We're actually using this big run to design baits around SNPs of interest to cut down on the sequencing depth needed. Since clustering the final individual is only taking up 1-2 cores, I'll probably start another run with the first 20 million reads of the huge individual somewhere else.