dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Can I reuse the files generated by Steps 2, 3, and 4 in a new data set to save time in a re-analysis? #522

Closed jljy47 closed 10 months ago

jljy47 commented 10 months ago

Hello, I would like to ask whether the data files generated by Steps 2, 3, 4, and 5 can be reused in other new runs.

I find that Step 3 always takes much more time than the other steps, yet across different runs (different data sets that share several of the same samples), Steps 2, 3, and 4 seem to generate the same files for a given sample; the main difference is the description of that sample's reads and other information in the JSON file.

Since a previous run can be continued by loading the JSON file, is there a way to reuse the files generated in Steps 2-4 or 5 in a "new run" to save analysis time?
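For reference, this is what I mean by continuing via the JSON file, using ipyrad's Python API; a minimal sketch, where the path and step string are placeholders for my real project:

```python
import ipyrad as ip

# Re-load the saved Assembly state from its JSON file
# (the path here is a placeholder).
data = ip.load_json("analysis/my_assembly.json")

# Continue from wherever the run left off, e.g. steps 4-7.
data.run("4567")
```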

By a "new run" I mean adding new samples to a data set that has already finished Steps 6 and 7, or selecting a portion of the finished data and combining it with some new samples to create a new data set.
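To make that scenario concrete, here is a hedged sketch using ipyrad's documented branch and merge functions; all paths, assembly names, and sample names below are hypothetical placeholders, and whether this actually avoids re-running the early steps is exactly my question:

```python
import ipyrad as ip

# Branch the finished assembly down to a subset of its samples
# (names are hypothetical placeholders).
old = ip.load_json("analysis/finished_assembly.json")
subset = old.branch("subset", subsamples=["sample_1", "sample_2"])

# Load a separate assembly holding the newly added samples.
new = ip.load_json("analysis/new_samples.json")

# Merge the two into one new assembly, then run the across-sample
# steps 6-7 on the combined data set.
merged = ip.merge("combined", [subset, new])
merged.run("67")
```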

Sorry for this opportunistic idea, but I wonder if it could speed up the analysis, because Step 3 always takes me many days (at least 7 days even for 40 samples of ddRAD data) and creates many large files that I can't bear to delete immediately but find inconvenient to store.

However, I know all of this is needed for the science~ And thank you very much for developing this very user-friendly analysis software~~~

isaacovercast commented 10 months ago

Hello @jljy47,

What you suggest is certainly a reasonable idea: reusing the files associated with samples already run through steps 1-5 in a different assembly. In theory this is a great idea, but in practice it would require some re-engineering of the internals of ipyrad (because we make a lot of assumptions about the continuity of samples within a given assembly). Our general philosophy is that disk and CPU are cheap while stability and consistency are important, so while this seems like a nice idea, I don't think we'll ever build it in as a feature, particularly because it seems to be somewhat of an edge case. Thanks for sharing this idea though!

Also, thank you for the positive feedback! Glad you find ipyrad useful and user-friendly!

jljy47 commented 10 months ago

Wow, thank you so much for your response~~

I want to add two more questions about the generated intermediate data files:

  1. After Step 7 completes, are they useless? For example, if I need to run the assembly again later, do I still need to keep the data files and folders from those steps (trimmed, 0.85, ...), or is it OK to keep only the JSON file?

Like the situation in the "re-load" part (cell In [7]) of the "Analysis tools" tutorial on this page:

https://ipyrad.readthedocs.io/en/latest/9-tutorials.html

  2. Recently, I interrupted my run at 97% of Step 3. If I try to run again from this point, the program reports an error. It seems that not all of the intermediate files can be reused (and they seem to be the cause of the error). Is the only way to delete all the files and start again from Step 1?
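A minimal sketch of the usual recovery path for (2), assuming the force flag in ipyrad's API re-runs a step from its inputs (the path is a placeholder):

```python
import ipyrad as ip

# Re-load the assembly and force step 3 to run again from its inputs,
# overwriting the partial results left by the interrupted run.
data = ip.load_json("analysis/my_assembly.json")
data.run("3", force=True)
```

The equivalent CLI flag is -f.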

Thanks again for your help and answers.

isaacovercast commented 10 months ago

1) You can remove all the temporary files except for the _consens and _across directories, which are required for re-running step 7.

2) Without seeing the error message it's impossible for me to say what actually happened. You should be able to restart from step 3. If you show me the error message you are talking about it might help diagnose the problem a bit better.
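To illustrate (1) concretely: a hedged sketch of re-running step 7 after cleanup, keeping only the JSON plus the _consens and _across directories. The path, branch name, and parameter value here are placeholders, and the parameter accessor may differ between ipyrad versions:

```python
import ipyrad as ip

# Only the JSON plus the _consens and _across directories need to
# survive cleanup for this re-run to work.
data = ip.load_json("analysis/my_assembly.json")

# A typical reason to re-run step 7: branch, change an output-affecting
# parameter, and regenerate the final files (values are illustrative).
min4 = data.branch("min4")
min4.params.min_samples_locus = 4
min4.run("7", force=True)
```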

jljy47 commented 9 months ago

  1. Thank you for your prompt reply. I'm sorry that sometimes I can't log in to GitHub, so I can't always reply to you in time.

  2. The error messages are listed in this online document: ipyrad bugs in my Step 3

  3. But I'm more concerned about this:

Is the processing speed of Step 3 related to sample similarity?

My friend ran 280 samples from two species and completed Step 3 in only 2 days. My data set includes 45 samples from 20 species, and it is still at 97% after 10 days of running.

Is there any way to increase the speed?

Thank you ~~

isaacovercast commented 9 months ago

Step 3 speed is not related to similarity so much as it is to the properties of the data. Each dataset is unique and the amount of time it will take to run is unpredictable, so just let it keep going.
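That said, one generic, data-independent lever is to make sure step 3 actually has all of your available cores. With the Python API that means handing run() an ipyparallel client explicitly; a minimal sketch, assuming an ipcluster is already running (path and core count are placeholders):

```python
import ipyparallel as ipp
import ipyrad as ip

# Start a cluster first in a shell, e.g.:
#   ipcluster start -n 40 --daemonize
# then connect to it and pass the client to run().
ipyclient = ipp.Client()

data = ip.load_json("analysis/my_assembly.json")
data.run("3", ipyclient=ipyclient)
```

With the CLI, the number of cores is set with the -c flag instead.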

jljy47 commented 9 months ago

> Step 3 speed is not related to similarity so much as it is to the properties of the data. Each dataset is unique and the amount of time it will take to run is unpredictable, so just let it keep going.

Thank you so much for your help and reply! ;)