Restart get_homologues from blastp results

eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

Restart get_homologues from blastp results #116

Closed vincebaby6 closed 5 months ago

vincebaby6 commented 6 months ago

Hello there,

I have been running get_homologues on a large set of sequences (~180 genomes), which took a long time. Near the end of the blastp runs, a core dump corrupted one of the blastp outputs, which eventually prevented the pipeline from completing. Is there a way to restart get_homologues from the blastp runs instead of starting all over (which would take a long time and might hit another core dump)? Also, if that is possible, should I rerun the corrupted blastp job manually before restarting the pipeline, or can it be rerun from within get_homologues?

Thank you very much in advance!

Vincent

eead-csic-compbio commented 6 months ago

Hi @vincebaby6, indeed that's a large set of genomes, which will require 180^2 (32,400) BLASTP jobs. Your question seems to be related to issue https://github.com/eead-csic-compbio/get_homologues/issues/44

What I would do is:

- check the size of the resulting .blast.gz files, for instance with ls -ltrh
- remove those that seem to be incomplete
- depending on where the job failed, you might also want to remove the tmp/ folder

After this you should be able to re-run the original get_homologues.pl command, from the original location, and continue the job picking up the failed BLASTP jobs only.
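The size check above can be complemented with an integrity check. The following is a minimal sketch, not part of get_homologues: it assumes the BLASTP outputs are ordinary gzip files matching *.blast.gz and that an interrupted job leaves an empty, truncated, or otherwise unreadable archive. The function name find_suspect_blast_files and the directory argument are hypothetical; point it at the folder that holds your results.

```python
import gzip
import sys
import zlib
from pathlib import Path


def find_suspect_blast_files(results_dir, pattern="*.blast.gz"):
    """Return gzip files that are empty or cannot be decompressed to the end.

    Hypothetical helper: the file pattern and directory layout are assumptions.
    """
    suspects = []
    for path in sorted(Path(results_dir).glob(pattern)):
        # A zero-byte file cannot be a valid gzip archive.
        if path.stat().st_size == 0:
            suspects.append(path)
            continue
        try:
            with gzip.open(path, "rb") as handle:
                # Read in 1 MiB chunks so large results do not fill memory.
                while handle.read(1 << 20):
                    pass
        except (OSError, EOFError, zlib.error):
            # Bad header, truncated stream, or corrupted compressed data.
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Usage: python check_blast_gz.py <results_dir>  (defaults to current dir)
    target_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for suspect in find_suspect_blast_files(target_dir):
        print(suspect)
```

The script only lists candidates and deletes nothing, so the list can be compared against the ls -ltrh output before any file is removed. A file that decompresses cleanly could still be incomplete if the job stopped before compression started, so the size check remains useful.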

When the BLAST jobs are done there are still major tasks pending, which will run faster in parallel if you have an HPC cluster (-m cluster) and which might require significant RAM (-s) when clusters are to be computed. Please let us know how it goes, Bruno

vincebaby6 commented 6 months ago

Thank you very much for your answer Bruno!

Is there a reason I should not remove the tmp/ folder? Does its presence simply mean that get_homologues did not delete it because of the crash, and that in that case there is no risk in deleting it?

Thanks in advance!

Vincent

eead-csic-compbio commented 6 months ago

The tmp/ folder contains files that will be re-used when possible in other tasks down the line. If they are correct, you should let the program use them, as that will speed up the analysis. However, in your case it might be safer to remove the folder so that those files are recomputed from the correct underlying BLASTP results.

vincebaby6 commented 6 months ago

Thank you again! I followed your instructions and everything ran smoothly! I did not take any chances and removed the tmp/ folder before relaunching get_homologues.

Thank you very much!

Vincent