eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

dryrun.txt file has the same lines for multiple times #80

Closed carolynzy closed 3 years ago

carolynzy commented 3 years ago

Hi, I noticed that I had a very large dryrun.txt file, with more than 880k lines. I checked some sample IDs and found that some lines were exactly the same. For example, the following five lines are identical, but appear at different places in the file (as indicated by the line number at the front):

9743:/root/get_homologues-x86_64-20210305/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 14123.gbk -j 12359.gbk -E 1e-05 -S 0 -C 20 -f 0
114724:/root/get_homologues-x86_64-20210305/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 14123.gbk -j 12359.gbk -E 1e-05 -S 0 -C 20 -f 0
328807:/root/get_homologues-x86_64-20210305/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 14123.gbk -j 12359.gbk -E 1e-05 -S 0 -C 20 -f 0
437436:/root/get_homologues-x86_64-20210305/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 14123.gbk -j 12359.gbk -E 1e-05 -S 0 -C 20 -f 0
731166:/root/get_homologues-x86_64-20210305/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 14123.gbk -j 12359.gbk -E 1e-05 -S 0 -C 20 -f 0

Is there something wrong with my dryrun.txt file?
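One quick way to quantify this kind of redundancy is to count how often each command line occurs. The following is a minimal Python sketch, not part of GET_HOMOLOGUES; the function name `duplicate_report` is made up for illustration, and it only assumes that dryrun.txt holds one command per line.

```python
from collections import Counter


def duplicate_report(lines):
    """Return (total, unique, max_repeat) for an iterable of command lines.

    Blank lines are ignored; lines are compared after stripping whitespace.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())                  # all non-blank lines
    unique = len(counts)                          # distinct command lines
    max_repeat = max(counts.values(), default=0)  # copies of the most repeated line
    return total, unique, max_repeat


# Small in-memory demonstration with made-up commands.
demo = ["cmd -i a.gbk -j b.gbk\n", "cmd -i a.gbk -j c.gbk\n", "cmd -i a.gbk -j b.gbk\n"]
print(duplicate_report(demo))
```

To apply it to a real file, pass an open file handle, for example `duplicate_report(open("gbk1_homologues/dryrun.txt"))`.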

brunocontrerasmoreira commented 3 years ago

That does not look good, that file should be rewritten every run. It might be a bug, but might also be due to incomplete previous results?

brunocontrerasmoreira commented 3 years ago

I suggest you start off with 10 genomes and see if you can reproduce the error

carolynzy commented 3 years ago

I deleted the tmp folder completely before this run, so it is not caused by the previous run. However, I will try with 10 samples this time. Thank you!

carolynzy commented 3 years ago

I have tested with 10 samples and the dryrun.txt file seems fine. I don't know what went wrong in the previous run. But I will try my luck and run again with my 420 samples.

vinuesa commented 3 years ago

Start out with a clean directory structure - delete the my_gbks_homologues directory before starting over. Keep us posted.

carolynzy commented 3 years ago

Hi, @vinuesa, I have removed the tmp folder and repeated the dryrun step, but I still got the same dryrun.txt file. So, if I understand correctly, I should remove all the other files except the blastn output *.blast.gz files, right? If I do so, there will be 420 * 420 = 176400 *.gz files and one "input_order.txt" file in my gbks homologues folder. Then I should start the dryrun step from this point. I just want to make sure I understand you correctly. Thank you!

Another question: it seems that I cannot run with multiple threads in the dryrun step, even though I used the -n 63 parameter in the command line. Is there a way to make this possible?

carolynzy commented 3 years ago

Just some additional information, when I ran with 10 samples, the dryrun.txt file seemed just fine; but when I did the same thing with 420 samples, then there was a problem. I couldn't figure out why.

brunocontrerasmoreira commented 3 years ago

You mean the problem is still there after repeating? The problem with interrupted runs is that you might end up with faulty/empty Blast files. You could remove all those or the whole _homologues folder as suggested by @vinuesa
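To look for faulty or empty compressed BLAST outputs before deleting everything, one option is to test every *.blast.gz file for integrity. The sketch below is only an illustration in Python and is not part of GET_HOMOLOGUES; the function name `faulty_gz_files` is hypothetical, and it assumes the files sit directly in the _homologues folder, as described earlier in this thread.

```python
import gzip
import zlib
from pathlib import Path


def faulty_gz_files(directory, pattern="*.blast.gz"):
    """Return the paths of gzip files that are empty, truncated or corrupt."""
    bad = []
    for path in sorted(Path(directory).glob(pattern)):
        # A zero-byte file reads back as empty without raising, so flag it explicitly.
        if path.stat().st_size == 0:
            bad.append(path)
            continue
        try:
            with gzip.open(path, "rb") as handle:
                # Decompress the whole stream in 1 MiB chunks; damaged files raise here.
                while handle.read(1 << 20):
                    pass
        except (OSError, EOFError, zlib.error):
            bad.append(path)
    return bad


# Example call (the folder name is taken from this thread):
# for path in faulty_gz_files("gbk1_homologues"):
#     print(path)
```

Reading each file to the end is slow for 176400 files, but it catches truncated streams that a size check alone would miss.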

brunocontrerasmoreira commented 3 years ago

> Another question: it seems that I cannot run with multiple threads in the dryrun step, even though I used the -n 63 parameter in the command line. Is there a way to make this possible?

You are right, -n and dryrun are not compatible: dryrun does not run any jobs, it just creates the file with the commands to be run.

carolynzy commented 3 years ago

@vinuesa @brunocontrerasmoreira Hi, I have followed your advice to start from the beginning. I have deleted the whole gbk1_homologues folder and repeated all the steps. However, I still got the same dryrun.txt file with 884100 lines.

I looked into the details and found that some lines are repeated up to 10 times, while 1933 lines occur only once. In total, there were 176665 lines after removing duplicates. I have 420 samples and 1 reference genome in the gbk1 folder, so I guess there should be 176820 unique lines in the dryrun file (please correct me if I'm wrong). I therefore checked which samples have fewer lines than they should. There were 126 samples that appeared in fewer than 840 lines. The sample with the lowest count was sample 70 (821 lines). I checked the gbk file of this sample, and it seemed fine to me.

I'm not sure what I should do next. So I have uploaded the reference file, sample 1, sample 70 and sample 4128 here. And my command lines as below:

get_homologues.pl -d gbk1 -m dryrun -c -A -t 0 -P -z -G
parallel -j -1 < gbk1_homologues/dryrun.txt

I'm new to this software. Maybe I have made some mistakes I'm not aware of. Thanks a lot for your time and help!

brunocontrerasmoreira commented 3 years ago

Thanks for your report, can you confirm this dryrun.txt file is the first one created, before any BLAST jobs have been run? I guess at this point I would like to reproduce your error and correct the code if needed. For that I would need to use your input, can you please provide a URL where I can download it? Thanks, Bruno

carolynzy commented 3 years ago

@brunocontrerasmoreira Thank you! This dryrun.txt file is the one created after blastp was done. So I had all the *.blast.gz files in my gbk1_homologues folder and then I ran the get_homologues.pl -m dryrun step again. The dryrun.txt file was created after this step. The first line in the file looks like this:

/root/get_homologues-x86_64-20210305/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 07MDR10138.gbk -j 07MDR10133.gbk -E 1e-05 -S 0 -C 20 -f 1

The sample files can be downloaded at this link:

https://1drv.ms/u/s!Ahu3aHGoa85BhA4sgPBmxY-DyT2t?e=hLwUJe

brunocontrerasmoreira commented 3 years ago

There are only 4 sample files, is that correct?

brunocontrerasmoreira commented 3 years ago

I will try to have a look at the end of the day, but I would prefer to have a larger test set

brunocontrerasmoreira commented 3 years ago

While you respond to my questions can you please share your dryrun.txt file? Thanks

brunocontrerasmoreira commented 3 years ago

I am now sure there's a bug when you do -m dryrun -c, which explains those repeated batch commands. I will fix it and post a corrected script.

brunocontrerasmoreira commented 3 years ago

Please see https://github.com/eead-csic-compbio/get_homologues/releases/tag/v3.4.3 or do git pull to fetch the new code. All your previous (BLAST) results should be reusable, the new code produces dryrun.txt files with no redundancy when using -c.

Since you have a large number of genomes I would recommend that you first run your analyses without -c to see how the job goes (and consider using -s if you have RAM problems). So your command could be something like

$ perl get_homologues.pl -d gbk1 -m dryrun -A -t 0 -P -G

When that run completes you should be able to add -c -z if you wish, but note that this runs $NOFSAMPLESREPORT = 10 simulations, so it will take a lot longer even if it re-uses as many previous results as possible. Hope this helps, Bruno

carolynzy commented 3 years ago

@brunocontrerasmoreira , Thank you so much! I will try the new code right away.

carolynzy commented 3 years ago

@brunocontrerasmoreira Hi, I have tried the command

get_homologues.pl -d gbk1 -m dryrun -A -t 0 -P -G

and got the following message:

...
# making COGs
ERROR: find_COGs (/root/get_homologues-x86_64-20210305//bin/COGsoft/COGtriangles/COGtriangles ) failed to terminate job

The COGtriangles.log file shows that:

LSE completed
Hitset completed

I don't know what went wrong. What should I do next?

brunocontrerasmoreira commented 3 years ago

This could be your job running out of RAM, as your dataset is larger than anything I have tried with bacteria before. Also, I have seen COGtriangles fail before and then complete just fine on repetition, so I suggest you repeat that step after removing those files from tmp/, and with more RAM if possible.

brunocontrerasmoreira commented 3 years ago

You can also try -M instead of -G.

carolynzy commented 3 years ago

@brunocontrerasmoreira Thank you for your suggestions. I'm working on this as you suggested. I have one question about the dryrun mode.

According to the manual, the dryrun step should be repeated several times. If I understand correctly, it should be run at least three times, for

_cluster_makeHomolog.pl, _cluster_makeInparalog.pl and _cluster_makeOrtholog.pl

Am I right?

If so, and something went wrong at one step after some other steps had already finished (e.g. something went wrong at the _cluster_makeHomolog.pl step, while _cluster_makeOrtholog.pl had already finished successfully), where should I resume? Should I delete all the files in the tmp folder and start from the beginning, or just delete the homologues_xxx.gbk_xxx.gbk files and repeat the dryrun and parallel steps? Or would you suggest something else?

I'm asking this because for a large sample set, each step could take several days. It would be really convenient if I don't have to repeat all the steps.

Thank you very much!

brunocontrerasmoreira commented 3 years ago

In your case it should be enough to remove from tmp/ all the homolog files, but not the ortholog ones, does that make sense? Bruno

carolynzy commented 3 years ago

@brunocontrerasmoreira Yes, that makes sense. I have a follow up question.
I was stuck at the dryrun step, probably because it ran out of RAM (I forgot to add the -s option when I repeated the command). I had to turn off the computer, otherwise it would have stalled forever. The dryrun.txt file was empty after I did this. Then I repeated the dryrun step, which produced the dryrun.txt file. The first several lines in the file look like:

/root/get_homologues-x86_64-20210828/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 07MDR10138.gbk -j 07MDR10133.gbk -E 1e-05 -S 0 -C 20 -f 1
/root/get_homologues-x86_64-20210828/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 07MDR10142.gbk -j 07MDR10138.gbk -E 1e-05 -S 0 -C 20 -f 1
/root/get_homologues-x86_64-20210828/_cluster_makeHomolog.pl -d /mnt/gbk1_homologues -b /mnt/gbk1_homologues/tmp/all.bpo -i 07MDR10142.gbk -j 07MDR10133.gbk -E 1e-05 -S 0 -C 20 -f 1

The dryrun.txt file has 176665 lines, which is fewer than the 176820 (421 * 421 - 421) I expected. I'm not sure whether this is normal. Is there anything I can do to check that everything is fine?

Thank you!

brunocontrerasmoreira commented 3 years ago

The code will skip jobs if the corresponding result files in tmp/ already exist. If you are not sure those existing files are complete you might want to remove all homologues files therein and relaunch.
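To see which jobs were skipped, one can list the genome pairs that do not appear in dryrun.txt. The sketch below is an illustration only and not part of GET_HOMOLOGUES: the function name `missing_pairs` is hypothetical, the genome list is assumed to be the file names in the input folder, and it assumes one job per ordered pair of distinct genomes, as in the poster's 421 * 421 - 421 calculation.

```python
import re
from itertools import permutations

# Matches "-i <genome> -j <genome>" as in the command lines quoted above.
PAIR_RE = re.compile(r"(?<!\S)-i\s+(\S+)\s+-j\s+(\S+)")


def missing_pairs(genomes, lines):
    """Return the sorted ordered pairs of genomes with no line in `lines`."""
    present = set()
    for line in lines:
        match = PAIR_RE.search(line)
        if match:
            present.add((match.group(1), match.group(2)))
    # Every ordered pair of distinct genomes is assumed to need one job.
    return sorted(set(permutations(genomes, 2)) - present)


# Small in-memory demonstration with made-up genome names.
genomes = ["a.gbk", "b.gbk", "c.gbk"]
lines = [
    "job.pl -i a.gbk -j b.gbk -E 1e-05",
    "job.pl -i b.gbk -j a.gbk -E 1e-05",
    "job.pl -i c.gbk -j a.gbk -E 1e-05",
]
for pair in missing_pairs(genomes, lines):
    print(pair)
```

The pairs it reports are the ones whose result files presumably already exist in tmp/, so those are the files to inspect or remove before relaunching.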

PS: If launching this is only possible with option -s I suspect you might run out of RAM down the line, but you won't know until you try, right?

Bruno

carolynzy commented 3 years ago

@brunocontrerasmoreira Yes, I got it. Thank you very much! You have been of great help.