AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0
142 stars 37 forks

Speed, reproducibility issues with hmmsearch #12

Closed brymerr921 closed 4 years ago

brymerr921 commented 4 years ago

Hi VIBRANT developers,

First off, thanks for putting together such a great tool. I'm very excited to use it and think it will be really helpful in my research.

I'm noticing that when I run VIBRANT twice on the example dataset, some of the results change. I don't know whether this has an impact on the final output or not, but I think hmmsearch should return the same results when run multiple times. Beyond the files listed below, I haven't yet done any further investigation.

I installed VIBRANT using conda install -c bioconda vibrant and got version 1.0.1. Next, I cloned this repo into my working directory and then ran:

time VIBRANT_run.py -i VIBRANT/example_data/mixed_example.fasta -t 6 -folder test_VIBRANT_out

This took me ~2.5 minutes to run.

I ran the exact same command again, specifying a different output folder, which again took ~2.5 minutes:

time VIBRANT_run.py -i VIBRANT/example_data/mixed_example.fasta -t 6 -folder test_VIBRANT_out2

However, I then did the following to see how many parsed HMM hits I got for each run, compared with the example data provided in this GitHub repo. I ran:

wc -l test_VIBRANT_out*/*/*parsed*/* VIBRANT/*/*/*/*parsed*/*

and got:

   29 test_VIBRANT_out/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.KEGG_hmmtbl_parse.tsv
   20 test_VIBRANT_out/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.Pfam_hmmtbl_parse.tsv
   32 test_VIBRANT_out/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.VOG_hmmtbl_parse.tsv
    2 test_VIBRANT_out/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.plasmid_hmmtbl_parse.tsv
    1 test_VIBRANT_out/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv

   29 test_VIBRANT_out2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.KEGG_hmmtbl_parse.tsv
   20 test_VIBRANT_out2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.Pfam_hmmtbl_parse.tsv
   32 test_VIBRANT_out2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.VOG_hmmtbl_parse.tsv
    2 test_VIBRANT_out2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.plasmid_hmmtbl_parse.tsv
    3 test_VIBRANT_out2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv

   29 VIBRANT/example_data/example_output/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.KEGG_hmmtbl_parse.tsv
   20 VIBRANT/example_data/example_output/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.Pfam_hmmtbl_parse.tsv
   32 VIBRANT/example_data/example_output/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.VOG_hmmtbl_parse.tsv
    1 VIBRANT/example_data/example_output/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.plasmid_hmmtbl_parse.tsv
    1 VIBRANT/example_data/example_output/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv

The number of plasmid and vpfam hits is not consistent between runs. I don't know enough about hmmsearch to know whether this is intended behavior or not.

In other projects I've implemented hmmsearch similarly to how you do it here, because it's much faster to break the input into n pieces and run n single-CPU instances of hmmsearch in parallel. However, I noticed that, both in my earlier projects and here, VIBRANT spins up 6 instances of hmmsearch --cpu 1, yet when I check resource usage via htop, ~12 threads are busy. I think this is because --cpu in hmmsearch specifies additional worker CPUs on top of the main process. For reference:

Usage: hmmsearch [options] <hmmfile> <seqdb>
 --cpu <n>     : number of parallel CPU workers to use for multithreads  [2]
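
To make that splitting approach concrete, here's a rough sketch of the pattern I've used elsewhere (profiles.hmm and the chunk files are placeholders, and I'm assuming the proteins have already been split into chunks; this is the general pattern, not VIBRANT's code):

# launch one single-CPU hmmsearch per pre-split chunk, in the background
for chunk in chunk_1.faa chunk_2.faa chunk_3.faa; do
    hmmsearch --cpu 1 --tblout "${chunk%.faa}.hmmtbl" profiles.hmm "$chunk" > /dev/null &
done
wait  # block until every background hmmsearch has finished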

Based on that usage text, setting --cpu 1 means each hmmsearch process occupies two threads. Surprisingly, I've seen significant speedups when breaking up my input files and feeding them to hmmsearch --cpu 0, even without spawning additional parallel hmmsearch processes; spawning additional parallel hmmsearch --cpu 0 processes makes things go even faster. I was interested in whether I could similarly speed up VIBRANT (I have many enormous assemblies), so I ran this command to change --cpu 1 to --cpu 0 in VIBRANT_annotation.py:

sed -i "s/--cpu', '1'/--cpu', '0'/g"  /opt/conda/bin/VIBRANT_annotation.py

I then ran these two commands, which took only 52 seconds (wow!!!):

time VIBRANT_run.py -i VIBRANT/example_data/mixed_example.fasta -t 6 -folder test_VIBRANT_out_FAST
time VIBRANT_run.py -i VIBRANT/example_data/mixed_example.fasta -t 6 -folder test_VIBRANT_out_FAST2

and compared the outputs of the parsed hmmsearch files:

   29 test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.KEGG_hmmtbl_parse.tsv
   20 test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.Pfam_hmmtbl_parse.tsv
   32 test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.VOG_hmmtbl_parse.tsv
    1 test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.plasmid_hmmtbl_parse.tsv
    1 test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv

   29 test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.KEGG_hmmtbl_parse.tsv
   20 test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.Pfam_hmmtbl_parse.tsv
   32 test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.VOG_hmmtbl_parse.tsv
    2 test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.plasmid_hmmtbl_parse.tsv
    6 test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv

The results are close to, but not identical to, the example files.

Anyway, hopefully this is helpful for troubleshooting the reproducibility issue, and maybe it points toward some potentially large speed improvements in VIBRANT's hmmsearch steps. Since hmmsearch --cpu 0 keeps each subprocess on a single CPU (instead of two in VIBRANT's current implementation), I could realistically run VIBRANT_run.py -t 16 on a 16-core machine without hmmsearch trying to operate on 32 cores.

Finally, if I understand correctly, hmmsearch calculates e-values based on the number of sequences being searched and the number of entries in the database. Splitting up the input and running the chunks separately means e-values are calculated per chunk rather than for the input dataset as a whole. This may not be an issue at all (and might not be the cause of the changes between identical runs), but one option would be to pass -Z to each hmmsearch run, with Z set to the number of queries in the entire input times the number of entries in the specific HMM database (see the comment from Sean Eddy here).
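
For illustration, what I have in mind is roughly the following (file names are placeholders; the exact value to plug into -Z is the part I'm least sure about):

# total number of query proteins in the unsplit input
TOTAL=$(grep -c '^>' all_proteins.faa)
# pass the same -Z to every chunked run so e-values reflect the full search, not just the chunk
# (this may also need scaling by the number of entries in the HMM database, as described above)
hmmsearch -Z "$TOTAL" --cpu 0 --tblout chunk_1.hmmtbl profiles.hmm chunk_1.faa > /dev/null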

Thanks, and please let me know if you have any questions. Bryan Merrill

KrisKieft commented 4 years ago

Hi Bryan,

Thank you for the great insight!

First, I don't consider e-values when selecting annotations; rather, I use scores. Scores will be consistent because they depend on the HMM profile rather than on the database/dataset size, so regardless of how many parallel runs or how many input proteins there are, each protein should receive the same hmmsearch score per HMM. As you mentioned, e-values can differ between runs when using hmmsearch, whereas they would be consistent with hmmscan (VIBRANT does not use hmmscan, again for speed).
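
To illustrate, selecting hits by the bit-score column of an hmmsearch --tblout file, rather than by e-value, looks roughly like this (the 40-bit cutoff and file name are just placeholders, not VIBRANT's actual thresholds, and I'm assuming the standard --tblout column layout):

# in --tblout output, column 5 is the full-sequence e-value and column 6 is the full-sequence bit score
awk '!/^#/ && $6 >= 40 {print $1 "\t" $3 "\t" $6}' mixed_example.KEGG.hmmtbl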

Second, I'm not sure what reproducibility issue you're running into. Not that I don't believe you, but when I run the same dataset multiple times with different parallel runs I still get the same number of annotations. For example, I get zero "vpfam" hits on every run. Maybe something odd is happening with wc -l and the files just have different numbers of blank lines. Could you open those files and check whether there are actually hits? If the number of hits really does differ, please let me know.
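
For a quick check that ignores the header line and any blank lines (file name taken from your output above):

awk 'NR > 1 && NF' mixed_example.vpfam_hmmtbl_parse.tsv | wc -l  # count non-blank data rows only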

Third, I will definitely check out the --cpu 0 option. If you give VIBRANT -t 6, it breaks the input file into 6 separate files and runs hmmsearch --cpu 1 on each to increase speed, rather than running a single hmmsearch --cpu 6, since hmmsearch doesn't multi-thread well. If --cpu 0 makes this even faster, I'd be happy: hmmsearch is the bottleneck of VIBRANT's speed. I just ran some runtime tests, so I can do a comparison of --cpu 1 vs --cpu 0.
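
Concretely, the comparison I have in mind is something like the following, run on one of the split chunks (the database and chunk names here are placeholders):

# time the same chunk against the same HMM database with the two --cpu settings
time hmmsearch --cpu 1 --tblout chunk_1.cpu1.hmmtbl KEGG_profiles.HMM chunk_1.faa > /dev/null
time hmmsearch --cpu 0 --tblout chunk_1.cpu0.hmmtbl KEGG_profiles.HMM chunk_1.faa > /dev/null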

Thanks again for taking the time to flush out potential issues and improvements. Let me know what you come up with on the reproducibility issue and if you have any other questions.

Kris

brymerr921 commented 4 years ago

Hi Kris,

Here's the text output from the vpfam tables. FAST and FAST2 are identical runs with all hmmsearch steps set to --cpu 0. SLOW and SLOW2 are identical runs with all hmmsearch steps set to --cpu 1 (default).

test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id      evalue  score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1  PF00580.21      1.5999999999999998e-77  252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1   PF00580.21      3.699999999999999e-64   209.0

test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id      evalue  score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1  PF00580.21      7.1999999999999995e-78  252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1   PF00580.21      1.6999999999999997e-64  209.0
MH552510.2$~&Microviridae$~&sp.$~&isolate$~&ctch16,$~&complete$~&genome_1_1     PF02305.17      3.9e-21 66.0

test_VIBRANT_out_SLOW2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id      evalue  score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1  PF00580.21      2.9999999999999994e-77  252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1   PF00580.21      6.799999999999999e-64   209.0
MH552500.1$~&CrAssphage$~&sp.$~&isolate$~&ctcc615,$~&complete$~&genome_36_-1    PF13604.6       1.3e-23 76.5
MH552500.1$~&CrAssphage$~&sp.$~&isolate$~&ctcc615,$~&complete$~&genome_10_-1    PF00692.19      8.7e-19 60.3
MH552500.1$~&CrAssphage$~&sp.$~&isolate$~&ctcc615,$~&complete$~&genome_12_-1    PF05565.11      2.2000000000000002e-18  59.6

test_VIBRANT_out_SLOW/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id      evalue  score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1  PF00580.21      5.499999999999999e-78   252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1   PF00580.21      1.2999999999999998e-64  209.0

I've attached a gzip archive of all 4 output directories. You can check the log directory for the exact command used to generate each output. It totally could be something weird with my system too. Happy to try any other suggestions you have!

Best, Bryan VIBRANT_tests.tar.gz

KrisKieft commented 4 years ago

Hi Bryan,

Fortunately I found the issue and it's easy to fix; unfortunately it might not be the solution you want. I ran VIBRANT on both a Linux server running HMMER v3.1b and my MacBook Pro running HMMER v3.2.1. Version 3.2.1 caused inconsistencies in the hmmsearch outputs, while 3.1b remained consistent. I'm not sure which version produces the "correct" answer, but I'm more inclined to trust the consistent v3.1b. This may be because the Pfam HMMs were compiled with v3.1, whereas the KEGG and VOG HMMs were compiled with v3.2.1; that seems like a reasonable assumption since the inconsistencies show up in the Pfam HMM outputs. You can check your version by running hmmsearch -h and looking at the top of the help page. Identifying which version produces the most accurate results will take me some time, so I'll get back to you on what I find. Meanwhile, downgrading to 3.1b (the 2015 release) may solve your issue. I imagine many others have more current versions (3.2, 3.2.1, or 3.3) and may also be experiencing this without noticing. Let me know what you think.
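
If you installed HMMER through conda, something like this should handle the downgrade (the exact version string available may vary by channel):

conda install -c bioconda hmmer=3.1b2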

Kris

brymerr921 commented 4 years ago

Hi Kris,

I was using HMMER v3.3. I'll try downgrading to 3.1b and let you know how it goes.

Best, Bryan

KrisKieft commented 4 years ago

Hi Bryan,

I believe I was wrong that the HMMER version was the issue. Either way, I identified a different issue that was causing inconsistencies in the results, and it may explain what you are encountering; I'll be pushing v1.1.0 to resolve it. Also, different versions of HMMER handle --cpu differently: the newer versions are compatible with --cpu 0 but the older ones are not. The update will take this into account and use either --cpu 0 or --cpu 1 depending on the version of HMMER you're running (VIBRANT will detect the version). I'm going to close this issue because I believe it's resolved, but I'll keep monitoring to make sure that's the case.
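
For reference, the version check will look roughly like this (a shell sketch of the idea rather than the exact Python going into v1.1.0; the database and chunk names are placeholders):

# hmmsearch -h prints a header line like "# HMMER 3.3 (Nov 2019); http://hmmer.org/"
VERSION=$(hmmsearch -h | grep -m1 '^# HMMER' | awk '{print $3}')
case "$VERSION" in
    3.1*) CPU_FLAG=1 ;;  # older releases: keep --cpu 1
    *)    CPU_FLAG=0 ;;  # newer releases: --cpu 0 runs each process without extra worker threads
esac
hmmsearch --cpu "$CPU_FLAG" --tblout chunk_1.hmmtbl profiles.hmm chunk_1.faa > /dev/null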

Kris

brymerr921 commented 4 years ago

Hi @KrisKieft, thanks so much for working on this! I'll look for v1.1.0 on bioconda and give it a shot. Much appreciated! Thanks again for the awesome tool.