Closed: brymerr921 closed this issue 4 years ago.
Hi Bryan,
Thank you for the great insight!
First, I don't consider e-values when selecting annotations; rather, I use scores. Scores are consistent because they depend on the HMM profile rather than the database/dataset size. So regardless of how many parallel runs or how many input proteins there are, each protein should receive the same `hmmsearch` score per HMM. As you mentioned, e-values differ between runs when using `hmmsearch`, but they should be consistent when using `hmmscan` (VIBRANT does not use `hmmscan`, again for speed).
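The score-based selection described above can be sketched as follows; the tblout-style sample lines and the 50-bit cutoff are illustrative assumptions, not VIBRANT's actual parsing code:

```python
# Minimal sketch of score-based hit selection. Bit scores depend only on the
# profile and the sequence, so they are stable across parallel runs, while
# E-values scale with the size of the search space.
# The sample text mimics `hmmsearch --tblout` output: whitespace-separated
# fields, with the full-sequence E-value and score in columns 5 and 6.

SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
seq_12_-1      -          PF00580.21  -          1.6e-77  252.9  0.1
seq_6_-1       -          PF00580.21  -          3.7e-64  209.0  0.0
"""

def hits_by_score(tblout_text, min_score):
    """Return (target, query, score) tuples whose bit score passes the cutoff."""
    hits = []
    for line in tblout_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        score = float(fields[5])  # full-sequence bit score, not the E-value
        if score >= min_score:
            hits.append((fields[0], fields[2], score))
    return hits
```

Filtering on `fields[5]` (the score) instead of `fields[4]` (the E-value) is what makes the selection independent of how the input was chunked.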
Second, I'm not sure what reproducibility issue you're running into. Not that I don't believe you, but if I run the same dataset multiple times with different parallel runs I still get the same number of annotations. For example, I get zero "vpfam" hits for each run. Maybe there's just something odd happening with `wc -l` and each file has a different number of blank lines. You should open those files and see if there are actually hits. If there are indeed different numbers of hits, then please let me know.
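As a quick illustration of the `wc -l` point (using a hypothetical two-row table, not the actual VIBRANT output): trailing blank lines inflate a raw line count even when the set of hits is identical.

```python
# Sketch: `wc -l` counts every line, blank or not, so two parsed HMM tables
# with identical hits can report different counts if one has extra blank
# lines. Counting only non-blank data rows avoids the discrepancy.

def total_lines(text):
    return len(text.splitlines())        # what `wc -l` effectively reports

def hit_rows(text):
    """Count non-blank rows, skipping the header line."""
    rows = [ln for ln in text.splitlines() if ln.strip()]
    return len(rows) - 1                 # minus the "protein id evalue score" header

table_a = "protein\tid\tevalue\tscore\nseq_1\tPF00580.21\t1e-77\t252.9\n\n\n"
table_b = "protein\tid\tevalue\tscore\nseq_1\tPF00580.21\t1e-77\t252.9\n"
```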
Third, I will definitely check out the `--cpu 0` option. If you give VIBRANT `-t 6`, then VIBRANT will break the input file into 6 separate files and run `hmmsearch --cpu 1` on each to increase speed, rather than running `hmmsearch --cpu 6`, since hmmsearch doesn't multi-thread well. If `--cpu 0` will make this even faster then I'd be happy! hmmsearch is the bottleneck of VIBRANT's speed. I just ran some runtime tests, so I can do a comparison of `--cpu 1` vs `--cpu 0`.
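The split-and-parallelize scheme described above might be sketched like this (the chunk paths and database name are placeholders, and the commands are only constructed as strings, not executed):

```python
# Sketch of VIBRANT's parallelization strategy: split the input proteins into
# -t chunks and build one single-threaded `hmmsearch --cpu 1` command per
# chunk, instead of one multi-threaded `hmmsearch --cpu 6` run.

def split_records(records, n):
    """Round-robin a list of (header, sequence) records into n chunks."""
    chunks = [[] for _ in range(n)]
    for i, record in enumerate(records):
        chunks[i % n].append(record)
    return chunks

def hmmsearch_commands(chunk_paths, hmm_db, cpu=1):
    """One hmmsearch invocation per chunk file."""
    return [f"hmmsearch --cpu {cpu} --tblout {path}.tbl {hmm_db} {path}"
            for path in chunk_paths]

records = [(f"prot_{i}", "MKVL" * 25) for i in range(10)]
chunks = split_records(records, 6)
commands = hmmsearch_commands([f"chunk_{i}.faa" for i in range(6)], "Pfam-A.hmm")
```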
Thanks again for taking the time to flush out any potential issues or potential improvements. Let me know what you come up with for the reproducibility issue and if you have any other questions.
Kris
Hi Kris,
Here's the text output from the vpfam tables. `FAST` and `FAST2` are identical runs with all `hmmsearch` steps set to `--cpu 0`. `SLOW` and `SLOW2` are identical runs with all `hmmsearch` steps set to `--cpu 1` (default).
test_VIBRANT_out_FAST2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id evalue score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1 PF00580.21 1.5999999999999998e-77 252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1 PF00580.21 3.699999999999999e-64 209.0
test_VIBRANT_out_FAST/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id evalue score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1 PF00580.21 7.1999999999999995e-78 252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1 PF00580.21 1.6999999999999997e-64 209.0
MH552510.2$~&Microviridae$~&sp.$~&isolate$~&ctch16,$~&complete$~&genome_1_1 PF02305.17 3.9e-21 66.0
test_VIBRANT_out_SLOW2/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id evalue score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1 PF00580.21 2.9999999999999994e-77 252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1 PF00580.21 6.799999999999999e-64 209.0
MH552500.1$~&CrAssphage$~&sp.$~&isolate$~&ctcc615,$~&complete$~&genome_36_-1 PF13604.6 1.3e-23 76.5
MH552500.1$~&CrAssphage$~&sp.$~&isolate$~&ctcc615,$~&complete$~&genome_10_-1 PF00692.19 8.7e-19 60.3
MH552500.1$~&CrAssphage$~&sp.$~&isolate$~&ctcc615,$~&complete$~&genome_12_-1 PF05565.11 2.2000000000000002e-18 59.6
test_VIBRANT_out_SLOW/VIBRANT_mixed_example/VIBRANT_HMM_tables_parsed_mixed_example/mixed_example.vpfam_hmmtbl_parse.tsv
protein id evalue score
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_12_-1 PF00580.21 5.499999999999999e-78 252.9
LBQB01000010.1$~&Candidate$~&division$~&CPR3$~&bacterium$~&GW2011_GWF2_35_18$~&UR67_C0010,$~&whole$~&genome$~&shotgun$~&sequence_6_-1 PF00580.21 1.2999999999999998e-64 209.0
I've attached a gzip archive of all 4 output directories. You can check the log directory for the exact command used to generate each output. It totally could be something weird with my system too. Happy to try any other suggestions you have!
Best, Bryan

Attachment: VIBRANT_tests.tar.gz
Hi Bryan,
Fortunately I found the issue, and it's easy to fix. Unfortunately it might not be the best solution or the fix that you want. I ran VIBRANT on both a Linux server running HMMER v3.1b and my MacBook Pro running HMMER v3.2.1. Version 3.2.1 caused inconsistencies in the hmmsearch outputs while 3.1b remained consistent. I'm not sure which version produces the "correct" answer, but I'd put more trust in the consistent v3.1b. This may be because the Pfam HMMs were compiled using v3.1 whereas the KEGG and VOG HMMs were compiled using v3.2.1; that seems like a reasonable assumption since the inconsistencies are noticeable in the Pfam HMM outputs. You can check your version by running `hmmsearch -h` and looking at the top of the help page. Identifying which version will produce the most accurate results will take me some time, so I'll get back to you on what I find. Meanwhile, downgrading to v3.1b (2015 release) may solve your issues. I imagine many others have more current versions (3.2, 3.2.1 or 3.3) and may also be experiencing this without noticing. Let me know what you think.
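Checking the version programmatically is straightforward. A sketch, assuming the banner at the top of the `hmmsearch -h` help page has the usual `# HMMER <version> (<date>)` shape (the sample text below imitates that banner; a real check would capture the output of `hmmsearch -h` instead):

```python
import re

# Sketch: pull the HMMER version string out of the `hmmsearch -h` banner.

SAMPLE_HELP = (
    "# hmmsearch :: search profile(s) against a sequence database\n"
    "# HMMER 3.1b2 (February 2015); http://hmmer.org/\n"
)

def hmmer_version(help_text):
    """Return the version string from the help banner, or None if absent."""
    match = re.search(r"#\s*HMMER\s+(\S+)", help_text)
    return match.group(1) if match else None
```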
Kris
Hi Kris,
I was using HMMER v3.3. I'll try downgrading to 3.1b and let you know how it goes.
Best, Bryan
Hi Bryan,
I believe I was wrong that the HMMER version was the issue. Either way, I identified a different issue that was causing inconsistencies in the results, and it may explain what you are encountering. I'll be pushing v1.1.0 to resolve the issue. Also, different versions of HMMER have different usages of `--cpu`. The newer versions are compatible with `--cpu 0` but the older versions are not. The update will also take this into account and use either `--cpu 0` or `--cpu 1` depending on the version of HMMER that you're running (VIBRANT will detect the version). I'm going to close this issue because I believe I have resolved it, but I'll monitor this to make sure that is the case.
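The version-dependent flag choice described here could be sketched as follows. The `3.2` cutoff is an assumption based on this thread, and the parsing is simplified; VIBRANT's actual detection logic may differ:

```python
# Sketch: pick --cpu 0 on newer HMMER releases and fall back to --cpu 1 on
# older ones that reject a zero value. Versions like "3.1b2" are truncated at
# the "b" before comparing major.minor numerically.

def cpu_flag(version):
    """Return the --cpu value to pass to hmmsearch for a HMMER version string."""
    numeric = version.split("b")[0]              # "3.1b2" -> "3.1"
    parts = numeric.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return 0 if (major, minor) >= (3, 2) else 1
```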
Kris
Hi @KrisKieft, thanks so much for working on this! I'll look for v1.1.0 in bioconda and give it a shot. Much appreciated! Thanks again for the awesome tool.
Hi VIBRANT developers,
First off, thanks for putting together such a great tool. I'm very excited to use it and think it will be really helpful in my research.
I'm noticing that when I run VIBRANT twice on the example dataset, some of the results change. I don't know whether this has an impact on the final output or not, but I think `hmmsearch` should return the same results when run multiple times. Beyond the files listed below, I haven't yet done any further investigation. I installed VIBRANT using `conda install -c bioconda vibrant` and got version `1.0.1`. Next, I cloned this repo into my working directory and then ran the pipeline, which took me ~2.5 minutes. I ran the exact same command again, specifying a different output folder, which again took ~2.5 minutes. I then counted lines with `wc -l` to see how many parsed HMM hits I got for each run, compared with the example data provided in this GitHub repo, and got:
The number of plasmid and vpfam hits is not the same between runs. I don't know enough about `hmmsearch` to know whether this is intended behavior or not.

In other projects I have implemented `hmmsearch` similarly to how you do it, because it is much faster to break input files into n pieces and run n instances of `hmmsearch` with one CPU in parallel. However, I noticed that in my earlier projects and also here, VIBRANT spins up 6 instances of `hmmsearch --cpu 1`, yet when I check resource usage via `htop`, ~12 threads are busy. I think this is because `hmmsearch --cpu` means *additional* CPUs, so setting `--cpu 1` means each `hmmsearch` process occupies 2 threads. Surprisingly, I've seen significant speedups when breaking up my input files and feeding them to `hmmsearch` with `--cpu 0`, even without spawning additional parallel `hmmsearch` processes. Spawning additional parallel `hmmsearch --cpu 0` processes makes things go even faster. I was interested in whether I could similarly speed up VIBRANT (I have many enormous assemblies), so I changed `--cpu 1` to `--cpu 0` in `VIBRANT_annotation.py`. I then ran the two commands again, which only took 52 seconds (wow!!!), and compared the outputs of the parsed hmmsearch files. The results are close to, but not identical with, the example files.
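The thread arithmetic above can be made explicit with a tiny sketch (assuming, as observed in `htop`, that `--cpu N` means N worker threads on top of one coordinating thread per process):

```python
# Sketch of the thread budget: each hmmsearch process uses its --cpu worker
# threads plus one coordinating thread, so 6 processes at --cpu 1 keep ~12
# threads busy, while --cpu 0 keeps each process on a single thread.

def threads_busy(n_processes, cpu_value):
    """Approximate busy threads for n parallel hmmsearch processes."""
    return n_processes * (cpu_value + 1)
```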
Anyways, hopefully this is helpful for troubleshooting the reproducibility issue, and maybe gives a window into unlocking some potentially large speed improvements to VIBRANT in the `hmmsearch` steps. Since specifying `hmmsearch --cpu 0` makes each subprocess stick to only one CPU (instead of 2 in VIBRANT's current implementation), I could realistically do `VIBRANT_run.py -t 16` on a 16-core machine and `hmmsearch` wouldn't be trying to operate on 32 cores.

Finally, if I understand correctly, `hmmsearch` calculates e-values based on the # of sequences being searched and the # of entries in the database. Splitting up the input and running the chunks separately will calculate e-values for each chunk differently, not based on the input dataset as a whole. This may or may not be an issue at all (and might not be the cause behind the changes in output between identical runs), but one option is to specify `hmmsearch -Z` for each run and set `Z` to be the # of queries in the entire input * the # of entries in the specific HMM database (see the comment from Sean Eddy here).

Thanks, and please let me know if you have any questions.
Bryan Merrill
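The `-Z` idea could be sketched as below. All numbers, paths, and the database name are purely illustrative placeholders, and `Z` follows the formula suggested in this comment (total input queries × profiles in the database); the commands are only built as strings, not executed:

```python
# Sketch: pass the same -Z (search-space size) to every chunked hmmsearch run
# so E-values are normalized to the whole dataset rather than to each chunk.

def whole_dataset_z(n_input_proteins, n_hmm_profiles):
    """Search-space size per the suggestion above: queries * HMM profiles."""
    return n_input_proteins * n_hmm_profiles

def chunk_command(chunk_path, hmm_db, z):
    """Build one hmmsearch command with a fixed -Z for a chunk file."""
    return (f"hmmsearch -Z {z} --cpu 0 "
            f"--tblout {chunk_path}.tbl {hmm_db} {chunk_path}")

z = whole_dataset_z(n_input_proteins=1200, n_hmm_profiles=17929)
commands = [chunk_command(f"chunk_{i}.faa", "Pfam-A.hmm", z) for i in range(6)]
```

Because every chunk sees the same `Z`, an identical hit scores the same E-value no matter which chunk it lands in.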