AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

Reproducibility issues #25

Closed felipehcoutinho closed 3 years ago

felipehcoutinho commented 3 years ago

Hi there,

Congrats again on putting together this awesome tool. I've been using VIBRANT for a while now but recently came across some issues that I wanted to discuss.

In summary, I have been getting different results for the exact same sequences between different runs of VIBRANT. I originally experienced the issue with v1.0.1, which led me to upgrade to v1.2.1, but the issue persisted.

Here is what happened: I ran VIBRANT on a set of 48,227 soil virome sequences, all 5 kbp or longer, with the following commands:

python3 ../VIBRANT_run.py -i Full_Goller_2020_Genomes.fasta -t 140

and

python3 ../VIBRANT_run.py -i Full_Goller_2020_Genomes.fasta -t 100

Both commands resulted in the same number of identified phages (21,644), yet the actual sequences identified as phages differed, as shown in "/VIBRANT_results_Full_Goller_2020_Genomes/VIBRANT_genome_quality_Full_Goller_2020_Genomes.tsv".

Namely, in the first run, sequence "022-TFF-IBDA-C14072" was not identified as a putative phage, while in the second run it was identified as a low quality draft / lytic phage. Conversely, in the first run sequence "045-PEG-IDBA-C1763" was identified as a low quality draft / lytic phage, but was not classified as a phage in the second run.
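
For reference, a quick way to pull these disagreements out of the two quality tables (illustrative paths, and I am assuming the first column of the table is named "scaffold"):

    # Compare the genome-quality tables from two runs and list the
    # scaffolds reported as phage in only one of them.
    import pandas as pd

    run1 = pd.read_csv("run_t140/VIBRANT_genome_quality_Full_Goller_2020_Genomes.tsv", sep="\t")
    run2 = pd.read_csv("run_t100/VIBRANT_genome_quality_Full_Goller_2020_Genomes.tsv", sep="\t")

    merged = run1.merge(run2, on="scaffold", how="outer",
                        suffixes=("_t140", "_t100"), indicator=True)
    # Rows present in only one table are the disagreements.
    print(merged.loc[merged["_merge"] != "both", "scaffold"])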

This was the first issue that I encountered, and it suggests that the number of threads used might affect the results. I was wondering whether this is expected behavior due to some stochastic step in the analysis.

I also encountered a second issue related to these sequences. I performed a third run of VIBRANT using a large set of sequences from metagenomes and viromes from multiple ecosystems, which included the full soil dataset mentioned above (827,571 sequences in total). The command used was:

python3 ../VIBRANT_run.py -t 140 -i /home/rohit/felipe/Databases/RefSeqVir_Oct_19/RF_Host_Pred/External_Validation_Set/Mixed_Set5_Validation/Mixed_Set5_Genomes.fasta

In this particular run, neither "022-TFF-IBDA-C14072" nor "045-PEG-IDBA-C1763" was classified as a phage. This brings me to the second issue: it suggests that the classification of a single sequence might somehow be affected by the other sequences in the input fasta file. Could you please comment on this?

Below are my system specs:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 2599.999
CPU max MHz: 2301,0000
CPU min MHz: 1000,0000
BogoMIPS: 4601.93
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-17,72-89
NUMA node1 CPU(s): 18-35,90-107
NUMA node2 CPU(s): 36-53,108-125
NUMA node3 CPU(s): 54-71,126-143

The output of pip3 freeze:

apturl==0.5.2
asteval==0.9.13
attrs==19.3.0
awscli==1.16.133
backports.lzma==0.0.10
bcbio-gff==0.6.4
beautifulsoup4==4.4.1
biolib==0.0.46
biopython==1.66
blinker==1.3
botocore==1.12.123
Brlapi==0.6.4
chardet==2.3.0
checkbox-support==0.22
checkm-genome==1.1.2
colorama==0.3.7
command-not-found==0.3
comparem==0.0.23
cryptography==1.2.3
cssselect==1.0.3
cycler==0.10.0
decorator==4.2.1
defer==1.0.6
DendroPy==4.4.0
docutils==0.14
drep==2.5.3
feedparser==5.1.3
future==0.16.0
guacamole==0.9.2
helperlibs==0.1.9
html5lib==0.999
httplib2==0.9.1
idna==2.0
importlib-metadata==1.5.0
Jinja2==2.8
jmespath==0.9.3
joblib==0.14.1
language-selector==0.1
lmfit==0.9.13
louis==2.6.4
lxml==3.5.0
Mako==1.0.3
MarkupSafe==0.23
matplotlib==2.1.2
mcorr==20180506
more-itertools==8.2.0
mpld3==0.3
ncbi-genome-download==0.2.6
networkx==2.1
nose==1.3.7
numpy==1.17.0
oauthlib==1.0.3
onboard==1.2.0
packaging==20.1
padme==1.1.1
pandas==0.22.0
pathlib2==2.3.5
pexpect==4.0.1
pickle-mixin==1.0.2
Pillow==3.1.2
plainbox==0.25
pluggy==0.13.1
ppanggolin==1.1.85
protobuf==3.0.0b2
ptyprocess==0.5
py==1.8.1
pyasn1==0.1.9
pycups==1.9.73
pycurl==7.43.0
pyExcelerator==0.6.4.1
pyfaidx==0.5.9
pygobject==3.20.0
PyJWT==1.3.0
pyparsing==2.0.3
PyQt5==5.14.2
PyQt5-sip==12.7.2
pyquery==1.2.9
pysam==0.16.0.1
pysvg==0.2.2
pytest==5.3.5
python-apt==1.1.0b1+ubuntu0.16.4.1
python-dateutil==2.6.1
python-debian==0.1.27
python-systemd==231
pytz==2017.3
pyxdg==0.25
PyYAML==3.12
reportlab==3.3.0
requests==2.9.1
rpy2==2.9.2
rsa==3.4.2
s3transfer==0.2.0
scikit-learn==0.21.3
scipy==1.0.0
seaborn==0.9.0
sessioninstaller==0.0.0
six==1.12.0
ssh-import-id==5.5
system-service==0.3
tensorflow==0.10.0rc0
tqdm==4.31.1
ubuntu-drivers-common==0.0.0
ufw==0.35
unattended-upgrades==0.1
uncertainties==3.0.3
unity-scope-calculator==0.1
unity-scope-chromiumbookmarks==0.1
unity-scope-colourlovers==0.1
unity-scope-devhelp==0.1
unity-scope-firefoxbookmarks==0.1
unity-scope-gdrive==0.7
unity-scope-manpages==0.1
unity-scope-openclipart==0.1
unity-scope-texdoc==0.1
unity-scope-tomboy==0.1
unity-scope-virtualbox==0.1
unity-scope-yelp==0.1
unity-scope-zotero==0.1
urllib3==1.13.1
usb-creator==0.3.0
virtualenv==15.0.1
wcwidth==0.1.8
xdiagnose==3.8.4.1
xkit==0.0.0
XlsxWriter==0.7.3
zipp==1.2.0

And here is a link to download the fasta files that I used as input:

-- link deleted --

KrisKieft commented 3 years ago

Hi,

I appreciate that you are finding VIBRANT useful, thank you. This was a known issue with v1.0.1 that did correlate with differing threads/runs, but it should not have persisted after updating to v1.2.1. The first thing I would check is the numpy and scikit-learn versions, which both look correct in your pip3 freeze output. As a secondary check, you can look in the VIBRANT_log_run file, which would contain a "CAUTION" statement if scikit-learn were a different version. You can also check the log file to make sure nothing strange happened and that you did indeed run v1.2.1. The same goes for running /scripts/VIBRANT_annotation.py --version to double-check that the auxiliary script was updated to v1.2.1 as well.
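
For example, a quick sanity check from the same python3 environment used for VIBRANT (the versions you posted, numpy 1.17.0 and scikit-learn 0.21.3, are the expected ones):

    # Confirm the library versions VIBRANT will actually import.
    import numpy
    import sklearn
    print("numpy:", numpy.__version__)           # expected: 1.17.0
    print("scikit-learn:", sklearn.__version__)  # expected: 0.21.3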

I'm not an expert in the mechanics of how CPUs manage threads, but I would also suggest doing a couple of runs at ~100-120 threads to see if the results are the same. It's possible that using -t 140 and leaving only 4 threads free was not enough for background system processes and managing Python transitions. I doubt this is the issue, but it's a possibility.
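
A small driver like this could automate those comparison runs (a sketch only: the output folder name is an assumption here, so adjust it to whatever your runs actually produce):

    # Hypothetical helper: rerun VIBRANT at a few thread counts and
    # keep each run's output under a separate name.
    import shutil
    import subprocess

    for t in (100, 110, 120):
        subprocess.run(
            ["python3", "../VIBRANT_run.py",
             "-i", "Full_Goller_2020_Genomes.fasta", "-t", str(t)],
            check=True,
        )
        # Assumed output folder name; rename it so the next run
        # does not overwrite these results.
        shutil.move("VIBRANT_Full_Goller_2020_Genomes",
                    f"VIBRANT_Full_Goller_2020_Genomes_t{t}")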

I downloaded your data and I'll do a couple of runs of my own to check the results. The results for one sequence should not affect the others. I'll get back soon.

Kris

felipehcoutinho commented 3 years ago

Thank you very much for the speedy reply.

I can confirm that there were no "CAUTION" statements in any of the log files, and that the versions of VIBRANT and VIBRANT_annotation.py were both v1.2.1. I am currently re-running the analysis with 100 threads and will let you know once I have the results.

KrisKieft commented 3 years ago

I ran your dataset with 25 and 30 threads. I also ended up with two scaffolds (022-TFF-IBDA-C17350 and 045-PEG-IDBA-C1763) that differed between the results.

For both of these scaffolds, the neural network model predicted them differently between the two runs (virus:plasmid and virus:organism). You can find this data in VIBRANT_results_*/VIBRANT_machine_*. That is the source of the differential results.

My assumption is that these scaffolds sit close enough to the "gray area" between classifications that the model will place them in different groups depending on the run. The model uses a "random state" for its random number generation, and I do not have that fixed to a set value. This can lead to very minor reproducibility issues; for you this turns out to be about 2 sequences out of a total of 48,227. As a common example, if you run BLAST on two different machines you'll likely end up with very slightly different e-values due to random number generation in the algorithm.

I would not be concerned about this level of reproducibility, as it probably will not occur for most datasets. I hope that helps, and please let me know if you have any more questions.
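
To illustrate the idea, here is a minimal scikit-learn sketch (not VIBRANT's actual code or data) of how a neural network trained without a fixed random_state can flip borderline predictions from one run to the next:

    # Minimal sketch: an unset random_state means different weight
    # initialization and shuffling each run, so points near the
    # decision boundary can land in different classes.
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    # Synthetic stand-in data with heavily overlapping classes.
    X, y = make_classification(n_samples=2000, class_sep=0.5,
                               random_state=0)

    def train_and_predict():
        clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)
        clf.fit(X, y)  # no random_state passed -> nondeterministic
        return clf.predict(X)

    a, b = train_and_predict(), train_and_predict()
    print("predictions that flipped between runs:", (a != b).sum())
    # Passing random_state=<some int> to MLPClassifier would make
    # the two runs identical.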

Kris

felipehcoutinho commented 3 years ago

Thank you for the input. I ran VIBRANT three times on the same dataset using 100 threads and got consistent results, so it does seem that, at least on my server, the issue is related to using too many threads.

Could you share the results from your runs? I want to make one last comparison just to make sure everything is OK with my installation.

KrisKieft commented 3 years ago

Here are the names of the identified scaffolds from my two runs.

Full_Goller_2020_Genomes_V2.phages_combined.txt Full_Goller_2020_Genomes.phages_combined_V1.txt
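
If useful, a quick way to diff the two lists (assuming one scaffold name per line in each file):

    # Compare the two identified-phage lists shared above.
    with open("Full_Goller_2020_Genomes.phages_combined_V1.txt") as f:
        v1 = set(f.read().split())
    with open("Full_Goller_2020_Genomes_V2.phages_combined.txt") as f:
        v2 = set(f.read().split())
    print("only in V1:", sorted(v1 - v2))
    print("only in V2:", sorted(v2 - v1))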

felipehcoutinho commented 3 years ago

The results of V1 match perfectly with my runs using 100 threads. Thank you for the help.