felipehcoutinho closed this issue 3 years ago
Hi,
I appreciate that you are finding VIBRANT useful, thank you. This was a known issue with v1.0.1 that did correlate with differing threads/runs, but it should not have persisted after updating to v1.2.1. The first thing I would check is the numpy and scikit-learn versions, which both look correct. As a secondary check, you can look in the VIBRANT_log_run file; there would be a "CAUTION" statement if scikit-learn were a different version. You can also check the log file to make sure nothing strange happened and that you did indeed run v1.2.1. The same goes for running /scripts/VIBRANT_annotation.py --version to double check that the auxiliary script was updated to v1.2.1 as well.
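As a quick sanity check, you can also query the installed library versions from the same Python environment that runs VIBRANT. This is just a generic sketch using the standard library, not something VIBRANT itself provides:

```python
# Report installed versions of VIBRANT's key dependencies using the
# same interpreter that runs VIBRANT_run.py. Uses only the standard
# library, so it works even if a package turns out to be missing.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("numpy", "scikit-learn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

If the versions printed here differ from what pip3 freeze reports, the runs are likely using a different Python environment than you expect.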
I'm not an expert in the mechanics of how CPUs manage threads, but I would also suggest doing a couple of runs at ~100-120 threads to see whether the results are the same. It's possible that using -t 140 and leaving only 4 remaining threads was not enough for background system processes and managing Python transitions. I doubt this is the issue, but it's a possibility.
I downloaded your data and I'll do a couple of runs of my own to check the results. The results of one sequence should not affect the other. I'll get back soon.
Kris
Thank you very much for the speedy reply.
I can confirm that there were no "CAUTION" statements in any of the log files and that the version of VIBRANT and of VIBRANT_annotation.py were both v1.2.1. I am currently re-running the analysis with 100 threads and will let you know once I have the results.
I ran your dataset with 25 and 30 threads. I also ended up with two scaffolds (022-TFF-IBDA-C17350 and 045-PEG-IDBA-C1763) that differed between the results.
For both of these scaffolds the neural network model predicted them differently between the two runs (virus:plasmid versus virus:organism). You can find this data in VIBRANT_results_*/VIBRANT_machine_*. That would be the source of the differential results.

My assumption is that these scaffolds are close enough to the "gray area" between classifications that the model will place them in different groups depending on the run. The model has a "random state" for the random number generation that it uses, and I do not have it set to a fixed value. This can lead to very minor reproducibility issues; for you this turns out to be about 2 sequences out of a total of 48,227. As a common example, if you run BLAST on two different machines you'll likely end up with very slightly different e-values due to random number generation in the algorithm. I would not be concerned about this level of reproducibility, as it probably will not happen for most datasets. I hope that helps, and please let me know if you have any more questions.
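To illustrate what an unset random state means in practice, here is a minimal toy sketch with scikit-learn (a generic MLPClassifier on synthetic data, not VIBRANT's actual model or training set): when random_state is fixed, repeated fits give identical predictions; leaving it unset is what allows borderline samples to flip between runs.

```python
# Sketch: scikit-learn estimators use a random number generator for
# weight initialization and data shuffling. Fixing random_state makes
# repeated fits reproducible; leaving it unset can flip predictions
# for borderline samples. Toy data only, not VIBRANT's model.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two independent fits with the same fixed random_state.
preds = [
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                  random_state=42).fit(X, y).predict(X)
    for _ in range(2)
]
print((preds[0] == preds[1]).all())  # True: identical predictions
```

Dropping random_state=42 from the constructor removes that guarantee, which is the situation described above.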
Kris
Thank you for the input. I ran VIBRANT using 100 threads three times on the same dataset and got consistent results, so it does seem that, at least on my server, the issue is related to using too many threads.
Could you share the results from your run? I want to make a last comparison just to make sure everything is ok with my installation.
Here are the names of identified scaffolds for my two runs.
Full_Goller_2020_Genomes_V2.phages_combined.txt Full_Goller_2020_Genomes.phages_combined_V1.txt
The results of V1 match perfectly with my runs using 100 threads. Thank you for the help.
Hi there,
Congrats again on putting together this awesome tool. I've been using VIBRANT for a while now but recently came across some issues that I wanted to discuss.
In summary, I have been getting different results for the exact same sequence between different runs of VIBRANT. I have originally experienced the issue with v1.0.1 which led me to upgrade to v1.2.1, but the issue persisted.
Here is what happened: I ran VIBRANT on a set of 48,227 soil virome sequences, all 5 Kbp or longer, with the following commands:
python3 ../VIBRANT_run.py -i Full_Goller_2020_Genomes.fasta -t 140
and
python3 ../VIBRANT_run.py -i Full_Goller_2020_Genomes.fasta -t 100
Both commands result in the same number of phages identified (21,644). Yet the actual sequences identified as phages differ, as per the information in "/VIBRANT_results_Full_Goller_2020_Genomes/VIBRANT_genome_quality_Full_Goller_2020_Genomes.tsv".
Namely, in the first run, sequence "022-TFF-IBDA-C14072" was not identified as a putative phage, while in the second run it was identified as a low quality draft / lytic phage. Conversely, in the first run sequence "045-PEG-IDBA-C1763" was identified as a low quality draft / lytic phage, but was not classified as a phage in the second run.
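A quick way to pin down exactly which scaffolds differ between two runs is a set difference over the per-run lists of identified phages (e.g. the *.phages_combined.txt outputs). This is just a sketch; the file paths are placeholders and each file is assumed to list one scaffold name per line:

```python
# Sketch: compare the phage identifications from two VIBRANT runs by
# set difference. Paths are hypothetical placeholders; each file is
# assumed to hold one scaffold name per line (first tab-separated field).
from pathlib import Path

def load_ids(path):
    """Return the set of scaffold names listed in a results file."""
    return {line.split("\t")[0].strip()
            for line in Path(path).read_text().splitlines()
            if line.strip()}

def diff_runs(ids_a, ids_b):
    """Scaffolds unique to each run, plus the shared count."""
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a), len(ids_a & ids_b)

# Usage (placeholder paths):
# run1 = load_ids("run1.phages_combined.txt")
# run2 = load_ids("run2.phages_combined.txt")
# only1, only2, shared = diff_runs(run1, run2)
```

With 48,227 input sequences this confirms in seconds whether the two runs disagree on only those two scaffolds or on more.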
This was the first issue that I encountered, suggesting that the number of threads used might affect the results. I was wondering whether this is expected behavior due to some stochastic step in the analysis.
I have also encountered a second issue related to these sequences. I performed a third run of VIBRANT using a large set of sequences from metagenomes and viromes from multiple ecosystems which included the full aforementioned soil dataset (Total 827,571 sequences). The command used was:
python3 ../VIBRANT_run.py -t 140 -i /home/rohit/felipe/Databases/RefSeqVir_Oct_19/RF_Host_Pred/External_Validation_Set/Mixed_Set5_Validation/Mixed_Set5_Genomes.fasta
In this particular run, neither "022-TFF-IBDA-C14072" nor "045-PEG-IDBA-C1763" was classified as a phage. This brings me to the second issue: it suggests that the results for a single sequence might somehow be affected by the other sequences in the input fasta file. Could you please comment on this?
Below are my system specs:
The output of pip3 freeze:
And here is a link to downloading the fasta files that I used as input:
-- link deleted --