AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

Recommended minimum length #28

Closed Puumanamana closed 3 years ago

Puumanamana commented 3 years ago

Hi,

Thank you for developing VIBRANT. It looks very promising and I can't wait to test it. I noticed that VIBRANT sets a minimum contig length of 1 kb by default. Would you recommend against going lower? Do you know how sensitivity and specificity are affected by contig length? I'm mostly worried about false negatives (not calling a virus when it actually is one). Is there any parameter I can adjust to reduce those?

KrisKieft commented 3 years ago

Hi,

Thank you for your interest in VIBRANT!

To your first question, there is no way to go below 1 kb. In my personal opinion, 1 kb provides very little information and anything lower (e.g., 500 bp) provides no useful information at all. In my own analyses I rarely use scaffolds shorter than 3-5 kb. For VIBRANT specifically, a main limitation is that it also won't accept scaffolds with fewer than 4 open reading frames (proteins), so 1 kb is the minimum length only if the scaffold also encodes at least 4 proteins.
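As a rough illustration of how the two cutoffs above combine, here is a minimal Python sketch. It is not VIBRANT's actual code: the `Scaffold` record, the function name, and the example values are hypothetical, and only the 1 kb and 4 ORF thresholds come from the comment above.

```python
from typing import NamedTuple

MIN_LENGTH_BP = 1000  # minimum scaffold length (1 kb), as stated above
MIN_ORFS = 4          # minimum number of open reading frames, as stated above


class Scaffold(NamedTuple):
    """Hypothetical record for illustration only."""
    name: str
    length_bp: int
    n_orfs: int


def passes_input_filter(s: Scaffold,
                        min_length_bp: int = MIN_LENGTH_BP,
                        min_orfs: int = MIN_ORFS) -> bool:
    # A scaffold must satisfy BOTH cutoffs to be analyzed.
    return s.length_bp >= min_length_bp and s.n_orfs >= min_orfs


scaffolds = [
    Scaffold("short_dense", 1200, 4),   # long enough, enough ORFs
    Scaffold("short_sparse", 1200, 2),  # long enough, too few ORFs
    Scaffold("too_short", 800, 4),      # enough ORFs, too short
]

kept = [s.name for s in scaffolds if passes_input_filter(s)]
print(kept)
```

In this sketch only `short_dense` is kept, which mirrors the point that a 1 kb scaffold is usable only when it also encodes at least 4 proteins.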

Along with that last comment, sensitivity and specificity rely mainly on the number of proteins present, not the scaffold length. So a 10 kb scaffold with 5 proteins is harder to analyze than a 5 kb scaffold with 10 proteins. There is not much of a difference in specificity, but recall drops with smaller scaffolds. Check out Supplementary Figure 1 in the manuscript. That figure only covers 1 kb and 3 kb scaffolds to show the lowest extremes. With 3 kb scaffolds VIBRANT performs great; it just misses many of the 1 kb scaffolds because they do not encode 4 proteins.

If you have a dataset that consists of viruses, then you can use the -virome setting. It alters the filtering applied after the machine learning model predicts viruses. This increases recall but also increases the number of bacterial/archaeal sequences and plasmids falsely identified as viruses. The setting is also an option if you aren't worried about false identifications.
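For reference, a minimal Python sketch of assembling a command line with the -virome setting switched on. The script name `VIBRANT_run.py`, the `-i` input flag, and the file name `scaffolds.fasta` are assumptions not stated in this thread, so check `VIBRANT_run.py -h` for the exact options in your installed version. The sketch only builds the command; it does not run it.

```python
from typing import List


def build_vibrant_command(fasta_path: str, virome: bool = False) -> List[str]:
    # Assumed entry point and input flag; verify against the tool's help output.
    cmd = ["VIBRANT_run.py", "-i", fasta_path]
    if virome:
        # Setting discussed above: higher recall, more false identifications.
        cmd.append("-virome")
    return cmd


# Hypothetical input file name, for illustration only.
command = build_vibrant_command("scaffolds.fasta", virome=True)
print(" ".join(command))
```

Building the command as a list makes it easy to pass to `subprocess.run` later, and to compare runs with and without -virome on the same input.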

I hope that helps. Let me know if you need anything else.

Kris

Puumanamana commented 3 years ago

Thank you for your quick answer. This is all very helpful. I think I'm going to use -virome then. My dataset is not composed of viruses only (far from it), but I want to miss as few as possible.

Cédric