apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Difference between "Unclassified" and "Viruses" taxonomy? #19

Closed ecampbell50 closed 1 year ago

ecampbell50 commented 1 year ago

Thanks for making such a great tool! It's super easy to run and exactly what I need for my project.

Can you explain what the difference is between "Unclassified" and just "Viruses" in the taxonomy output? Does it mean that it's unknown whether the Unclassified hits are viruses at all?

For example, I have 292 'Unclassified' hits and 7668 'Viruses' hits across 2000 genomes. Does this mean the unclassified ones could possibly have been plasmid/chromosome?

apcamargo commented 1 year ago

Hi @ecampbell50. It's great to hear that geNomad has been useful for you!

Are you referring to the taxonomy column of the _virus_summary.tsv file? If so, all the sequences listed in that file were classified as viral. Sequences with an "Unclassified" value in that column could not be assigned to any virus taxon, but are likely viral regardless.
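
To see how many sequences fall into each case, the taxonomy column can be tallied directly. The snippet below is a minimal sketch, not part of geNomad: it assumes a tab-separated table with a `taxonomy` column whose ranks are separated by semicolons, and the category labels and example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_taxonomy(handle):
    """Count sequences per broad taxonomy category in a virus summary table."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Drop empty ranks left by trailing or doubled semicolons.
        ranks = [r for r in row["taxonomy"].split(";") if r]
        if not ranks or ranks == ["Unclassified"]:
            counts["viral, no taxon assigned"] += 1
        elif ranks == ["Viruses"]:
            counts["viral, no realm assigned"] += 1
        else:
            counts["viral, realm or lower assigned"] += 1
    return counts


# Made-up rows for illustration only; a real run would pass an open file.
example = io.StringIO(
    "seq_name\ttaxonomy\n"
    "contig_1\tUnclassified\n"
    "contig_2\tViruses\n"
    "contig_3\tViruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;\n"
)
print(tally_taxonomy(example))
```

For a real file, something like `with open(path, newline="") as fh: tally_taxonomy(fh)` should work, where `path` points to the `_virus_summary.tsv` output. All three categories are sequences that were classified as viral.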

ecampbell50 commented 1 year ago

Yes sorry I should've specified the file! Thanks for clarifying the 'Unclassified' part. I was wondering what the "Viruses" classification also refers to? I have some hits showing just "Viruses" as a taxonomy:

[Image: Screenshot 2023-04-04 at 16 59 35]

(Apologies I should've put this in my first message)

apcamargo commented 1 year ago

No worries!

"Viruses" just means that the genome could not be assigned to a specific realm. This can happen when, for instance, the genome was annotated by two markers with conflicting taxonomies, so there's no consensus realm.

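As a rough illustration of how conflicting marker taxonomies can collapse to just "Viruses", here is a simplified sketch that keeps only the ranks shared by every marker lineage. This is an assumption made for illustration and is not geNomad's actual consensus procedure; the function name and the "Unclassified" fallback are hypothetical.

```python
def consensus_lineage(lineages):
    """Return the longest rank prefix shared by all marker lineages."""
    split = [lineage.split(";") for lineage in lineages]
    shared = []
    # Walk the ranks from the root down and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    # Hypothetical fallback when nothing is shared at all.
    return ";".join(shared) if shared else "Unclassified"


# Two markers that agree only at the root, so no realm can be chosen.
markers = [
    "Viruses;Duplodnaviria;Heunggongvirae",
    "Viruses;Monodnaviria;Sangervirae",
]
print(consensus_lineage(markers))  # Viruses
```

When the markers agree further down the lineage, the same function returns the deeper shared prefix instead.
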
ecampbell50 commented 1 year ago

Perfect thank you so much! 😊

apcamargo commented 1 year ago

I'm happy to help! Let me know if you have any other questions :)

diego00012138 commented 3 weeks ago

This solved one of my questions too. However, I have a lot of contigs (up to about 50% of the total) that were identified as viral by the VirSorter2-CheckV-DRAMv pipeline but were not reported by geNomad, either as classified viruses or as "Unclassified". How can I explain this, given that the parameters of that pipeline are quite stringent, as described here: https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-5qpvoyqebg4o/v3?step=1

apcamargo commented 3 weeks ago

It's difficult to tell what is going on without further context. Are those sequences short? What is their CheckV quality? Have you tried to run geNomad with the --relaxed parameter?
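
One way to answer the first two questions is to list the contigs that CheckV evaluated but geNomad did not report, together with their length and CheckV quality. The sketch below rests on several assumptions: that the CheckV table has `contig_id`, `contig_length`, and `checkv_quality` columns, that the geNomad table has a `seq_name` column, and that suffixes such as `||full` or `|provirus_...` can be removed by cutting at the first `|`. The function names and example rows are hypothetical.

```python
import csv
import io


def base_name(seq_id):
    # Assumption: any tool-specific suffix follows the first "|".
    return seq_id.split("|")[0]


def missed_by_genomad(checkv_handle, genomad_handle):
    """List (contig_id, length, CheckV quality) for contigs absent from geNomad."""
    genomad_hits = {
        base_name(row["seq_name"])
        for row in csv.DictReader(genomad_handle, delimiter="\t")
    }
    missed = []
    for row in csv.DictReader(checkv_handle, delimiter="\t"):
        if base_name(row["contig_id"]) not in genomad_hits:
            missed.append(
                (row["contig_id"], int(row["contig_length"]), row["checkv_quality"])
            )
    return missed


# Made-up tables for illustration only.
checkv = io.StringIO(
    "contig_id\tcontig_length\tcheckv_quality\n"
    "c1||full\t1500\tLow-quality\n"
    "c2||full\t42000\tHigh-quality\n"
)
genomad = io.StringIO("seq_name\tlength\nc2|provirus_10_500\t490\n")
for contig_id, length, quality in missed_by_genomad(checkv, genomad):
    print(contig_id, length, quality)
```

If most of the missed contigs turn out to be short or of low CheckV quality, that would suggest where the disagreement between the two pipelines comes from.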