apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/
Other
169 stars 17 forks source link

High discrepancy between the high amount of viruses for geNomad and the low number of complete viruses #79

Open bhagavadgitadu22 opened 4 months ago

bhagavadgitadu22 commented 4 months ago

Hello,

I am currently struggling with metagenomes obtained from environmental biofilms (river rocks). I have 50 metagenomes but I obtain only 100 checkV high-quality viruses. I would expect a lot more from this amount of metagenomes, probably about 1000 HQ viruses, especially considering that geNomad identifies about 500 viruses per metagenome (about 30000 total). My understanding is that all my viruses are small genome fragments and very few were completed during the assembly so:

In the latter case, I am wondering what would be your advice:

Any insight would be welcome!

Martin

apcamargo commented 4 months ago

My guess is that this is a problem with the assembly and/or the underlying sequencing data.

Maybe my DNA was fragmented to start with even before I made my libraries, in which case my only hope would be to do long-read sequencing as well

This is certainly a possibility. Did you compare the contiguity of these assemblies to the ones where you get the expected number of HQ virus genomes?

Forget about checkV and make my own thresholds (example keep predicted viruses with more than 5 hallmark genes in geNomad...)

By doing this, you'd bias the selection towards more complete genomes (since you'd be requiring several hallmarks to be present, which is more likely in long sequences). But this doesn't solve the underlying issue that you're not retrieving as many HQ genomes as you'd expect. You could, however, report the genomes you identify using both CheckV estimates and no. geNomad hallmarks. I'd suggesti against not using CheckV, since it is a standard practice in the field and provides a standard that allow different studies to be compared.

Look more into the numerous checkV Not-determined viruses (no viral genes detected) considering geNomad identified them as viruses for a reason so they might be super novel viruses that checkV cannot evaluate

This is possible, but shouldn't be common. I'd take a closer look at those sequences and evaluate which genes they encode (eg.: do they encode hallmarks? which?)

Include checkV Medium-quality viruses into the analysis rather than only High-quality ones

That's also a possibility, but it really depends on the questions you're trying to answer. In several cases you need a complete or close to complete genome to extract reliable biological data.

Complement geNomad viruses with other viral detection tools, for instance Metaviral SPADES because it uses a completely different approach so it might return completely different viruses

By metaviral SPADES do you mean viralVerify (the tool they provide with the paper). If so, their approach is very similar to geNomad's marker classifier. In the beggining of geNomad's development, we did use viralVerify to help us filtering out some garbage (see this). But nowadays geNomad already provides lots of options to do that (eg.: no. of hallmarks, marker enrichment, etc.), so I'm not using viralVerify in recent projects. You can try it anyway and check if it helps you retrieve more stuff, though.

Anything else that I did not think of :)

If you ran geNomad with default parameters, it applies some filters to remove sequences that are not super reliable (like short genomes without hallmarks). You may try to run geNomad with the --relaxed flag to turn the filters off and increase sensitivity. Another option is to make the marker search more sensitive and annotate more genes (use -s 6.2 or higher, for example), but this will make the search slower and consume more memory.

All that said, my feeling is that the assemblies are the underlying problem, since you got the expected number of HQ viruses in other samples. Or maybe it's biological.

bhagavadgitadu22 commented 4 months ago

Thank you very much for the detailed answer!

I think indeed the problem is located before the viral detection tool (so either the biofilms or the lab practices or the assemblies are problematic).

Concerning the assemblies, the N50 of my metagenomes is around 1600 bp whereas on a previous similar project we had an N50 of 2700 bp. It does seem like the problem locates there but so far I did not find a solution.

I tested a bunch of assemblers (megahit with or without --metalarge option, metaspades with or without --metaviral option) and the metaviralspades assemblies were the best ones. For my test biofilm, I get 24 HQ viruses with metaviralspades where I was getting 2 with megahit+geNomad. Following the suggestion of the metaviral SPADES authors, I used checkV directly on metaviral SPADES results. When I tried to use geNomad in between though, it detects only 12 viruses where checkV was giving me 24 HQ viruses. Would you trust checkV results directly applied on the metaviralSPADES assemblies?

My feeling so far is that I need to make the best of what I have, so probably a combination of metaviralSPADES HQ viruses with megahit+geNomad HQ viruses would work to get a higher number of viruses. I am interested in the temporal changes over a few weeks so Medium-quality viruses should be okay for this purpose.

apcamargo commented 4 months ago

Concerning the assemblies, the N50 of my metagenomes is around 1600 bp whereas on a previous similar project we had an N50 of 2700 bp. It does seem like the problem locates there but so far I did not find a solution.

If the assemblies were processed uniformly, my bet is that the problem is in the library itself.

Would you trust checkV results directly applied on the metaviralSPADES assemblies?

Can you share the sequences that CheckV considered HQ but geNomad didn't classify as viral? Do you know if this changes if you use the --relaxed flag in geNomad?

My recommendation, in principle, is to not consider them as viral because CheckV is not really meant for that and I've seem cases where just using CheckV would provide unreliable results. But this could vary, so it would be better to take a close look at those sequences.

My feeling so far is that I need to make the best of what I have, so probably a combination of metaviralSPADES HQ viruses with megahit+geNomad HQ viruses would work to get a higher number of viruses. I am interested in the temporal changes over a few weeks so Medium-quality viruses should be okay for this purpose.

I agree. Probably the best thing to do will be to use geNomad with default parameters and allow MQ genomes. This is a good compromise as you'd still be conservative in the sense of requiring sequences to have a strong "viral signal" (according to geNomad), but would relax the CheckV cutoff (as MQ genomas are just fine for your goal).

bhagavadgitadu22 commented 4 months ago

Here are an access to the results for my test sample (https://we.tl/t-zPQcs634nY). I get 24 HQ viruses with metaviralSPADES+checkV (file scaffolds_HQ.fasta). Among them, 2 are found by geNomad and 9 with geNomad --relaxed. The detailed list of which tool identified which viruses and of the characs given by CheckV is in the file checkv_data_HQ_viruses.txt.

I plotted all viruses with pharoka_multiplotter.py to see how the potential viruses looked like. I put the png in 3 different folders (plots_not_viruses_but_HQ for the viruses not found by geNomad, plots_relaxed_HQ for viruses found only in relaxed node of geNomad and plots_double_HQ for viruses found with both modes of geNomad). I do not see an apparent pattern between the 3 categories though.

apcamargo commented 4 months ago

These don't look good. :/

The reason CheckV classified them as Complete is because they have direct terminal repeats, not because they "look like complete". Can you check the completeness.tsv file generated by CheckV? There you will find whether these sequences are similar to known complete viruses and what is their estimated completeness (not taking into account the presence of DGRs). Also, the presence of virus hallmark genes (which you can find in geNomad's output) is a good filter to remove some junk.