apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

How to understand the results? #6

Closed alienzj closed 1 year ago

alienzj commented 1 year ago

Dear @apcamargo, thank you for developing such a great tool!

I used it to do viral genome taxonomic assignment:

genomad end-to-end --min-score 0.8 --cleanup --splits 16 \
results/09.dereplicate/genomes/virome/representative/vMAGs_hmq.megahit.rep.fa.gz \
genomad_output ~/databases/ecogenomics/geNomad/genomad_db \
>genomad.log 2>&1

Here is the summary of results:

➤ zcat results/09.dereplicate/genomes/virome/representative/vMAGs_hmq.megahit.rep.fa.gz | rg -c "^>"
8439

➤ wc -l genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_plasmid_summary.tsv
483 genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_plasmid_summary.tsv

➤ wc -l genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_virus_summary.tsv
4933 genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_virus_summary.tsv

All vMAGs were identified by VirSorter2 and phamb, and were rated complete, high-quality, or medium-quality by CheckV, so there are two things I don't understand:

  1. Why does geNomad identify plasmids among viral genomes (vMAGs)? It found 482 plasmids.
  2. The number of input genomes is 8439, so why do only 4932 viral genomes have taxonomic assignments?

Thanks a lot!

apcamargo commented 1 year ago

Hi @alienzj

There are a couple of points here:

Aggregating the output of several classification tools is difficult because they will often disagree. geNomad is, on average, more accurate than VirSorter2 (see figure below), but VS2 is an amazing tool and I can't guarantee that geNomad will be correct in every case where they disagree. You should gather as much information as possible.

[Figure: VIRUS_MAIN_BENCHMARK]

The good news is that geNomad's output includes some information that makes it easier to understand why a given sequence was classified as a plasmid or virus:

This is an example of a _plasmid_summary.tsv file:

seq_name      length   topology   n_genes   genetic_code   plasmid_score   fdr   n_hallmarks   marker_enrichment   conjugation_genes
-----------   ------   --------   -------   ------------   -------------   ---   -----------   -----------------   -----------------
NC_002128.1   92721    Linear     88        11             0.9942          NA    5             46.4458             T_virB11;MOBP1
NC_002127.1   3306     Linear     3         11             0.9913          NA    1             1.6586              NA

Here you can see that these sequences encode plasmid hallmarks (the n_hallmarks column), which is a very good indication that they are indeed plasmids. Check whether your sequences encode hallmarks too. In addition, the marker_enrichment field is a number that increases with the number of plasmid markers. So, if the marker_enrichment of a given sequence is high (say, higher than 6), it is probably a plasmid, not a virus.
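The check described above can be sketched in Python. This is a minimal illustration, not part of geNomad: it assumes the summary file is tab-separated with a header row, and it uses the column names from the example table. The function name is hypothetical, and the cutoff of 6 is only the rough guide mentioned above, not an official threshold.

```python
import csv

def strong_plasmid_candidates(lines, min_enrichment=6.0):
    """Return seq_name values whose summary row shows strong plasmid evidence.

    `lines` is any iterable of tab-separated lines with a header row,
    e.g. an open *_plasmid_summary.tsv file. A row is kept when it has at
    least one hallmark AND marker_enrichment above the (assumed) cutoff.
    """
    selected = []
    for row in csv.DictReader(lines, delimiter="\t"):
        hallmarks = int(row["n_hallmarks"])
        enrichment = float(row["marker_enrichment"])
        if hallmarks > 0 and enrichment > min_enrichment:
            selected.append(row["seq_name"])
    return selected

# The two example rows from the table above, written as tab-separated text.
example = [
    "seq_name\tlength\ttopology\tn_genes\tgenetic_code\tplasmid_score\tfdr\tn_hallmarks\tmarker_enrichment\tconjugation_genes\n",
    "NC_002128.1\t92721\tLinear\t88\t11\t0.9942\tNA\t5\t46.4458\tT_virB11;MOBP1\n",
    "NC_002127.1\t3306\tLinear\t3\t11\t0.9913\tNA\t1\t1.6586\tNA\n",
]
print(strong_plasmid_candidates(example))
```

With the default cutoff only NC_002128.1 is returned. NC_002127.1 also has a hallmark and a high plasmid_score, so a strict enrichment cutoff is conservative and should be combined with the other columns rather than used alone.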

The same is true for the _virus_summary.tsv output. Try running the classification again with a lower --min-score and see whether the sequences look viral from the summary (if you would rather do the filtering yourself, based on your own criteria, just set --min-score 0). You might have some false positives in your dataset.
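If you rerun with --min-score 0 and filter on your own, the filtering step might look like the following minimal Python sketch. It assumes the summary is tab-separated with a header row and that the score column is named virus_score, by analogy with plasmid_score in the plasmid example. The function name is hypothetical, and the demo rows are made-up placeholders, not results from this dataset.

```python
import csv

def filter_by_score(lines, score_column="virus_score", min_score=0.8):
    """Return (seq_name, score) pairs whose score is at least min_score.

    `lines` is any iterable of tab-separated lines with a header row,
    e.g. an open *_virus_summary.tsv file. The column name "virus_score"
    is an assumption; pass score_column="plasmid_score" for plasmid files.
    """
    kept = []
    for row in csv.DictReader(lines, delimiter="\t"):
        score = float(row[score_column])
        if score >= min_score:
            kept.append((row["seq_name"], score))
    return kept

# Hypothetical rows for illustration only (not real geNomad output).
demo = [
    "seq_name\tvirus_score\n",
    "contig_A\t0.95\n",
    "contig_B\t0.55\n",
    "contig_C\t0.80\n",
]
print(filter_by_score(demo))                 # keeps contig_A and contig_C
print(filter_by_score(demo, min_score=0.5))  # keeps all three rows
```

Filtering afterwards like this lets you try several thresholds without rerunning the classification each time.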

Again, if you are only interested in the taxonomy, just look at vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv :)
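When comparing counts across these files, keep in mind that wc -l also counts the header line, so 483 lines correspond to 482 plasmids and 4933 lines to 4932 viruses. A minimal Python sketch that counts data rows directly, assuming each TSV starts with exactly one header line (the function name is hypothetical):

```python
def count_records(lines):
    """Count data rows in a TSV, skipping the header line and blank lines."""
    rows = [line for line in lines if line.strip()]
    return max(len(rows) - 1, 0)

# Line counts reported by `wc -l` and `rg -c` earlier in this thread.
plasmid_lines, virus_lines, n_input = 483, 4933, 8439
n_plasmids = plasmid_lines - 1  # 482
n_viruses = virus_lines - 1     # 4932

# Inputs that appear in neither summary at --min-score 0.8. This assumes each
# summary row maps to one distinct input sequence, which may not hold if
# proviruses were excised from longer sequences.
print(n_input - n_plasmids - n_viruses)
```

Under that assumption, the sequences missing from both summaries are the ones that did not reach the score threshold, which is why the summaries contain fewer entries than the input.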

Hope this helps!

alienzj commented 1 year ago

Dear @apcamargo, thanks a lot for your quick and detailed reply. Sure, here are the TSV files generated by geNomad using the command line above: genomad_output_tsv.tar.gz

  1. From the figure you provided, it is excellent that geNomad performs so accurately.
  2. Yes, there's a chance that VirSorter2 and phamb classify plasmids as viruses.
  3. I checked vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv; it records 8269 taxonomic assignments, which is quite useful. I will change --min-score as you suggested and see what happens.

Thanks a lot again!

apcamargo commented 1 year ago

Thanks @alienzj

There are certainly lots of plasmids in your data. You can easily see that in the _plasmid_summary.tsv file:

alienzj commented 1 year ago

Hi, @apcamargo,

Thanks a lot for your reply. Yes, I shall remove those plasmids when doing virome profiling.

It is quite interesting to find so many plasmids among the vMAGs identified by VirSorter2 and phamb.

apcamargo commented 1 year ago

No problem! Let me know if you have more questions :)