Closed alienzj closed 1 year ago
Hi @alienzj
There are a couple of points here:
Could you share the _genes.tsv and _summary.tsv files? Keep in mind that VirSorter2 and phamb don't take plasmids into account, so there's a chance they classify plasmids as viruses.
Because of the minimum score cutoff (--min-score 0.8), some sequences won't be classified as viruses or plasmids. The good news is that you can still check the taxonomic assignment of all sequences (regardless of their classification) in the annotation directory (try to look for vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv).
Aggregating the output of several classification tools is difficult because they will often diverge. geNomad is, on average, more accurate than VirSorter2 (see figure below), but VS2 is an amazing tool and I can't guarantee that geNomad will be correct in every single scenario where they diverge. You should gather as much information as possible.
The good news is that geNomad's output includes some information that makes it easier to understand why a given sequence was classified as a plasmid or virus:
This is an example of a _plasmid_summary.tsv file:
seq_name length topology n_genes genetic_code plasmid_score fdr n_hallmarks marker_enrichment conjugation_genes
----------- ------ -------- ------- ------------ ------------- --- ----------- ----------------- -----------------
NC_002128.1 92721 Linear 88 11 0.9942 NA 5 46.4458 T_virB11;MOBP1
NC_002127.1 3306 Linear 3 11 0.9913 NA 1 1.6586 NA
Here you can see that these sequences encode plasmid hallmarks, which is a very good indication that those sequences are indeed plasmids. Try to check if your sequences also encode those. In addition, the marker_enrichment field is a number that increases proportionally to the number of plasmid markers. So, if the marker_enrichment of a given sequence is high (say, higher than 6), it is probably a plasmid, not a virus.
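As a rough sketch of that check, the snippet below reads the two example rows shown above and reports which ones carry hallmarks and which exceed the enrichment cutoff. The function name plasmid_evidence is made up for illustration, and the cutoff of 6 is only the rule of thumb mentioned above, not a hard limit.

```python
import csv
import io

# The two example rows from the _plasmid_summary.tsv excerpt above.
EXAMPLE_TSV = (
    "seq_name\tlength\ttopology\tn_genes\tgenetic_code\tplasmid_score\tfdr\t"
    "n_hallmarks\tmarker_enrichment\tconjugation_genes\n"
    "NC_002128.1\t92721\tLinear\t88\t11\t0.9942\tNA\t5\t46.4458\tT_virB11;MOBP1\n"
    "NC_002127.1\t3306\tLinear\t3\t11\t0.9913\tNA\t1\t1.6586\tNA\n"
)


def plasmid_evidence(handle, min_enrichment=6.0):
    """Yield (seq_name, has_hallmarks, high_enrichment) for each row.

    min_enrichment is an assumed rule-of-thumb cutoff, adjust as needed.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        has_hallmarks = int(row["n_hallmarks"]) > 0
        high_enrichment = float(row["marker_enrichment"]) > min_enrichment
        yield row["seq_name"], has_hallmarks, high_enrichment


evidence = list(plasmid_evidence(io.StringIO(EXAMPLE_TSV)))
for name, hallmarks, enriched in evidence:
    print(name, "hallmarks" if hallmarks else "-", "enriched" if enriched else "-")
```

To use it on a real file, pass an open file handle instead of the io.StringIO wrapper.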
The same is true for the _virus_summary.tsv output. Try to run the classification again with a lower --min-score and see if the sequences look viral from the summary (if you would like to do the filtering yourself, based on your own criteria, just set --min-score 0). You might have some false positives in your dataset.
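If you go the --min-score 0 route, the filtering could look something like the sketch below. The rows are invented for illustration, the column names virus_score and n_hallmarks are assumed to be present in _virus_summary.tsv, and the cutoffs are arbitrary placeholders rather than recommended values.

```python
import csv
import io

# Invented rows for illustration only; real values come from your own run.
# The virus_score and n_hallmarks column names are assumptions.
EXAMPLE_TSV = (
    "seq_name\tvirus_score\tn_hallmarks\n"
    "contig_A\t0.95\t3\n"
    "contig_B\t0.62\t1\n"
    "contig_C\t0.55\t0\n"
)


def filter_viruses(handle, min_score=0.7, min_hallmarks=1):
    """Return seq_names that pass the score cutoff OR the hallmark cutoff."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        passes_score = float(row["virus_score"]) >= min_score
        passes_hallmarks = int(row["n_hallmarks"]) >= min_hallmarks
        if passes_score or passes_hallmarks:
            kept.append(row["seq_name"])
    return kept


kept = filter_viruses(io.StringIO(EXAMPLE_TSV))
print(kept)
```

Whether to combine the two criteria with "or" or with "and" depends on how strict you want to be; "and" keeps fewer sequences.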
Again, if you are only interested in the taxonomy, just look at vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv :)
Hope this helps!
Dear @apcamargo, thanks a lot for your quick and detailed reply. Sure, here are the TSV files generated by geNomad using the above command line: genomad_output_tsv.tar.gz
As for vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv, it recorded 8269 taxonomic assignments. It is quite useful. Yes, I will change --min-score, based on your suggestions, to see what happens.
Thanks a lot again!
Thanks @alienzj
There are certainly lots of plasmids in your data. You can easily see that in the _plasmid_summary.tsv file:
There are sequences with high marker_enrichment, which means that there are multiple plasmid markers in them.
There are sequences that encode plasmid hallmarks (n_hallmarks).
There are sequences that encode conjugation genes (conjugation_genes). It is important to note that there are phages capable of conjugation, though.
Hi, @apcamargo,
Thanks a lot for your reply. Yes, I shall remove those plasmids when doing virome profiling.
It is quite interesting to find so many plasmids among the viral vMAGs identified by VirSorter2 and phamb.
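A minimal sketch of that removal step, assuming the vMAG identifiers match the seq_name column of _plasmid_summary.tsv. The identifiers and the summary row below are hypothetical, and split_plasmids is an illustrative helper, not part of geNomad.

```python
import csv
import io

# Hypothetical vMAG identifiers and a hypothetical plasmid summary row.
vmag_ids = ["vMAG_1", "vMAG_2", "vMAG_3"]
PLASMID_SUMMARY = (
    "seq_name\tplasmid_score\tconjugation_genes\n"
    "vMAG_2\t0.99\tMOBP1\n"
)


def split_plasmids(ids, handle):
    """Split ids into (kept, flagged) using seq_name from a plasmid summary."""
    plasmid_names = {row["seq_name"] for row in csv.DictReader(handle, delimiter="\t")}
    kept = [i for i in ids if i not in plasmid_names]
    flagged = [i for i in ids if i in plasmid_names]
    return kept, flagged


viral, flagged = split_plasmids(vmag_ids, io.StringIO(PLASMID_SUMMARY))
print("kept for virome profiling:", viral)
print("flagged as plasmid:", flagged)
```

It is probably worth looking at the flagged sequences by hand before dropping them, since the tools can disagree.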
No problem! Let me know if you have more questions :)
Dear @apcamargo, thank you for developing such a great tool!
I used it to do viral genome taxonomic assignment:
Here is the summary of results:
Since all vMAGs were identified by VirSorter2 and phamb, and were evaluated as complete, high or medium quality by CheckV, below is what I don't currently understand:
Thanks a lot!