Questions about geNomad

lsj-666 commented 4 months ago

Hi developer, thank you for developing such a nice tool! I have a few questions:

1. Some papers indicate that, when using VirSorter2 to obtain viral contigs, people usually keep only the viral contigs longer than 5 kb to avoid false positives. Should we also set a length threshold for the viral contigs identified by geNomad?

2. How do we identify AMGs in the geNomad pipeline? Is it necessary to run the geNomad-identified viral contigs through VirSorter2 and DRAM-v to get AMGs? Is it better to predict AMGs among vOTUs (after clustering) than among viral contigs (before clustering)?

Thanks!

apcamargo commented 4 months ago

Hi @lsj-666

1. Some papers indicate that, when using VirSorter2 to obtain viral contigs, people usually keep only the viral contigs longer than 5 kb to avoid false positives. Should we also set a length threshold for the viral contigs identified by geNomad?

This is true for any tool that performs classification based on information extracted from sequences. The shorter the sequence, the less information you have, and the more uncertain the prediction will be.

You can find the precision (proportion of true positives among the positives) achieved by geNomad and other tools in the benchmarks I conducted in Figures 3 and 4, as well as Supplementary Table 3 of geNomad's paper.

By default, geNomad already requires sequences shorter than 2.5 kb to encode at least one virus hallmark gene, which significantly reduces the number of false positives. If you want to be even more conservative, you may want to use the --conservative flag (read here for details).
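If you still want an explicit length cutoff on top of geNomad's own filters, it can be applied to the output FASTA after the run. The following is a minimal sketch in plain Python, not part of geNomad: the function names `read_fasta` and `filter_by_length` are made up for illustration, and the cutoff is whatever you choose (the 5 kb figure comes from the VirSorter2 convention mentioned in the question, not from a geNomad recommendation).

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_length: int) -> List[Tuple[str, str]]:
    """Keep only the records whose sequence is at least min_length long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input with a tiny cutoff so the example is easy to follow.
example = """>contig_a
ACGTACGTAC
>contig_b
ACGT
""".splitlines()

kept = filter_by_length(read_fasta(example), min_length=5)
print([header for header, _ in kept])
```

For real data you would pass an open file handle instead of `example` and set `min_length` to the cutoff you have settled on, for instance 5000.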

2. How do we identify AMGs in the geNomad pipeline? Is it necessary to run the geNomad-identified viral contigs through VirSorter2 and DRAM-v to get AMGs? Is it better to predict AMGs among vOTUs (after clustering) than among viral contigs (before clustering)?

There's no need for that. You can just run DRAM directly from the file containing the sequences of viruses identified with geNomad (which should be in <prefix>_summary/<prefix>_virus.fna).
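As a small illustration of where that file lives, the path can be assembled from the output directory and the prefix. This is only a sketch: `genomad_virus_fasta` is a hypothetical helper, and it assumes you already know the prefix geNomad used for your run.

```python
from pathlib import Path


def genomad_virus_fasta(output_dir: str, prefix: str) -> Path:
    """Build the expected path <prefix>_summary/<prefix>_virus.fna.

    Hypothetical helper: it only assembles the path described above and
    does not check that geNomad actually produced the file.
    """
    return Path(output_dir) / f"{prefix}_summary" / f"{prefix}_virus.fna"


# "genomad_out" and "sample" are placeholder names.
virus_fasta = genomad_virus_fasta("genomad_out", "sample")
print(virus_fasta.as_posix())
```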

lsj-666 commented 4 months ago

Thank you for your help! Regarding question 1, I now understand that geNomad is already conservative about false positives. Regarding question 2, I'm still puzzled: the input that DRAM-v needs to identify AMGs includes a .tab file (such as viral-affi-contigs-for-dramv.tab), which is produced by the VirSorter2 flag --prep-for-dramv. The VirSorter2 command looks like this: virsorter run --seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv -i checkv/combined.fna -w vs2-pass2 --include-groups dsDNAphage,ssDNA --min-length 5000 --min-score 0.5 -j 28 all. Can geNomad produce this .tab file with any flag? Thanks!

apcamargo commented 4 months ago

Ohh, ok. I thought DRAM (and DRAM-v) only needed a FASTA file with the genomes. Do you know how the tab file is supposed to be formatted?

As a last resort, you can use VirSorter2 just to create the DRAM-v input from geNomad's predictions. You can use the flags --min-score 0 --provirus-off, so that VirSorter2 neither discards nor trims the sequences geNomad already selected.
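To make the suggestion concrete, here is a sketch that assembles such a VirSorter2 command as an argument list. It only uses flags that appear in this thread (--prep-for-dramv, --min-score, --provirus-off, -i, -w, -j, and the `all` target). The function name, the input path, the working directory name, and the thread count are placeholders, and the combination has not been tested here.

```python
import shlex
from typing import List


def virsorter2_prep_command(virus_fasta: str, workdir: str,
                            threads: int = 4) -> List[str]:
    """Assemble a VirSorter2 call that only prepares DRAM-v input.

    Hypothetical helper: it builds the command but does not run it.
    """
    return [
        "virsorter", "run",
        "--prep-for-dramv",      # write the DRAM-v input files
        "--min-score", "0",      # keep every sequence geNomad selected
        "--provirus-off",        # do not trim sequences as proviruses
        "-i", virus_fasta,
        "-w", workdir,
        "-j", str(threads),
        "all",
    ]


cmd = virsorter2_prep_command("sample_summary/sample_virus.fna",
                              "vs2-for-dramv")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once VirSorter2 is installed. Printing the command first makes it easy to inspect before launching a long job.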

lsj-666 commented 4 months ago

The tab file looks like this:

P-160198_1|7|l
P-1__60198_11|127|693|6833|-|3300001112@JGI12322J13274_1000001@JGI12322J13274_1000001240|59.6|-|1|-|nan|-
P-160198_1__2|742|1221|6833|-|Phage_cluster_215.ali_faa|99.7|-|1|Macro|56.6|-
P-160198_13|1801|2847|6833|+|-|nan|-|2|-|nan|-
P-160198_14|2958|3635|6833|+|Phage_cluster_929.ali_faa|215.8|-|1|Metallophos|33.0|-
P-1__60198_15|3619|4389|6833|+|FGase|81.9|-|2|FGase|81.9|-
P-160198_16|4503|6251|6833|+|-|nan|-|2|-|nan|-
P-160198_17|6359|6832|6833|+|-|nan|-|2|-|nan|-
P-171436_1|4|l
P-1__71436_11|3|278|5677|+|-|nan|-|2|-|nan|-

But I don't know what it specifically means, haha. Thanks!
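Judging only from the excerpt, the file seems to alternate between a contig line with three pipe-separated fields and one line per gene with twelve fields. The sketch below groups gene lines under their contig on that assumption. The name `parse_affi_tab` and the dictionary keys are made up for illustration, the meaning of the individual columns is not documented in this thread, and the handling of a leading ">" on contig lines is a guess about the real file.

```python
from typing import Dict, Iterable


def parse_affi_tab(lines: Iterable[str]) -> Dict[str, dict]:
    """Group gene lines under the contig line that precedes them.

    Assumption: a line with exactly three '|'-separated fields starts a
    new contig, and every other non-empty line describes one gene.
    """
    contigs: Dict[str, dict] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # A leading '>' on contig lines is tolerated (a guess, see above).
        fields = line.lstrip(">").split("|")
        if len(fields) == 3:
            current = fields[0]
            contigs[current] = {
                "n_genes": int(fields[1]),  # appears to be the gene count
                "flag": fields[2],          # meaning not stated in the thread
                "genes": [],
            }
        elif current is not None:
            contigs[current]["genes"].append(fields)
    return contigs


# Two lines copied from the excerpt above.
example = [
    "P-171436_1|4|l",
    "P-1__71436_11|3|278|5677|+|-|nan|-|2|-|nan|-",
]
contigs = parse_affi_tab(example)
for name, info in contigs.items():
    print(name, info["n_genes"], len(info["genes"]))
```

In the first contig of the excerpt, the second field of the contig line (7) matches the number of gene lines that follow, which is why it is read as a gene count here.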

apcamargo commented 4 months ago

Ok, that is really not intuitive at first... It seems that DRAM-v is tightly associated with VirSorter2. Can you test my suggestion with a couple of genomes and let me know if it works?

Alternatively, why don't you try to annotate your genes with KEGG terms (using KOfamscan, for example) and check which pathways you find in the genomes? I'm not sure how well that would work for AMGs, though.

lsj-666 commented 4 months ago

OK, I will test your suggestion to see if it works. As far as I know, DRAM-v seems to be the most popular tool for AMG identification, since it can serve as a filter to judge which genes are AMGs and which are not. It assigns flags to genes so that users know which ones are AMGs, and then annotates them. So we could keep only the AMGs that DRAM-v identifies and then annotate them with additional software (such as KOfamscan). Thank you so much!

apcamargo commented 4 months ago

No problem! I'll close this for now. Let me know if you have any other questions!

apcamargo / genomad

Questions about geNomad #77