WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

No information for some AMGs #219

Closed chenyj8 closed 1 year ago

chenyj8 commented 1 year ago

Hello,

I got the amg summary file, but there are some genes identified as AMG (score < 4) but without gene_id, gene_description, category, header, subheader, and module. Why are these genes identified as AMGs even though they are unannotated?

rmFlynn commented 1 year ago

It is normal for there to be empty places in those columns in the AMG summary, because we only report a select set of known genes in the distillate but have broader criteria for AMGs. If we reported every possible hit, the output would be too large to interpret and the information would lose value. Instead, we use the flags, as described in the DRAM paper, to explain why AMGs were selected. However, it is irregular for the columns to be empty when certain flags are set, because those flags usually indicate that one of the select genes we report in that summary was matched. This can still result in empties if that match was filtered out. It is also possible for the information to be lost or not reported as a result of merging files. I assume you ran DRAM-v annotate in the normal way?

If you see rows without data that carry both of these flags:

M: Has a gene ID known to the distillate.
K: That known gene ID is related to a known AMG.

would you be willing to send us your log file (check it for errors) and your annotations.tsv?
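A quick way to check for the rows described above is to scan the AMG summary for entries whose descriptive columns are all empty even though the flags contain both M and K. The following is only a minimal sketch: the column names (`gene`, `amg_flags`, and the six descriptive columns named in the question) are assumptions about the layout of amg_summary.tsv, and the sample rows and IDs are made up for illustration.

```python
import csv
import io

# Columns the original question reported as empty (names are assumed).
DESCRIPTIVE_COLUMNS = [
    "gene_id", "gene_description", "category",
    "header", "subheader", "module",
]


def unexplained_blank_rows(tsv_text, flag_column="amg_flags"):
    """Return rows with no descriptive data whose flags include both M and K."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    suspicious = []
    for row in reader:
        flags = row.get(flag_column) or ""
        is_blank = all(
            not (row.get(col) or "").strip() for col in DESCRIPTIVE_COLUMNS
        )
        if is_blank and "M" in flags and "K" in flags:
            suspicious.append(row)
    return suspicious


# Made-up rows for illustration only; not taken from the attached files.
sample = (
    "gene\tgene_id\tgene_description\tcategory\theader\tsubheader\tmodule\tamg_flags\n"
    "contig_1_5\t\t\t\t\t\t\tMK\n"
    "contig_1_9\tEXAMPLE_ID\texample description\texample category\t\t\t\tMK\n"
    "contig_2_3\t\t\t\t\t\t\tM\n"
)

for row in unexplained_blank_rows(sample):
    print(row["gene"], row["amg_flags"])
```

To run it on a real file, pass the contents of amg_summary.tsv (for example `open(path).read()`) instead of `sample`, and adjust the column names if your version of DRAM writes different ones.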

chenyj8 commented 1 year ago

Thanks for the clarification. I did run DRAM-v annotate using default parameters:

DRAM-v.py annotate -i final-viral-combined-for-dramv.fa -v viral-affi-contigs-for-dramv.tab -o DRAMv_annotation --threads 94

Then distill:

DRAM-v.py distill -i DRAMv_annotation/annotations.tsv -o DRAMv_distilled

I have attached the log file (std output), annotations.xlsx (tsv is not supported in GitHub), and amg_summary.xlsx.

log.txt annotations.xlsx amg_summary.xlsx

rmFlynn commented 1 year ago

Hi, thanks for waiting so long for me to get back to this. I created a fix that I hope will make it clearer how genes end up being called likely AMGs in DRAM.

All likely AMGs are included in the summary, but only genes that are in our genome_summary_form, which you can see here, are labeled in the AMG summary. Genes that matched only our known AMG database, see here, are included but not labeled in the summary.

The amg_summary.tsv file, which I have attached (amg_summary.zip), has all the data from the AMG summary added. Having all the data can be a bit cumbersome. A gene can match a maximum of 3 known metabolic genes and still be an AMG. This limit not only filters out non-viral genes, it also limits the number of lines that gene has in the AMG summary. You will see that the new file is slightly larger. The data is mostly self-explanatory; there are 5 new columns in amg_summary_plus. The potential_amg, metabolism, reference, and verified columns are all from the amg_database and will be blank for genes that are only in the genome_summary_form. The gene_id_origin column will tell you whether a row is a match to the genome_summary_form or to the amg_database.
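For anyone wanting to see how the rows in the extended file split between the two sources, a small tally over the gene_id_origin column may help. This is a rough sketch under assumptions: the values `genome_summary_form` and `amg_database` in that column, the `gene` column name, and the sample rows are all hypothetical, so check them against the attached file before relying on it.

```python
import csv
import io
from collections import Counter


def summarize_origins(tsv_text, origin_column="gene_id_origin", gene_column="gene"):
    """Count rows per gene_id_origin value and rows per gene."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    origin_counts = Counter()
    lines_per_gene = Counter()
    for row in reader:
        # Rows with an empty origin are grouped under "unknown".
        origin_counts[row.get(origin_column) or "unknown"] += 1
        lines_per_gene[row[gene_column]] += 1
    return origin_counts, lines_per_gene


# Made-up rows for illustration only; IDs and origin values are assumed.
sample = (
    "gene\tgene_id\tgene_id_origin\n"
    "contig_1_5\tEXAMPLE_A\tamg_database\n"
    "contig_1_9\tEXAMPLE_B\tgenome_summary_form\n"
    "contig_1_9\tEXAMPLE_C\tgenome_summary_form\n"
)

origin_counts, lines_per_gene = summarize_origins(sample)
print(dict(origin_counts))
print(dict(lines_per_gene))
```

The per-gene count is included because of the limit described above: a gene that is still called an AMG should not have more than 3 matched metabolic genes, so a much larger count for one gene would be worth a second look.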

Note that I probably do not have your version of DRAM installed, and I don't know what version of the DRAM sheets you had. Because these sheets change, 2 of the genes that were AMGs in your data are not in mine and so remain blank.

I hope that helps clarify the situation. Thanks again for waiting!