WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

Using DRAM to distill annotations from alternative annotation pipelines #77

Closed ChristopherBurgess-USDA closed 1 year ago

ChristopherBurgess-USDA commented 3 years ago

Hello, I was wondering whether it is possible to run DRAM's distill step on annotations generated outside of DRAM. I am working with soil metagenomes, where I 1) want to use reads that are >1 kbp, not just contigs >2.5 kb, and 2) cannot bin contigs into any decent MAGs because the reads are quite heterogeneous. I can generate an annotation file separately through Prodigal/DIAMOND/protein databases; however, I was wondering 1) whether you think this could be safely distilled using DRAM, and 2) if so, what file structure DRAM needs for annotations.tsv?

shafferm commented 3 years ago

It is totally possible. Like you said, you need to create annotations.tsv in the format that DRAM distill expects. DRAM doesn't care about column order (or, for the most part, about missing columns) as long as the first column is the gene name. To see the other columns (when annotating with KOfam and the other default databases), check out the attached file. The only ones used during distillation for metabolic information are kegg_id, kegg_hit (we look for ECs there), peptidase_family, cazy_hits (we look for CAZy IDs and ECs here) and pfam_hits. There also needs to be a column with the names that genes are grouped by. By default that is the column called fasta, but you can set this as a parameter. Missing columns are totally fine, so hypothetically you could start with an annotations file that has the gene name in the first column, the grouping information in a second column, and a third column, kegg_id, with KOs, and you would get some decent distillation results.
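The minimal three-column layout described above can be sketched in Python using only the standard library. The function name `write_minimal_annotations`, the example gene and sample names, and the empty header for the first column are assumptions for illustration; they are not taken from DRAM itself, and the attached example file remains the reference for real column names.

```python
import csv
import io


def write_minimal_annotations(rows, handle, group_column="fasta"):
    """Write (gene, group, KO) triples as a tab-separated table.

    The first column holds the gene names. Its header is left empty here,
    which is an assumption about the layout, not a documented requirement.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["", group_column, "kegg_id"])
    for gene, group, ko in rows:
        writer.writerow([gene, group, ko])


# Hypothetical records: genes called on long reads, grouped by sample
# rather than by MAG. A gene without a KO hit gets an empty kegg_id.
example_rows = [
    ("sampleA_read0001_1", "sampleA", "K00001"),
    ("sampleA_read0002_1", "sampleA", ""),
]

buffer = io.StringIO()
write_minimal_annotations(example_rows, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `io.StringIO` buffer would write the table to disk. Further columns such as pfam_hits can be appended in any order, since only the position of the first column matters according to the reply above.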

ecoli_annotations.txt

ChristopherBurgess-USDA commented 3 years ago

Thanks for that information. I think I have the basic structure down, with the row names being the gene_id, plus fasta (your MAG grouping in general). However, I'm not entirely sure how your *_hit columns were generated. I have all the *_id columns (i.e. kegg_id, pfam_id, ec_id, cog_id, ...) with their corresponding *_bitScore/*_eVal columns. Will the metabolic distill work with those columns, or do I need to generate the *_hit columns? If so, is there a DRAM function call or database I could manually hash to populate those columns for DRAM?

Thanks again for taking time to help me out. I really wish I could use DRAM from end to end but the 2.5kb threshold filters too much of my data.

shafferm commented 3 years ago

You should be fine with only those. If you have these in a TSV, you should be able to jump straight to the DRAM.py distill step and use that TSV as the input. If you don't want to build all of the databases, you can run only the DRAM-setup.py update_dram_forms command, and you will have everything you need to distill.
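Before handing an externally built table to the distill step, it may help to check which of the columns named earlier in this thread are actually present. The sketch below is a rough pre-flight check, not part of DRAM: `check_header` and `distill_command` are hypothetical helper names, and the `-i`/`-o` options are assumed to be the input and output options of `DRAM.py distill`, which is worth confirming with `DRAM.py distill --help`.

```python
import csv
import io

# Columns this thread lists as used by distill for metabolic information.
DISTILL_COLUMNS = ["kegg_id", "kegg_hit", "peptidase_family", "cazy_hits", "pfam_hits"]


def check_header(handle, group_column="fasta"):
    """Return (present, missing) distill columns from a TSV header line.

    Raises ValueError if the grouping column is absent, since genes
    could not be grouped without it.
    """
    header = next(csv.reader(handle, delimiter="\t"))
    # The first column is the gene name, so look for the group after it.
    if group_column not in header[1:]:
        raise ValueError(f"grouping column {group_column!r} not found")
    present = [c for c in DISTILL_COLUMNS if c in header]
    missing = [c for c in DISTILL_COLUMNS if c not in header]
    return present, missing


def distill_command(annotations_path, output_dir):
    """Build (but do not run) the distill call as an argument list."""
    return ["DRAM.py", "distill", "-i", annotations_path, "-o", output_dir]


# Hypothetical header and row standing in for an external annotation table.
sample = io.StringIO("\tfasta\tkegg_id\ng1\ts1\tK00001\n")
present, missing = check_header(sample)
print("present:", present)
print("missing:", missing)
print(" ".join(distill_command("annotations.tsv", "distill_out")))
```

Missing columns are reported rather than treated as errors, matching the statement above that distill tolerates them. The argument list could be passed to `subprocess.run` once the table looks reasonable.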

rmFlynn commented 1 year ago

I am closing this issue.