Closed ChristopherBurgess-USDA closed 1 year ago
It is totally possible. Like you said, you need to create an annotations.tsv in the format that DRAM distill expects. DRAM doesn't care about column order (or, for the most part, about missing columns) as long as the first column is the gene name. To see the other columns (when annotating with KOfam and other default databases), check out the attached file. The only ones used during distillation for metabolic information are kegg_id, kegg_hit (we look for ECs there), peptidase_family, cazy_hits (we look for CAZy IDs and ECs here) and pfam_hits. There also needs to be a column with the names that genes are grouped by. By default that is the column called fasta, but you can set this as a parameter. Missing columns are totally fine, so hypothetically you could start with an annotations file that has a first column with the gene name, a second column with the grouping information and a third column, kegg_id, with KOs, and you would get some decent distillation results.
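As a rough illustration of the three-column starting point described above, here is a minimal Python sketch that writes such a table. The gene names, group labels and KO values are placeholders, the function name is hypothetical, and leaving the first header cell empty is an assumption (it mimics a table saved with row names); none of this is taken from DRAM's own code.

```python
import csv
import io

# Placeholder rows: (gene name, grouping label, KO identifier).
# These values are illustrative only and do not come from a real dataset.
rows = [
    ("sampleA_contig1_1", "sampleA", "K00001"),
    ("sampleA_contig1_2", "sampleA", "K00845"),
    ("sampleB_contig7_1", "sampleB", ""),  # gene with no KO assigned
]


def write_minimal_annotations(rows, handle):
    """Write a tab-separated table: gene name first, then fasta and kegg_id."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumption: the gene-name column has an empty header, as it would if
    # the gene names were written out as row names / an index.
    writer.writerow(["", "fasta", "kegg_id"])
    writer.writerows(rows)


# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_minimal_annotations(rows, buffer)
annotations_text = buffer.getvalue()
print(annotations_text)
```

To produce a file on disk, the same function can be given a handle opened with `open("annotations.tsv", "w", newline="")`.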
Thanks for that information. I think I have the basic structure down, with the row names being the gene_id and a fasta column (your MAG grouping in general). However, I'm not entirely sure how your *_hit columns were generated. I have all the *_id columns (i.e. kegg_id, pfam_id, ec_id, cog_id, ...) with their corresponding *_bitScore/*_eVal columns. Will the metabolic distill work for those columns, or do I need to generate the *_hit columns? If so, is there a DRAM function call or database I could manually hash to populate that column for DRAM?
Thanks again for taking the time to help me out. I really wish I could use DRAM from end to end, but the 2.5kb threshold filters too much of my data.
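For an externally generated table like the one described here, a quick header check can show which of the columns named earlier in the thread are present. The sketch below is only an illustration: the column list is copied from the reply above, the example header mirrors the columns mentioned in this comment, and the function name is hypothetical rather than anything provided by DRAM.

```python
import csv
import io

# Columns the earlier reply lists as used for metabolic distillation.
DISTILL_COLUMNS = ["kegg_id", "kegg_hit", "peptidase_family",
                   "cazy_hits", "pfam_hits"]


def report_columns(handle, group_column="fasta"):
    """Read the header row of a TSV and report which columns are present."""
    header = next(csv.reader(handle, delimiter="\t"))
    return {
        "has_group_column": group_column in header,
        "present": [c for c in DISTILL_COLUMNS if c in header],
        "missing": [c for c in DISTILL_COLUMNS if c not in header],
    }


# Hypothetical header for a table with *_id, *_bitScore and *_eVal columns.
example = io.StringIO(
    "\tfasta\tkegg_id\tkegg_bitScore\tkegg_eVal\tpfam_id\tec_id\tcog_id\n"
)
report = report_columns(example)
print(report)
```

A column reported as missing is not necessarily a problem, since the thread states that missing columns are fine; the report only shows what distill would have to work with.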
You should be fine with only those. If you have these in a tsv, then you should be able to jump right to the DRAM.py distill step, using that tsv as the input. If you don't want to build all of the databases, you can run only the DRAM-setup.py update_dram_forms command and you will have everything you need to distill.
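The two steps mentioned above could be driven from Python roughly as follows. This is a sketch under assumptions: the `-i` and `-o` options are how I recall the distill usage shown in DRAM's README, the file and directory names are placeholders, and the parameter for changing the grouping column is left out, so `DRAM.py distill --help` is the place to confirm the exact options.

```python
import os
import shutil
import subprocess

# Setup step named in the reply above; shown for reference and not run here.
SETUP_COMMAND = ["DRAM-setup.py", "update_dram_forms"]


def build_distill_command(annotations_path, output_dir):
    """Build the distill call; -i/-o are assumed from DRAM's README usage."""
    return ["DRAM.py", "distill", "-i", annotations_path, "-o", output_dir]


# Placeholder paths for the hand-built table and the output directory.
command = build_distill_command("annotations.tsv", "distill_output")
print(" ".join(SETUP_COMMAND))
print(" ".join(command))

# Only run distill when DRAM is installed and the input table exists.
if shutil.which("DRAM.py") is not None and os.path.exists("annotations.tsv"):
    subprocess.run(command, check=True)
```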
I am closing this issue
Hello, I was wondering if it is possible to use DRAM's distill on annotations done outside of DRAM. I am working with soil metagenomes, for which I 1) want to use reads that are >1 kbp, not just contigs >2.5 kb, and 2) cannot bin contigs into any decent MAGs since the reads are quite heterogeneous. I can generate an annotation file through Prodigal/DIAMOND/protein databases separately; however, I was wondering 1) whether this is something you think could be safely distilled using DRAM and 2) if so, what the file structure is for the annotations.tsv that DRAM needs?