WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0
246 stars 52 forks

only CAZy annotation information in distilled files #85

Closed gruningerrj closed 1 year ago

gruningerrj commented 3 years ago

I appear to be having the same issue as #53: my MAGs are richly annotated, but the final distillate contains only CAZy annotation information. I note that this was solved there by adding KOs to the KEGG descriptions. However, I am not sure which files contain this information, or what the easiest way is to add it to the appropriate files. I would be grateful if anyone could help me figure out how to do this.

Thanks

shafferm commented 3 years ago

When you set up DRAM did you give it the location of your KEGG proteins file or are you using KOfam?

gruningerrj commented 3 years ago

Yes, I gave it the location of the prokaryotes.pep file in the KEGG database. A colleague who used KOfam before we had access to KEGG didn't have any trouble. Could I have used the wrong KEGG protein file?

shafferm commented 3 years ago

This may be because of a change in the format of the KEGG pep file headers. We haven't renewed our KEGG subscription for a bit over a year, so that could be causing the issue. Could you run grep '>' prokaryotes.pep | head and share the output? You could potentially fix this issue by providing the --gene_ko_link_loc flag during setup. It takes a two-column file listing all gene IDs and their KO IDs. I can't remember where in the KEGG flat file database it's stored, but I think it's called something like genes_ko.list.gz. If it would help, I could walk you through rerunning only the KEGG processing with that file added, so that you don't need to rerun the whole setup.
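To make the two-column gene/KO link file concrete, here is a minimal Python sketch of how such a file could be read into a gene-to-KO mapping. It is not DRAM's own code. The function names `open_text` and `load_gene_ko_links` are made up for illustration, and it assumes whitespace-separated columns with a `ko:` prefix on the KO IDs.

```python
import gzip
from collections import defaultdict


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def load_gene_ko_links(lines):
    """Map each gene ID to the list of KO IDs linked to it.

    Assumes two whitespace-separated columns per line, for example
    'eco:b3957 ko:K01438'. A gene may appear on several lines,
    once per KO it is linked to.
    """
    links = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene, ko = fields
        if ko.startswith("ko:"):
            ko = ko[3:]  # drop the 'ko:' prefix
        if ko not in links[gene]:
            links[gene].append(ko)
    return dict(links)


# Small in-memory example. The 'xxx:g0001' rows are hypothetical and only
# show how a gene with more than one KO would be collected.
sample = [
    "eco:b3957\tko:K01438",
    "xxx:g0001\tko:K00001",
    "xxx:g0001\tko:K00002",
]
links = load_gene_ko_links(sample)

# For a real file, something like:
#     with open_text("genes_ko.list.gz") as handle:
#         links = load_gene_ko_links(handle)
```

Keeping a list per gene rather than a single value means genes linked to several KOs are not silently collapsed to one.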

gruningerrj commented 3 years ago

Here is the output from prokaryotes.pep

eco:b0001 thrL; thr operon leader peptide
eco:b0002 thrA; fused aspartate kinase/homoserine dehydrogenase 1
eco:b0003 thrB; homoserine kinase
eco:b0004 thrC; threonine synthase
eco:b0005 yaaX; DUF2502 domain-containing protein YaaX
eco:b0006 yaaA; peroxide stress resistance protein YaaA
eco:b0007 yaaJ; putative transporter YaaJ
eco:b0008 talB; transaldolase B
eco:b0009 mog; molybdopterin adenylyltransferase
eco:b0010 satP; acetate/succinate:H(+) symporter

The format of the genes_ko.list is below:

grep 'eco:' genes_ko.list | head
eco:b3957 ko:K01438
eco:b3958 ko:K00145
eco:b3959 ko:K00930
eco:b3962 ko:K00322
eco:b3968 ko:K01977
eco:b3970 ko:K01980
eco:b3971 ko:K01985
eco:b3980 ko:K02358
eco:b3981 ko:K03073
eco:b3982 ko:K02601
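For reference, the workaround mentioned in #53 (adding KOs to the KEGG descriptions) could be sketched roughly as below. This is only an illustration. The thread does not confirm the exact header layout DRAM expects, so appending the KO IDs after a `; ` separator, the comma used for genes with more than one KO, and the function name `add_kos_to_headers` are all assumptions. The KO IDs and sequences in the example are placeholders.

```python
def add_kos_to_headers(fasta_lines, gene_to_kos, sep=","):
    """Yield FASTA lines with KO IDs appended to matching header lines.

    gene_to_kos maps a gene ID such as 'eco:b0002' to a list of KO IDs.
    Headers whose gene ID has no KO are passed through unchanged.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            gene_id = parts[0] if parts else ""
            kos = gene_to_kos.get(gene_id)
            if kos:
                # Assumed layout: original header, then '; ', then the KO IDs
                line = f"{line}; {sep.join(kos)}"
        yield line


# Placeholder KO IDs and sequences, for illustration only.
gene_to_kos = {"eco:b0002": ["K00001", "K00002"]}
fasta = [
    ">eco:b0002 thrA; fused aspartate kinase/homoserine dehydrogenase 1",
    "MPLACEHOLDER",
    ">eco:b0005 yaaX; DUF2502 domain-containing protein YaaX",
    "MPLACEHOLDER",
]
result = list(add_kos_to_headers(fasta, gene_to_kos))
```

Here the first header gains the suffix `; K00001,K00002`, while the second header and the sequence lines are left untouched. The separator is a parameter so it can be changed if a different multi-KO layout turns out to be required.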

rmFlynn commented 1 year ago

Using KEGG with DRAM is still considered advanced, but we have tools that can be used to build compatible pep files. I hope this process will become more streamlined in the future. Until then, I will call this issue closed, as it is not relevant to the current code base.

mw55309 commented 10 months ago

It might be good to show somewhere in the documentation what you think the kegg.pep headers should look like, particularly for genes that have more than one KO :)