WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

No distillate information found warning #84

Open ucassee opened 3 years ago

ucassee commented 3 years ago

Hi, I get a large number of warnings like the following when I run DRAM-v.py distill -i annotation/annotations.tsv -o annotation/distilled:

/DRAM/lib/python3.9/site-packages/mag_annotator/summarize_vgfs.py:101: UserWarning: No distillate information found for gene TS03-LD14_NODE_934__full-cat_1_15 
warnings.warn("No distillate information found for gene %s" % gene)

Is there something wrong with my result?

Thanks,

shafferm commented 3 years ago

There is nothing wrong with your result. This is to let you know that there will be some rows in your amg_summary.tsv file which have no metabolic information. This is because the gene in question has a database identifier assigned that is a known AMG but is not in our distillate. We are working on eliminating these by adding all of these genes to the distillate. So not a problem. Just a gene that you won't get quite as much metabolic information about.
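If the volume of warnings is distracting and you drive the distill step from your own Python code, one option is to hide only this message with the standard warnings module. This is a minimal sketch, not part of DRAM: run_quietly is a hypothetical helper, and it assumes the warning text still starts with "No distillate information found".

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func while hiding only the 'No distillate information' warnings.

    Hypothetical helper for illustration; not part of DRAM.
    """
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore",
            message="No distillate information found",
            category=UserWarning,
        )
        return func(*args, **kwargs)
```

Other warnings still come through, so a real problem would not be hidden by this filter.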

cerebis commented 3 years ago

Just to add some detail (though I expect you're on top of it): I am seeing this for genes whose IDs all come from PFAM. Looking at the DRAM database, I can see that the summary tables carry no PFxxxx IDs. I guess this is a to-do, or do you feel that these are not important, too speculative in function, or false-positive AMGs?

Note below I am using pdb and upscaling warnings to errors to inspect the runtime state.
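For anyone wanting to reproduce this kind of inspection, a rough sketch of the approach is shown here. It is not DRAM code: call_with_warnings_as_errors is a hypothetical wrapper, and it assumes you can call the function of interest directly from Python.

```python
import pdb
import sys
import warnings


def call_with_warnings_as_errors(func, *args, debug=False, **kwargs):
    """Run func with UserWarning promoted to an exception.

    Hypothetical helper for illustration. With debug=True, pdb opens at the
    frame that issued the warning so local variables can be inspected.
    """
    with warnings.catch_warnings():
        # Turn every UserWarning raised inside func into an exception.
        warnings.simplefilter("error", UserWarning)
        try:
            return func(*args, **kwargs)
        except UserWarning:
            if debug:
                pdb.post_mortem(sys.exc_info()[2])
            raise
```

Python's -W error command-line option has a similar effect for a whole run without touching any code.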

For example, take the following row from potential_amgs inside the function make_viral_distillate(potential_amgs, genome_summary_frame):

fasta                                                     edge_662__full_1-cat_1
scaffold                                                  edge_662__full_1-cat_1
gene_position                                                                 22
start_position                                                             12819
end_position                                                               13820
strandedness                                                                  -1
rank                                                                           D
kegg_id                                                                         
kegg_hit                                                                        
viral_id                                                          YP_009124812.1
viral_hit                      YP_009124812.1 hydrolase [Mycobacterium phage ...
viral_RBH                                                                  False
viral_identity                                                             0.274
viral_bitScore                                                              94.0
viral_eVal                                                                   0.0
pfam_hits                      alpha/beta hydrolase fold [PF07859.15]; Prolyl...
cazy_hits                                                                       
vogdb_description                         sp|I6Y9F7|LIPQ_MYCTU Esterase LipQ; Xu
vogdb_categories                                                              Xu
heme_regulatory_motif_count                                                    0
virsorter_category                                                           1.0
auxiliary_score                                                                1
is_transposon                                                              False
amg_flags                                                                     MK
peptidase_id                                                          MER0155040
peptidase_family                                                            S09X
peptidase_hit                  MER0155040 - family S9 unassigned peptidases (...
peptidase_RBH                                                               True
peptidase_identity                                                          0.97
peptidase_bitScore                                                         478.0
peptidase_eVal                                                               0.0
Name: edge_662__full_1-cat_1_22, dtype: object

get_ids_from_row(row) returns the set

{'', 'PF08840', 'PF02129', 'PF01738', 'PF00326', 'PF05448', 'PF12146', 'PF00135', 'PF10340', 'PF12740', 'S09X', 'PF07859', 'PF02230'}

This produces an empty set when intersected with set(genome_summary_frame.index), hence the "No distillate information found" warning.
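The intersection described above can be sketched in a few lines. This is only an illustration of the logic, not the actual code in summarize_vgfs.py: ids_in_distillate is a hypothetical function, and the distillate index values are made-up placeholders standing in for genome_summary_frame.index.

```python
import warnings


def ids_in_distillate(gene, row_ids, distillate_index):
    """Return the IDs of a gene that also appear in the distillate index.

    Hypothetical stand-in for the check that triggers the warning.
    """
    hits = set(row_ids) & set(distillate_index)
    if not hits:
        warnings.warn("No distillate information found for gene %s" % gene)
    return hits


# IDs reported by get_ids_from_row(row) for the gene shown above.
row_ids = {'', 'PF08840', 'PF02129', 'PF01738', 'PF00326', 'PF05448',
           'PF12146', 'PF00135', 'PF10340', 'PF12740', 'S09X', 'PF07859',
           'PF02230'}

# Placeholder index with no PFAM IDs (assumed values, for illustration only).
distillate_index = ["K00001", "GH13"]

hits = ids_in_distillate("edge_662__full_1-cat_1_22", row_ids, distillate_index)
```

Because none of the PFxxxx IDs (nor S09X) appear in the placeholder index, hits is empty and the warning is issued, which matches the behaviour seen in the run.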

Looking at the detail in the row, it does seem like an interesting gene, whereas I'd sort of expected that it would have been a DUF.

shafferm commented 3 years ago

You are correct that these will all be from PFAM. They are genes which have been previously recognized as AMGs in other studies but either don't fit cleanly into the distillate categories that we have currently defined or we do not feel comfortable calling them metabolic genes based on only one domain. It's definitely something on our to-do list to fix in the future. We want all the functions from these genes to make it into the distillate in some form.