WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

Question about implementation of Pfam #217

Closed cmkobel closed 2 years ago

cmkobel commented 2 years ago

This is not a bug, but simply a question about how Pfam is implemented in DRAM.

Hello, I'm just trying to understand how DRAM works. The Pfam documentation (10.1093/nar/gky995) clearly states that each protein family is defined by profile HMMs. But the DRAM paper (10.1093/nar/gkaa621) states that mmseqs2 is used for searches in the Pfam database.

I looked into the mmseqs2 documentation mmseqs.com/latest/userguide.pdf, but it doesn't mention anything about an implementation of searching for profile HMMs.

Can somebody explain what algorithm is used for the Pfam database? The Pfam documentation suggests HMMER 3, whereas the DRAM documentation states mmseqs2.

rmFlynn commented 2 years ago

MMseqs2 provides the search command for profile searching. The command we use is mmseqs search query_db pfam_profile, with some additional arguments.

""" mmseqs search: Compares all sequences in the query database with all sequences in the target database, using the prefiltering and alignment modules. MMseqs2 search supports sequence/sequence, profile/sequence or sequence/profile searches. """

So we download the Pfam database from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz, convert it to an MSA, and then convert that to an MMseqs2 profile. The commands for this are all provided by MMseqs2: mmseqs convertmsa and mmseqs msa2profile. To understand how DRAM transforms and uses each database, you will need to read database_processing.py and annotate_bins.py. I hope that helps explain the process.
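For readers who want to see the order of operations, here is a minimal sketch of the steps described above, written as Python that only assembles the MMseqs2 command lines. It is not DRAM's actual code: the function names, file names, and directory layout are hypothetical, and the extra arguments DRAM passes are left out because they are not given in this thread. The subcommands used (createdb, convertmsa, msa2profile, search) are standard MMseqs2 modules.

```python
import shutil
import subprocess
from pathlib import Path


def pfam_profile_commands(pfam_msa, work_dir):
    """Commands that turn a Pfam MSA file into an MMseqs2 profile database.

    `pfam_msa` is a placeholder for the downloaded Pfam alignment file.
    """
    work = Path(work_dir)
    msa_db = work / "pfam_msa_db"        # hypothetical intermediate name
    profile_db = work / "pfam_profile"   # name taken from the comment above
    return [
        # Step 1: convert the MSA file into an MMseqs2 MSA database.
        ["mmseqs", "convertmsa", str(pfam_msa), str(msa_db)],
        # Step 2: build one profile per alignment.
        ["mmseqs", "msa2profile", str(msa_db), str(profile_db)],
    ]


def pfam_search_commands(genes_fasta, work_dir):
    """Commands that search query proteins against the Pfam profiles."""
    work = Path(work_dir)
    query_db = work / "query_db"
    profile_db = work / "pfam_profile"
    result_db = work / "pfam_hits"       # hypothetical output name
    tmp_dir = work / "tmp"
    return [
        # Build a sequence database from the query proteins.
        ["mmseqs", "createdb", str(genes_fasta), str(query_db)],
        # Sequence-vs-profile search; DRAM adds further arguments not shown here.
        ["mmseqs", "search", str(query_db), str(profile_db),
         str(result_db), str(tmp_dir)],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    if shutil.which("mmseqs") is None:
        raise RuntimeError("mmseqs was not found on PATH")
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    steps = pfam_profile_commands("pfam_msa.sto.gz", "work")
    steps += pfam_search_commands("genes.faa", "work")
    for cmd in steps:
        print(" ".join(cmd))
```

Running the script prints the four commands in order. Calling run_all(steps) would execute them, provided mmseqs is installed and the input files exist.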

cmkobel commented 2 years ago

OK. Great! That answers my question. Thanks.