Can you use Hmmer annotated files to get count table?

EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs

http://hmmer.org

Other

322 stars 70 forks source link

Can you use Hmmer annotated files to get count table? #305

Closed lilyyycao closed 1 year ago

lilyyycao commented 1 year ago

Hello, I have done a de novo assembly and used prodigal in order to predict protein coding gene regions. I then used hmmer to do annotations based off of the pfamA database and got the resulting output file but was wondering if there was a way to output a file with count data or is that data lost after using hmmer. Thank you.

npcarter commented 1 year ago

Hello. It looks like you forgot to attach the output file. Also, when you say "count data", what exactly are you trying to count?

lilyyycao commented 1 year ago

Hello, I included a link with 3 files.

https://drive.google.com/drive/folders/1muP1CFGD1QTFrqDNfxiTBnqQDWHtBS1z?usp=sharing

The 2B.coords.gbk and 2B.protein.translations.fasta files are the output files I got after doing protein gene coding region predictions with prodigal.

I used the 2B.protein.translations.fasta with hmmer against the PfamA database to output 2B.pfam.out file. This is the following code I used:

hmmsearch Pfam-A.hmm 2B.protein.translations.fasta > 2B.pfam.out

What I would like to generate is something similar to an OTU table but with gene counts for each function for each sample so I can do differential expression.

npcarter commented 1 year ago

I don't believe we have an option that generates exactly what you're looking for, but I'd suggest that you take a look at the --tblout and --domtblout options to hmmsearch, which generate tablular output files that are intended to be easier for other programs to parse. --tblout generates an output file with one line per sequence-HMM comparison hit, while --domtblout generates a file with one line per domain that was deemed to be a significant match. My guess is that it'll be fairly easy to write a Python or other script that generates the table you're looking for from one of those output files.

lilyyycao commented 1 year ago

Thank you for your help, I will go ahead and take a look at those options!

lilyyycao commented 1 year ago

Hi again,

I wanted to ask as a follow up, when looking at the tbout file which seems to be in a closer format to what I would want to use, I noticed when annotating against the PfamA database that the description given isn't quite nicely listed as a specific function but instead more so as a code. I was wondering, is there an easy way to have an output file that lists the function in a more usable format?

For example, in the original out file, the Query is 1-cysPrx C but the description is 1 # 2 # 334 # 1 # ID=1750_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.682. Is there a way to have the description instead list something like C-terminal domain of 1-Cys peroxiredoxin instead of having to manually look up each accession/ID. I noticed in the original output file, it does list the more descriptive function on the top under the Query/Accession/Description header but not in the table itself.

Thank you!

-Lily

npcarter commented 1 year ago

I don't know of any way to modify what HMMER lists in the description field, sorry. It doesn't look like it'd be too hard to write a script to extract the full descriptions from the standard output and substitute them it the tblout, though. You might try that.