Target sequence length > 100K, over comparison pipeline limit

Valentin-Bio commented 1 year ago

Hello, I'm trying to search for hydrocarbon-degrading genes in my metagenome-assembled genomes (MAGs) using an HMM profile.

I'm testing the gene profiles (my_genes.hmm) against one of my assembled genomes (genome1.fasta).

For this purpose, I first translated the nucleotide sequences into amino acid sequences with the transeq program:

transeq -sequence genome1.fasta -outseq genome1.faa

After that, I ran hmmsearch as follows:

hmmsearch --cpu 9 --tblout hydrocarbon_table.txt my_genes.hmm genome1.faa > hydrocarbon_results.txt

But I'm getting the error message from the title:

Fatal exception (source file p7_pipeline.c, line 697):
Target sequence length > 100K, over comparison pipeline limit.
(Did you mean to use nhmmer/nhmmscan?)
zsh: abort hmmsearch --cpu 9 --tblout hydrocarbon_table.txt hydrocarbon.hmm >

Given that the input file contains amino acid sequences longer than 100K, how can I deal with this problem?

One possible solution (I don't know if it is the optimal one) is to use the hmmemit program to retrieve all the sequences from the profile and then run hmmscan with the retrieved amino acid sequences against the amino acid sequences of my MAG.
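
A quick look at the record lengths in the translated file can show whether any target really exceeds the 100K limit named in the error. The sketch below uses only awk; `fasta_lengths` is a hypothetical helper name (not part of HMMER), and the file name in the usage comment is taken from the commands above.

```shell
# Print "<sequence name><TAB><length>" for every record in a FASTA file.
# fasta_lengths is a hypothetical helper, not an HMMER or Easel tool.
fasta_lengths() {
  awk '
    /^>/ {                                   # header line: flush the previous record
      if (name != "") print name "\t" len
      name = substr($1, 2); len = 0
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: ignore whitespace
    END { if (name != "") print name "\t" len }
  ' "$1"
}

# Example: list records longer than 100,000 residues.
# fasta_lengths genome1.faa | awk -F'\t' '$2 > 100000'
```

Note that stop characters such as `*` are counted as residues here, so a whole-frame translation shows up as a handful of very long records.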

cryptogenomicon commented 1 year ago

There aren't yet any known proteins that long, to my knowledge. My guess is that you're translating your genome into six very long strings, with * or some such for stop codons, instead of into individual ORFs. HMMER expects protein targets to be individual protein sequences, not concatenated whole genomes and not containing non-amino-acid characters. The solution is to translate into individual ORFs. One way to do this is with the esl-translate tool included with HMMER.
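
A minimal sketch of that workflow, reusing the file names from the question. It assumes `esl-translate` and `hmmsearch` are on the PATH; `esl-translate` writes the ORFs it finds as separate protein records to standard output. The wrapper name `search_orfs` and the output name `genome1.orfs.faa` are assumptions made for illustration.

```shell
# search_orfs is a hypothetical wrapper; file names follow the question above.
search_orfs() {
  genome=$1                       # nucleotide FASTA, e.g. genome1.fasta
  profile=$2                      # profile HMM file, e.g. my_genes.hmm
  orfs="${genome%.*}.orfs.faa"    # e.g. genome1.orfs.faa

  # Translate the genome into individual ORFs, one protein record per ORF.
  esl-translate "$genome" > "$orfs" || return 1

  # Search the profiles against the ORFs rather than whole-frame translations.
  hmmsearch --cpu 9 --tblout hydrocarbon_table.txt "$profile" "$orfs" > hydrocarbon_results.txt
}

# Usage:
# search_orfs genome1.fasta my_genes.hmm
```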

Valentin-Bio commented 1 year ago

Yes, my bad, I forgot to retrieve the ORFs from the genome, thanks for the clarification.

EddyRivasLab / hmmer

Target sequence length > 100K, over comparison pipeline limit #302