EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org

Maximum sequence length HMMER can support #304

Closed perryXuu closed 1 year ago

perryXuu commented 1 year ago

Dear HMMER team,

I wonder whether HMMER supports building a valid profile from a very long alignment. For an alignment file in which each sequence is 4.4 megabases long, the profile we built cannot be validated: it yields a "whoops, profile is bad" error when we try to do anything further with it (for example, when we run hmmemit). We tried part of our alignment (100,000 in length) and it works fine, but the full-size alignment does not. Is the problem that the sequences are too long for HMMER to support, or could there be another reason?

Thank you!

Sincerely, Perry

cryptogenomicon commented 1 year ago

That's much longer than HMMER's designed to handle. We plan for query profile HMMs to be up to 100K residues long. We're expecting queries to be protein domains (~10-1000aa); or in the case of phmmer/jackhmmer, protein sequences (up to ~50Kaa); or in the case of nhmmer, alignments of DNA mobile elements that might get up to ~10-20Kb.

Aligning a 4.4Mb profile to a 4.4Mb sequence in H3 would cost 36 x 4.4M x 4.4M = about a petabyte of RAM, so alignments to queries that large aren't practical in our current design. There are also some numerical instabilities (accumulated roundoff error in probability calculations) that go outside our tolerances at larger sizes; that's where the "whoops, profile is bad" is coming from.
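
For anyone who wants to check the arithmetic behind that estimate, here is a minimal sketch. It assumes the factor of 36 is a per-cell cost in bytes for a profile-length by sequence-length dynamic programming matrix. The reply gives the number but does not spell out the unit, so treat that as an assumption. The function names are hypothetical and not part of HMMER.

```python
def dp_memory_bytes(profile_len, seq_len, bytes_per_cell=36):
    """Estimated RAM for a full profile x sequence DP matrix.

    bytes_per_cell=36 is taken from the figure quoted in the thread;
    reading it as "bytes per matrix cell" is an assumption.
    """
    return bytes_per_cell * profile_len * seq_len


def human_readable(num_bytes):
    """Format a byte count using decimal (SI) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        # Stop once the value fits under 1000, or when we run out of units.
        if value < 1000 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


if __name__ == "__main__":
    # The 4.4 Mb case from the question, and the 100K size mentioned in the reply.
    for length in (4_400_000, 100_000):
        estimate = dp_memory_bytes(length, length)
        print(f"{length:>9} x {length:<9} -> {human_readable(estimate)}")
```

Because the cost grows with the product of the two lengths, a 44x increase in length on both sides multiplies the memory by roughly 1936x, which is why the 100,000-column subset behaves so differently from the full alignment.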