Closed sjanssen2 closed 2 years ago
Hi @sjanssen2 ,
I hope that you are doing well and thanks for this detailed report, it was very easy to reproduce the behavior from the files and instructions you provided.
This is not a bug: cmsearch alignments may or may not have PP lines. Usually they do, but calculating the posterior probabilities requires a more memory-intensive algorithm, and if cmsearch determines that the calculation will exceed the maximum allowed amount of memory, then it falls back to a different algorithm (CYK) and reports the most likely parsetree instead of the maximum expected accuracy parsetree (MEA parsetree, which maximizes the sum of posterior probabilities of all emitted residues). If MEA is used, PP lines are output; if CYK is used, they are not.
For parsing, you can tell that the PP lines will be absent if the header line prior to the alignment includes 'cyksc' instead of 'acc' as the 10th field.
For example: with PP:
rank E-value score bias mdl mdl from mdl to seq from seq to acc trunc gc
---- --------- ------ ----- --- -------- -------- ----------- ----------- ---- ----- ----
(1) ! 1.1e-15 62.7 9.4 cm 144 360 ~] 339 122 - ~. 0.80 5' 0.31
without PP:
rank E-value score bias mdl mdl from mdl to seq from seq to cyksc trunc gc
---- --------- ------ ----- --- -------- -------- ----------- ----------- ------ ----- ----
(1) ! 1.5e-15 62.2 9.4 cm 16 360 ~] 339 122 - ~. 50.5 5' 0.31
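A minimal Python sketch of that check, assuming the header line has already been isolated by the parser. The function name `hit_has_pp` is hypothetical. Because column labels such as "mdl from" contain spaces, the sketch looks for the `acc` or `cyksc` token anywhere in the header rather than at a fixed position.

```python
def hit_has_pp(header_line: str) -> bool:
    """Return True if the hit header has an 'acc' column (PP lines expected),
    False if it has a 'cyksc' column (CYK fallback, no PP lines)."""
    tokens = header_line.split()
    if "cyksc" in tokens:
        return False
    if "acc" in tokens:
        return True
    raise ValueError("not a cmsearch hit header line: %r" % header_line)


# The two header lines quoted above.
with_pp = "rank E-value score bias mdl mdl from mdl to seq from seq to acc trunc gc"
without_pp = "rank E-value score bias mdl mdl from mdl to seq from seq to cyksc trunc gc"

print(hit_has_pp(with_pp))     # True
print(hit_has_pp(without_pp))  # False
```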
If you want to get PP lines more often, you can increase the maximum allowed memory with the --mxsize <x> option, which sets it to <x> Mb. The default value is actually auto-determined based on the model size (CLEN). The auto-determined value will always be at least 128 Mb and at most 512 Mb, which again you can override using the --mxsize <x> option. The auto-determined value is capped at a relatively low value because the default behavior is to run multi-threaded and each thread may require <x> Mb.
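As an illustration, here is a small Python sketch that builds such a command line and estimates the memory it might need. The helper names `cmsearch_command` and `worst_case_memory_mb` are hypothetical, the value 1024 is an arbitrary example, and the estimate simply multiplies the per-thread limit by the thread count, which is an assumption derived from the statement that each thread may require <x> Mb.

```python
def cmsearch_command(cm_path, fasta_path, mxsize_mb=None, cpu=None):
    """Build an argument list for cmsearch (hypothetical helper)."""
    cmd = ["cmsearch"]
    if cpu is not None:
        cmd += ["--cpu", str(cpu)]
    if mxsize_mb is not None:
        # Raise the matrix size limit so MEA (and thus PP lines) is used more often.
        cmd += ["--mxsize", str(mxsize_mb)]
    cmd += [cm_path, fasta_path]
    return cmd


def worst_case_memory_mb(mxsize_mb, n_threads):
    """Rough upper bound, assuming every thread uses the full limit at once."""
    return mxsize_mb * n_threads


cmd = cmsearch_command("rliB_RF01471.cm", "mininp_subseq.fasta",
                       mxsize_mb=1024, cpu=1)
print(" ".join(cmd))
print(worst_case_memory_mb(1024, 1), "Mb")
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` if cmsearch is on the PATH.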
However, it is not possible to guarantee that you will always have PP lines, so if you want a robust parser unfortunately you'll likely want to update the parser to handle cmsearch alignments without PP lines.
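One way a parser could tolerate the missing line is sketched below in Python. It assumes that annotation rows in an alignment block end with a label token (`NC`, `CS`, or `PP`) and treats every other row as model, match, or target text. The function name and the toy block are invented for illustration and are not real cmsearch output.

```python
def parse_alignment_block(lines):
    """Split one alignment block into labelled rows.

    'nc' and 'pp' stay None when the corresponding row is absent,
    so callers can handle CYK alignments without PP lines.
    """
    rows = {"nc": None, "cs": None, "pp": None, "other": []}
    for line in lines:
        stripped = line.rstrip()
        if not stripped.strip():
            continue  # skip blank lines
        body, _, label = stripped.rpartition(" ")
        if label in ("NC", "CS", "PP"):
            # Keep leading whitespace so columns still line up with the sequences.
            rows[label.lower()] = body.rstrip()
        else:
            rows["other"].append(stripped.strip())
    return rows


# Invented toy block, only to exercise the function.
block_with_pp = [
    "        <<<<____>>>> CS",
    "  toy 1 gggcuuaagccc 12",
    "        GGGC+UAAGCCC",
    "  seq 5 GGGCAUAAGCCC 16",
    "        ************ PP",
]
block_without_pp = block_with_pp[:-1]

print(parse_alignment_block(block_with_pp)["pp"] is not None)  # True
print(parse_alignment_block(block_without_pp)["pp"] is None)   # True
```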
A couple more points that may or may not be relevant to what you are doing:
- the tblout file will always include the same fields and so should be easier to parse, but of course does not include the alignments
- the --notextw option to cmsearch will make it so each alignment is in a single block, which may make it easier to parse.
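For the tabular file, a minimal Python sketch might look like the following. It assumes the default 18-column layout written by `cmsearch --tblout <f>`, with `#` comment lines and a free-text description as the last column. The field names in `TBLOUT_FIELDS` are my own labels, and the sample row is invented rather than taken from a real run.

```python
TBLOUT_FIELDS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "mdl", "mdl_from", "mdl_to", "seq_from", "seq_to", "strand",
    "trunc", "pass", "gc", "bias", "score", "evalue", "inc", "description",
]


def parse_tblout(lines):
    """Parse tblout lines into a list of dicts (values are kept as strings)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on whitespace, but keep the description (last column) intact.
        values = line.strip().split(None, len(TBLOUT_FIELDS) - 1)
        if len(values) != len(TBLOUT_FIELDS):
            raise ValueError("unexpected number of fields: %r" % line)
        hits.append(dict(zip(TBLOUT_FIELDS, values)))
    return hits


# Invented sample input, only to exercise the function.
sample = [
    "# comment line",
    "seq1 - toyRNA - cm 1 12 5 16 + no 1 0.50 0.0 20.0 1e-3 ! toy description here",
]
hits = parse_tblout(sample)
print(hits[0]["target_name"], hits[0]["score"], hits[0]["description"])
```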
Hi @nawrockie,
hope you are doing fine as well. Thanks for your quick and elaborate reply. It makes total sense to me, however I could not see this from the documentation. The explanation for this PP line in your documentation - as far as I know - is only one sentence: "The bottom line represents the posterior probability (essentially the expected accuracy) of each aligned residue."
I think it might be worth adding at least the information that this is an optional line. That way, you might keep others from falling into the same pitfall as I did.
Regarding parsing: yes, I am interested in the alignments.
Good point about the documentation. Thanks - I will update it.
Added info on possible missing PP line to documentation (914ac0b)
Looks like the PP line of a cmsearch full output is missing under specific circumstances.
The input sequence is mininp_subseq.fasta.txt (added .txt to be able to upload it here). The file was retrieved from https://www.ncbi.nlm.nih.gov/nuccore/NZ_MJUD01000004.1?report=fasta&from=54694&to=55032
The covariance model is rliB_RF01471.cm.txt downloaded from rfam
Executed as
cmsearch --cpu 1 rliB_RF01471.cm mininp_subseq.fasta
using the master branch of infernal, hmmer, and easel as of today; here is the config.log of configure
STDOUT is
My parser expects every alignment block to have the PP line, which is missing here. Is that a bug or are there any good reasons why this line is missing?
Switching to global alignments with -g produces the PP line, as does --notrunc or a combination of both flags.