missing PP line - Githubissues

sjanssen2 commented 3 years ago

Looks like the PP line of a cmsearch full output is missing under specific circumstances.

The input sequence is mininp_subseq.fasta.txt (added .txt to be able to upload to here) File was retrieved from https://www.ncbi.nlm.nih.gov/nuccore/NZ_MJUD01000004.1?report=fasta&from=54694&to=55032

The covariance model is rliB_RF01471.cm.txt downloaded from rfam

Executed as cmsearch --cpu 1 rliB_RF01471.cm mininp_subseq.fasta using the master branch of infernal, hmmer, easel as of today, here is the config.log of configure

STDOUT is

# cmsearch :: search CM(s) against a sequence database
# INFERNAL 1.1.4 (Dec 2020)
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query CM file:                         ../covariance_rli/rliB_RF01471.cm
# target sequence database:              mininp_subseq.fasta
# number of worker threads:              1 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       rliB  [CLEN=360]
Accession:   RF01471
Description: Listeria sRNA rliB
Hit scores:
 rank     E-value  score  bias  sequence                       start    end   mdl trunc   gc  description
 ----   --------- ------ -----  ----------------------------- ------ ------   --- ----- ----  -----------
  (1) !   1.5e-15   62.2   9.4  NZ_MJUD01000004.1:54694-55032    339    122 -  cm    5' 0.31  Listeria seeligeri strain BCW_4759 PROKKA_contig

Hit alignments:
>> NZ_MJUD01000004.1:54694-55032  Listeria seeligeri strain BCW_4759 PROKKA_contig000004, whole genome shotgun sequence
 rank     E-value  score  bias mdl mdl from   mdl to       seq from      seq to       cyksc trunc   gc
 ----   --------- ------ ----- --- -------- --------    ----------- -----------      ------ ----- ----
  (1) !   1.5e-15   62.2   9.4  cm       16      360 ~]         339         122 - ~.   50.5    5' 0.31

                                                   v         v                                                       NC
                                    ~~~~~~:::<<<<<<<_________>>>>>>>-----------------------------.------------------ CS
                           rliB   1 <[15]*ugggUUUuaguUAcuUAuucugAAAuGuAaAUcuagAaaaAAGauAuaAguAAau.uAaaugUuuuUuggUUUU 88 
                                             GUUUUA UUACUUAUU UGAAAUGUAAAU U +  A+    + +A + A +U U  AUGU +   GGUUUU
  NZ_MJUD01000004.1:54694-55032 339 <[ 0]*GCAGUUUUAAUUACUUAUUGUGAAAUGUAAAUAUUUUUAUGUUCACAACAAAGUUuUGCAUGUCAAGCGGUUUU 266

                                                  vvvv                                    v    vv                    NC
                                    -------------((((((,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,<<<<---<<<<<<<<___________>>> CS
                           rliB  89 AgUUACUUAUUgUGaAAugUAAAUugcUcUcuuUugUguAAaucAuAucaagcaUguAGCGUUUuAGUUACUUAUUaUgA 168
                                    A  UACUUAUUGUGAAAUGUAAAU+   C  ++U G  U A+U+ U U   :CA  UAG  UUUUA UUACUUAUU+UGA
  NZ_MJUD01000004.1:54694-55032 265 AACUACUUAUUGUGAAAUGUAAAUAAAGCCGAAUAGGCUUAUUUUUUUACGACAAAUAG--UUUUAAUUACUUAUUGUGA 188

                                      vv   v                           vvvv                       NC
                                    >>>>>-->>>>,,,~~~~~~~,,,,,,,,,,,,,))))))::::::::::::::::::::: CS
                           rliB 169 AAUGUAAAugcGaU*[138]*acAUAGgAuUAUccaUUuCcGUUaUAAUUAUUccUUGcAA 360
                                    AAUGUAA UG:G+U        CA   G U A CC    CCGUU+UAA U  U++U  + A
  NZ_MJUD01000004.1:54694-55032 187 AAUGUAAAUGUGGU*[ 16]*CCAGCUGGUGAGCC----CCGUUUUAACUGCUUAUCAUGA 122

Internal CM pipeline statistics summary:
----------------------------------------
Query model(s):                                                  1  (360 consensus positions)
Target sequences:                                                1  (678 residues searched)
Target sequences re-searched for truncated hits:                 1  (2034 residues re-searched)
Windows   passing  local HMM SSV           filter:               5  (0.625); expected (0.35)
Windows   passing  local HMM Viterbi       filter:                  (off)
Windows   passing  local HMM Viterbi  bias filter:                  (off)
Windows   passing  local HMM Forward       filter:               5  (0.625); expected (0.02)
Windows   passing  local HMM Forward  bias filter:               4  (0.5); expected (0.02)
Windows   passing glocal HMM Forward       filter:               4  (0.5); expected (0.02)
Windows   passing glocal HMM Forward  bias filter:               4  (0.5); expected (0.02)
Envelopes passing glocal HMM envelope defn filter:               4  (0.5); expected (0.02)
Envelopes passing  local CM  CYK           filter:               4  (0.5); expected (0.0001)
Total CM hits reported:                                          1  (0.1608); includes 1 truncated hit(s)

# CPU time: 5.14u 0.12s 00:00:05.26 Elapsed: 00:00:05.26
//
[ok]

My parser expects every alignment block to have the PP line, which is missing here. Is that a bug or are there any good reasons why this line is missing?

Switching to global alignments -g produces the PP line, as well as --notrunc or a combination of both flags.

nawrockie commented 3 years ago

Hi @sjanssen2 ,

I hope that you are doing well and thanks for this detailed report, it was very easy to reproduce the behavior from the files and instructions you provided.

This is not a bug: cmsearch alignments may or may not have PP lines. Usually they have PP lines, but calculating the posterior probabilities requires a more memory intensive algorithm and if cmsearch determines that calculation will exceed the maximum allowed amount of memory than it fails over to using a different algorithm (CYK) and reports the most likely parsetree instead of the maximum expected accuracy parsetree (MEA parsetree, which maximizes sum of posterior probability of all emitted residues). If MEA is used, PP lines are output and if CYK is used PP lines are not output.

For parsing, you can tell that the PP lines will be absent if the header line prior to the alignment includes 'cyksc' instead of 'acc' as the 10th field.

For example: with PP:

 rank     E-value  score  bias mdl mdl from   mdl to       seq from      seq to       acc trunc   gc
 ----   --------- ------ ----- --- -------- --------    ----------- -----------      ---- ----- ----
  (1) !   1.1e-15   62.7   9.4  cm      144      360 ~]         339         122 - ~. 0.80    5' 0.31

without PP:

 rank     E-value  score  bias mdl mdl from   mdl to       seq from      seq to       cyksc trunc   gc
 ----   --------- ------ ----- --- -------- --------    ----------- -----------      ------ ----- ----
  (1) !   1.5e-15   62.2   9.4  cm       16      360 ~]         339         122 - ~.   50.5    5' 0.31

If you want to get PP lines more often, you can increase the maximum allowed memory with the --mxsize <x> option to set to <x> Mb. The default value is actually auto-determined based on the model size (CLEN). The autodetermined value will always be a minimum of 128 Mb and a maximum is 512 Mb, which again you can override using the --mxsize <x> option. The autodetermined value is capped at a relatively low value because the default behavior is to run multi-threaded and each thread may require <x> Mb.

However, it is not possible to guarantee that you will always have PP lines, so if you want a robust parser unfortunately you'll likely want to update the parser to handle cmsearch alignments without PP lines.

A couple more points that may or may not be relevant to what you are doing:

the tblout file will always includes the same fields and so should be easier to parse, but of course does not include the alignments
the --notextw option to cmsearch will make it so each alignment is in a single block, which may make it easier to parse.

sjanssen2 commented 3 years ago

Hi @nawrockie, hope you are doing fine as well. Thanks for your quick and elaborate reply. It makes total sense to me, however I could not see this from the documentation. The explanation for this PP line in your documentation - as far as I know - is only one sentence The bottom line represents the posterior probability (essentially the expected accuracy) of each aligned residue. I think it might be worth adding at least the information, that this is an optional line. Thus, you might avoid others to tap into the same pitfall as me. Regarding parsing: yes, I am interested in the alignments.

nawrockie commented 3 years ago

Good point about the documentation. Thanks - I will update it.

nawrockie commented 2 years ago

Added info on possible missing PP line to documentation (914ac0b)

EddyRivasLab / infernal

missing PP line #29