Closed by yp 11 years ago
The alignment of genomic sequence and transcript sequence for the second last exon is not perfect. In particular, the alignment of the suffix is as follows:
    NM_001127  ...GCACGGACTTAGAG
                  ||||||  |.||||
    chr22      ...GCACGG--TGAGAG
The last 9 nucleotides of the suffix (`GACTTAGAG`) match exactly a stretch of the genomic sequence that was (incorrectly) predicted as intronic.
However, since the minimum factor length used by `est-fact` is 15, PIntron does not detect such a match and reports the wrong spliced alignment.
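To make the failure mode concrete, here is a minimal Python sketch (not PIntron code). It shows that a 9 nt suffix is shorter than the minimum factor length and so cannot be seeded, yet a plain exact search restricted to the predicted intron would still find it. Only the value 15 and the suffix `GACTTAGAG` come from the report above; the genomic fragment is a made-up toy sequence, and `can_be_seeded` and `find_exact_hits` are hypothetical helper names.

```python
MIN_FACTOR_LENGTH = 15  # minimum factor length reported above for est-fact


def can_be_seeded(fragment: str, min_len: int = MIN_FACTOR_LENGTH) -> bool:
    """True if the fragment is long enough to be considered as a factor."""
    return len(fragment) >= min_len


def find_exact_hits(fragment: str, genomic: str) -> list:
    """Return the 0-based start of every exact occurrence of fragment in genomic."""
    hits = []
    pos = genomic.find(fragment)
    while pos != -1:
        hits.append(pos)
        pos = genomic.find(fragment, pos + 1)  # allow overlapping occurrences
    return hits


suffix = "GACTTAGAG"  # misplaced exon suffix from the alignment above

# Toy genomic fragment (made up): exon end, intron, the 9 nt match, next intron start.
toy_genomic = "TTGCACG" + "GTGAGAGCCCTTTTTCTCTCAG" + suffix + "GTAAGT"

seedable = can_be_seeded(suffix)
hits = find_exact_hits(suffix, toy_genomic)
print("seedable:", seedable, "exact hits:", hits)
```

The point of the sketch is only that a post-processing step searching for short exact matches can recover what the factor-based phase skips by design.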
We propose, as done for issues #3 (commit e5dbb0e) and #4 (commit d137589), to post-process the spliced alignment in order to find potentially misplaced exon prefixes and/or suffixes (according to their edit distance) and to try to place them as new small exons. The new small exons should:
The bug is still present in PIntron `v1.2.53`.
It seems that the procedure predicts a considerable number of small exons that are more plausibly alignment artefacts.
Moreover, in some cases, PIntron correctly detects a true small exon but fails to place it so that two canonical introns are induced.
An example of this second point is RefSeq NM_014841 (gene SNAP91), which has an annotated 6 nt exon. PIntron (version `v1.2.55`) correctly identifies the small exon, but the computed alignment places it at a genomic position that induces two non-canonical introns (`GT-TT` and `AT-AG`), while a different placement induces two canonical introns (`GT-AG`).
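The placement problem can be illustrated with a small Python sketch. It is an assumption-laden toy, not the procedure PIntron uses: it enumerates every exact occurrence of the small exon between the two flanking exons and ranks the placements by how many of the two induced introns are `GT-AG`. The function names, the 4 nt minimum intron length, and the toy sequence are all hypothetical; the sequence was built so that the two candidate placements reproduce the `GT-TT`/`AT-AG` versus `GT-AG`/`GT-AG` situation described above.

```python
def intron_signature(genomic: str, start: int, end: int) -> str:
    """Donor-acceptor dinucleotides of the intron genomic[start:end] (0-based, end-exclusive)."""
    return genomic[start:start + 2] + "-" + genomic[end - 2:end]


def rank_small_exon_placements(genomic, left_exon_end, right_exon_start, small_exon):
    """Rank exact placements of small_exon between two exons, most canonical first.

    left_exon_end:    index just past the last base of the upstream exon.
    right_exon_start: index of the first base of the downstream exon.
    Returns tuples (n_canonical, position, first_intron, second_intron).
    """
    k = len(small_exon)
    min_intron = 4  # assumption: donor and acceptor dinucleotides must not overlap
    candidates = []
    for pos in range(left_exon_end + min_intron, right_exon_start - min_intron - k + 1):
        if genomic[pos:pos + k] != small_exon:
            continue
        first = intron_signature(genomic, left_exon_end, pos)
        second = intron_signature(genomic, pos + k, right_exon_start)
        n_canonical = (first == "GT-AG") + (second == "GT-AG")
        candidates.append((n_canonical, pos, first, second))
    # Most canonical introns first; ties broken by leftmost position.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates


# Toy data (made up): the 6 nt exon occurs twice between the flanking exons.
small_exon = "ACCTGA"
left_exon, right_exon = "CCCAAA", "TTTGGG"
toy_genomic = (left_exon + "GTCCTT" + small_exon + "ATCCCCAG"
               + small_exon + "GTCCCCAG" + right_exon)

ranked = rank_small_exon_placements(
    toy_genomic, len(left_exon), len(toy_genomic) - len(right_exon), small_exon)
for n_canonical, pos, first, second in ranked:
    print(pos, first, second, n_canonical)
```

Treating only `GT-AG` as canonical is a simplification made for the sketch; a real implementation would presumably reuse whatever intron classification the tool already applies.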
While these two points seem unrelated, we believe that they stem from the same cause and, in particular, that the criteria used to identify small exons need to be more stringent than the ones presented above.
Following expert advice, we devised the following strategy.
First, we accept an alignment presenting a small exon if and only if:
`include/classify-intron.h`).

During the alignment refinement phase, a small exon is searched for if at least one of the following conditions is met:

The procedure that detects possible small exons looks for the longest small exon that meets conditions (1)-(3).
The RefSeq annotation of transcript NM_001127 (gene AP1B1) contains a "small" exon (9 nt) at position `2954..2962` (1-based, inclusive, relative to the transcript sequence), while PIntron factorizes the transcript as follows (last two exons):

(The last exon coincides with the reference annotation.)

PIntron version: `v1.2.25`

Input files: https://gist.github.com/14784a4ceec5e20f73d8

Command-line:
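As a quick sanity check on the coordinates, the following Python sketch converts the 1-based inclusive interval quoted above into a 0-based slice and confirms the 9 nt length. The helper names are hypothetical and the transcript string is a placeholder of `N`s, not the real NM_001127 sequence.

```python
def interval_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def to_slice(start: int, end: int) -> slice:
    """Convert a 1-based, inclusive interval to a 0-based, end-exclusive slice."""
    return slice(start - 1, end)


exon_start, exon_end = 2954, 2962        # annotated small exon, from the report
placeholder_transcript = "N" * 3000      # stand-in sequence, NOT the real transcript

small_exon = placeholder_transcript[to_slice(exon_start, exon_end)]
print(interval_length(exon_start, exon_end), len(small_exon))
```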