AlgoLab / PIntron

A novel pipeline for gene-structure prediction based on spliced alignment of transcript sequences (ESTs and mRNAs) against a genomic sequence
http://www.algolab.eu/PIntron

Small exons not found #11

Closed: yp closed this issue 11 years ago

yp commented 12 years ago

The RefSeq annotation of transcript NM_001127 (gene AP1B1) contains a "small" exon (9nt) at positions 2954..2962 (1-based, inclusive, relative to the transcript sequence), whereas PIntron factorizes the transcript as follows (last two exons):

2799 2962   92664 92825
2963 4178   94301 95516

(The last exon coincides with the reference annotation.)
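
As a quick sanity check on the coordinates above, the following Python sketch computes the lengths of the reported factors. The column meaning (transcript start, transcript end, genomic start, genomic end) is inferred from the output shown rather than stated in the issue, and the helper name `span_len` is purely illustrative.

```python
# Factors as reported above, read as
# (transcript_start, transcript_end, genomic_start, genomic_end), 1-based inclusive.
# This column interpretation is an assumption inferred from the issue text.
factors = [
    (2799, 2962, 92664, 92825),
    (2963, 4178, 94301, 95516),
]

def span_len(start, end):
    """Length of a 1-based inclusive interval."""
    return end - start + 1

for t_start, t_end, g_start, g_end in factors:
    t_len = span_len(t_start, t_end)
    g_len = span_len(g_start, g_end)
    print(f"transcript {t_start}..{t_end}: {t_len} nt, "
          f"genomic {g_start}..{g_end}: {g_len} nt, difference {t_len - g_len}")

# The annotated small exon occupies transcript positions 2954..2962.
small_exon_len = span_len(2954, 2962)
print(f"annotated small exon: {small_exon_len} nt")
```

Under this reading, the second-to-last factor covers two more nucleotides on the transcript than on the genome, so its alignment cannot be gap-free, while the last factor has equal lengths on both sequences.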

PIntron version: v1.2.25

Input files: https://gist.github.com/14784a4ceec5e20f73d8

Command-line:

./bin/pintron --bin-dir=./bin/ -k --genomic=genomic.txt --EST=ests.txt --organism=human --gene=AP1B1 --output=pintron-full-output.json --gtf=pintron-cds-annotated-isoforms.gtf --extended-gtf=pintron-all-isoforms.gtf --logfile=pintron-pipeline-log.txt --general-logfile=pintron-log.txt

yp commented 12 years ago

The alignment between the genomic sequence and the transcript sequence for the second-to-last exon is not perfect. In particular, the suffix is aligned as follows:

NM_001127    ...GCACGGACTTAGAG
                ||||||  |.||||
chr22        ...GCACGG--TGAGAG

The last 9 nucleotides of the suffix (GACTTAGAG) match the genomic sequence exactly in a region (incorrectly) predicted as intronic. However, since the minimum factor length used by est-fact is 15, PIntron does not detect such a match and reports the wrong spliced alignment.
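
To make the numbers concrete, here is a minimal Python sketch that measures the edit distance of the alignment fragment shown above and compares the length of the misplaced suffix with the minimum factor length. Only the visible part of each sequence is used (the `...` prefix is left out), and `edit_distance` is a generic textbook Levenshtein routine, not PIntron code.

```python
def edit_distance(a, b):
    """Textbook dynamic-programming Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]

MIN_FACTOR_LEN = 15  # minimum factor length quoted in the comment above

transcript_suffix = "GCACGGACTTAGAG"  # visible part of NM_001127
genomic_suffix = "GCACGGTGAGAG"       # visible part of chr22, gap symbols removed
misplaced = "GACTTAGAG"               # the 9 nt that match elsewhere in the genome

print("edit distance of the shown fragment:",
      edit_distance(transcript_suffix, genomic_suffix))
print("misplaced suffix shorter than minimum factor:",
      len(misplaced) < MIN_FACTOR_LEN)
```

The distance agrees with the alignment as drawn: two gap columns plus one mismatch.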

We propose, as done for issue #3 (commit e5dbb0e) and #4 (commit d137589), to post-process the spliced alignment in order to find potentially misplaced exon prefixes and/or suffixes (according to their edit distance) and to try to place them as new small exons. The new small exons should:

yp commented 12 years ago

The bug is still present in PIntron v1.2.53.

yp commented 11 years ago

The procedure appears to predict a considerable number of small exons that are more plausibly alignment artefacts. Moreover, in some cases PIntron correctly detects a true small exon but fails to place it so that two canonical introns are induced. An example of this second point is RefSeq NM_014841 (gene SNAP91), which has an annotated 6nt exon. PIntron (version v1.2.55) correctly identifies the small exon, but the computed alignment places it at a genomic position that induces two non-canonical introns (GT-TT and AT-AG), whereas a different placement induces two canonical introns (both GT-AG).
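
The placement problem can be illustrated with a small Python sketch. The genomic sequence below is a made-up toy, not the SNAP91 locus; it is built so that a 6nt exon occurs twice between two flanking exons, once with non-canonical flanks and once with GT-AG flanks. The function names and the 0-based, end-exclusive coordinates are assumptions of this sketch.

```python
def intron_pattern(genome, start, end):
    """Donor-acceptor dinucleotides of the intron genome[start:end]."""
    return f"{genome[start:start + 2]}-{genome[end - 2:end]}"

def placements(genome, up_end, down_start, small_exon):
    """List (position, left intron pattern, right intron pattern) for every
    exact occurrence of small_exon between the two flanking exons."""
    found = []
    pos = genome.find(small_exon, up_end + 2)
    while pos != -1 and pos + len(small_exon) + 2 <= down_start:
        left = intron_pattern(genome, up_end, pos)
        right = intron_pattern(genome, pos + len(small_exon), down_start)
        found.append((pos, left, right))
        pos = genome.find(small_exon, pos + 1)
    return found

# Toy data (hypothetical): upstream exon, intron containing a decoy copy of
# the small exon, the well-placed small exon, second intron, downstream exon.
small_exon = "TCGACC"
genome = ("CCCCCC" + "GT" + "AATT" + small_exon + "ATAA" + "AG"
          + small_exon + "GTAAACAG" + "GGGGGG")
up_end, down_start = 6, 38

candidates = placements(genome, up_end, down_start, small_exon)
canonical = [c for c in candidates if c[1] == "GT-AG" and c[2] == "GT-AG"]
for pos, left, right in candidates:
    print(pos, left, right)
print("canonical placements:", canonical)
```

In this toy, both occurrences match the exon exactly, so sequence identity alone cannot choose between them; only the splice-site patterns distinguish the two placements.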

While these two points seem unrelated, we believe that they stem from the same cause and, in particular, that the criteria used to identify small exons need to be more stringent than the ones presented above.

yp commented 11 years ago

Following expert advice, we devised the following strategy.

First, we accept an alignment presenting a small exon if and only if:

  1. it is at least 6nt long and matches the genomic sequence perfectly; and
  2. the suffix of the previous exon and the prefix of the following exon match the corresponding genomic sequence perfectly (over at least 6nt-8nt); and
  3. the two resulting introns are both classified as U2 or U12 (see file include/classify-intron.h).
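
The three acceptance conditions can be sketched as a single predicate. This is an illustration under stated assumptions, not PIntron's implementation: the function and parameter names are hypothetical, the flank length is fixed at the lower bound of the 6nt-8nt range, and the intron check is a stand-in that only looks at terminal dinucleotides instead of reproducing include/classify-intron.h.

```python
MIN_SMALL_EXON_LEN = 6
MIN_FLANK_MATCH = 6  # the comment gives 6nt-8nt; the lower bound is assumed here

# Assumption: simplified stand-in for include/classify-intron.h.
# Only the terminal dinucleotides commonly associated with U2/U12 introns are checked.
ASSUMED_U2_U12_PATTERNS = {"GT-AG", "GC-AG", "AT-AC"}

def looks_u2_or_u12(intron_seq):
    pattern = f"{intron_seq[:2]}-{intron_seq[-2:]}"
    return len(intron_seq) >= 4 and pattern in ASSUMED_U2_U12_PATTERNS

def accept_small_exon(small_t, small_g, prev_t, prev_g, next_t, next_g,
                      intron1, intron2, flank=MIN_FLANK_MATCH):
    """*_t are transcript sequences, *_g the genomic sequences they align to."""
    # Condition 1: long enough and identical to the genome.
    cond1 = len(small_t) >= MIN_SMALL_EXON_LEN and small_t == small_g
    # Condition 2: neighbouring exon ends match the genome exactly.
    cond2 = (len(prev_t) >= flank and len(next_t) >= flank
             and prev_t[-flank:] == prev_g[-flank:]
             and next_t[:flank] == next_g[:flank])
    # Condition 3: both induced introns pass the (assumed) classifier.
    cond3 = looks_u2_or_u12(intron1) and looks_u2_or_u12(intron2)
    return cond1 and cond2 and cond3

# Hypothetical example values, chosen only to exercise the predicate.
example = dict(small_t="GACTTAGAG", small_g="GACTTAGAG",
               prev_t="AAACCCGGG", prev_g="AAACCCGGG",
               next_t="TTTGGGCCC", next_g="TTTGGGCCC",
               intron1="GTAAGTCCAG", intron2="GTAAGACTAG")
print(accept_small_exon(**example))
```

Passing the sequences explicitly keeps the sketch independent of any particular alignment data structure; a real implementation would presumably read them from the spliced alignment itself.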

During the alignment refinement phase, a small exon is searched for if at least one of the following conditions is met:

  1. the intron is not classified as U2/U12; or
  2. the edit distance of the suffix of the previous exon plus the edit distance of the prefix of the following exon is greater than 2.
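
The trigger rule is a simple disjunction, sketched below. The function name and the way the inputs are supplied (a precomputed classification flag and two precomputed edit distances) are assumptions of this sketch; the issue does not say how long the compared suffix and prefix are.

```python
MAX_FLANK_EDITS = 2  # threshold taken from the second condition above

def should_search_small_exon(intron_is_u2_or_u12,
                             prev_suffix_edits, next_prefix_edits):
    """Return True when the refinement phase should look for a small exon."""
    bad_intron = not intron_is_u2_or_u12
    noisy_flanks = prev_suffix_edits + next_prefix_edits > MAX_FLANK_EDITS
    return bad_intron or noisy_flanks

# The AP1B1 fragment shown earlier in the thread has two gap columns and one
# mismatch in the suffix of the previous exon, i.e. three edits.
print(should_search_small_exon(True, 3, 0))
```

With three edits in the flanking suffix, the search would be triggered even if the intron were classified as U2 or U12.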

The procedure that detects possible small exons looks for the longest small exon that meets acceptance conditions (1)-(3) above.
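
One way to read "longest first" is sketched below: peel the last k nucleotides off the previous exon, starting from the largest k, and stop at the first candidate that satisfies the three acceptance conditions. This is a simplified illustration, not the PIntron procedure. It assumes the retained part of the previous exon aligns to the genome without gaps, it only peels from the previous exon (not from the following one), the upper bound of 14nt is derived from the minimum factor length of 15, the intron check is a dinucleotide stand-in for include/classify-intron.h, and the sequences are toy data.

```python
MIN_LEN = 6
ASSUMED_PATTERNS = {"GT-AG", "GC-AG", "AT-AC"}  # stand-in for classify-intron.h

def looks_u2_or_u12(intron):
    return len(intron) >= 4 and f"{intron[:2]}-{intron[-2:]}" in ASSUMED_PATTERNS

def longest_small_exon(genome, prev_t, prev_g_start, next_t, next_g_start,
                       max_len=14):
    """Return (genomic position, length) of the longest acceptable small exon
    split off the end of prev_t, or None. Coordinates are 0-based."""
    for k in range(min(max_len, len(prev_t) - MIN_LEN), MIN_LEN - 1, -1):
        kept, cand = prev_t[:-k], prev_t[-k:]
        kept_g_end = prev_g_start + len(kept)  # assumes a gap-free kept part
        # Condition 2: both flanking exon ends match the genome exactly.
        if genome[kept_g_end - MIN_LEN:kept_g_end] != kept[-MIN_LEN:]:
            continue
        if genome[next_g_start:next_g_start + MIN_LEN] != next_t[:MIN_LEN]:
            continue
        # Condition 1: exact occurrence of the candidate inside the old intron.
        pos = genome.find(cand, kept_g_end, next_g_start)
        while pos != -1:
            # Condition 3: both induced introns pass the assumed classifier.
            if (looks_u2_or_u12(genome[kept_g_end:pos])
                    and looks_u2_or_u12(genome[pos + k:next_g_start])):
                return pos, k
            pos = genome.find(cand, pos + 1, next_g_start)
    return None

# Toy data (hypothetical): the true structure is exon / intron / 8nt exon / intron / exon,
# but the 8nt exon was initially glued to the end of the previous exon.
genome = ("ACCTGACTGA" + "GTAAGTTTTCAG" + "TCGACCTA"
          + "GTATGCCCCTAG" + "GGCATTCAGG")
prev_t = "ACCTGACTGA" + "TCGACCTA"
next_t = "GGCATTCAGG"

result = longest_small_exon(genome, prev_t, 0, next_t, 42)
print(result)
```

Iterating from the longest candidate downwards means the first hit is also the longest one, so no explicit comparison between candidates is needed.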