TomSkelly / MatchAnnot

Python scripts for matching output of Pacific Biosciences IsoSeq RNA-seq pipeline to an annotation file.
GNU General Public License v3.0
24 stars 13 forks source link

Exon solely consisting of inserted sequence #14

Open juheon opened 5 years ago

juheon commented 5 years ago

Let's say that the transcript contains a novel exon which is the result of insertion. This exon has 0 bp aligned to the reference genome since the entire exon sequence is the result of insertion.

In that case, matchAnnot.py returned following error.

Traceback (most recent call last):
  File "~/MatchAnnot/matchAnnot.py", line 546, in <module>
    main()
  File "~/MatchAnnot/matchAnnot.py", line 179, in main
    bestTran, bestScore = matchTranscripts (exons, curGene)     # match this cluster to all transcripts of gene
  File "~/MatchAnnot/matchAnnot.py", line 290, in matchTranscripts
    showCoords (readExons, tran)
  File "~/MatchAnnot/matchAnnot.py", line 460, in showCoords
    printMatchingExons (ixR, ixT, readExons[ixR], tranExons[ixT])
  File "~/MatchAnnot/matchAnnot.py", line 491, in printMatchingExons
    print ' sub: %2d  Q: %4.1f' % (exonR.substs, exonR.QScore()),  # comma: line continued in printStartStop
  File "~/MatchAnnot/CigarString.py", line 61, in QScore
ZeroDivisionError: float division by zero

If you look into the matchAnnot.txt, it will be like

tr:       PREB-201              sc: 2  ex:  9   2166  id: ENST00000260643.6_1    [1] [2] [4] [5] [6] [7] [8] [9]
exon:                 1   1    27353624   27353624      0   27354192   27354376    184      len:  569  753  ins: 95  del: 17  sub: 43  Q:  5.6   stop:   -94
exon:                 2   2    27354675   27354540   -135   27354674   27354699     25      len:    0  160  ins: 1176  del:  0

This is wrong since the second exon of the PacBio read does not locate within the second exon of GENCODE transcript. For the exon 2 of PacBio read, since the size of alignment is 0, we cannot say that the exon 2 of PacBio read matches with the exon 2 of GENCODE transcript.

The matchAnnot.txt should be like

tr:       PREB-201              sc: 2  ex:  9   2166  id: ENST00000260643.6_1    [1] [] [4] [5] [6] [7] [8] [9]
exon:                 1   1    27353624   27353624      0   27354192   27354376    184      len:  569  753  ins: 95  del: 17  sub: 43  Q:  5.6   stop:   -94
exon:                 .   2           .   27354540      .          .   27354699      .      len:    .  160
exon:                 2   .    27354675          .      .   27354674          .      .      len:    0    .  ins: 1176  del:  0  sub:  0  Q:  0.0

I modified two functions to fix the problem 1) matchAnnot.py

def findOverlaps (list1, list2):
    ...
    for pos1 in xrange(len(list1)):     # for each list1 entry, find all list2 which overlap it        
        nLen_readexon=list1[pos1].end-list1[pos1].start+1; #added line
        if nLen_readexon < 1:                                                   #added line
            pos1 += 1                                                                  #added line
            continue;                                                                  #added line

2) CigarString.py

def QScore (self):
        ...
        if self.substs is None:
            Q = 0.0            # no MD string was supplied, can't compute Q score
        elif (self.end - self.start + 1)<1:    #added line
            Q=0.0;                                        #added line

Please let me know if any changes I have made is wrong