Closed bmmalone closed 7 years ago
An example of a bad ORF: ENSMUST00000159108_1:19063335-19064882
On Ensembl it says:
Gm15825-001 ENSMUST00000159108.2 918 No protein
Antisense
The "No protein" part is okay (the ORF type would just be "noncoding"); however, the extracted ORF extends into the intron of that transcript (image below; top track is the annotation and the bottom track is the bad ORF); there is also nothing in the de novo assembly there. However, there is a start codon right at the end of the 3' exon in that image. I believe this is a corner case that is not handled correctly.
After looking more, it seems the problem for forward-strand ORFs is when an in-frame stop is the first codon for an exon.
For reverse-strand ORFs, the problem seems to be start codons at the 5' end of an exon.
So this is a problem in misc.bio_utils.bed_utils.get_gen_pos
Upon even further inspection, the problem is not exactly with misc.bio_utils.bed_utils.get_gen_pos.
Consider a block structure like:
So, if we ask for the genomic coordinate of transcript coordinate 10, then the correct answer (which is returned by get_gen_pos) is 30.
However (following Ensembl conventions), stop codons are not included in ORFs. So, if a relevant (forward-strand) stop codon begins at transcript coordinate 10, we really want the genomic position that comes after transcript coordinate 9 (so, 19+1=20). N.B. This genomic coordinate need not actually be part of the transcript.
The problem is essentially the same for start codons on the reverse strand since the last base of the block structure is not included (so we want to point to one genomic position past the "A" in ATG, regardless of whether it is actually part of the transcript).
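To make the forward-strand case concrete, here is a minimal sketch. The block structure, the function names, and the 0-based half-open interval convention are all assumptions chosen only to reproduce the numbers quoted above (transcript coordinate 9 maps to genomic 19, transcript coordinate 10 maps to genomic 30). This is not the actual get_gen_pos implementation.

```python
from typing import List, Tuple

# Hypothetical block structure: two exons as 0-based, half-open genomic
# intervals. Transcript positions 0-9 fall in the first block and 10-19
# in the second.
BLOCKS: List[Tuple[int, int]] = [(10, 20), (30, 40)]


def transcript_to_genomic(blocks: List[Tuple[int, int]], pos: int) -> int:
    """Map a 0-based transcript position to its genomic position."""
    if pos < 0:
        raise ValueError("transcript position must be non-negative")
    offset = pos
    for start, end in blocks:
        length = end - start
        if offset < length:
            return start + offset
        offset -= length
    raise ValueError("transcript position lies beyond the last block")


def orf_end_before_stop(blocks: List[Tuple[int, int]], stop_pos: int) -> int:
    """Exclusive genomic end of a forward-strand ORF whose stop codon
    begins at transcript position stop_pos.

    Mapping stop_pos directly would jump across the intron when the stop
    codon is the first codon of a block, so instead map the last coding
    base (stop_pos - 1) and add one.
    """
    if stop_pos == 0:
        # No base precedes position 0, so subtract-one is undefined here.
        raise ValueError("no transcript position precedes 0")
    return transcript_to_genomic(blocks, stop_pos - 1) + 1


naive_end = transcript_to_genomic(BLOCKS, 10)   # 30: ORF would span the intron
adjusted_end = orf_end_before_stop(BLOCKS, 10)  # 20: ORF ends with the first block
```

With these assumed blocks, the naive mapping gives 30, so the ORF would cover the intron between 20 and 30, while the adjusted mapping gives 20. Note that 20 is not itself a position in the transcript, which matches the N.B. above.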
A bad ORF on the forward strand is: ENSMUST00000134384_1:4832348-4837000:+
This "subtract-one/add-one" fix breaks for start codons at the first position in the transcript.
It is not clear why, but ORF extraction seems to sometimes extract wrong ORFs. This appears to happen when there are start or stop codons near exon boundaries, but not exclusively.