dieterich-lab / rp-bp

Rp-Bp is a Bayesian approach to predict, at base-pair resolution, ribosome occupancy and translation.
MIT License
7 stars 5 forks source link

ORF extraction occasionally extracts "wrong" ORFs #64

Closed bmmalone closed 7 years ago

bmmalone commented 7 years ago

It is not clear why, but ORF extraction seems to sometimes extract wrong ORFs. This appears to happen when there are start or stop codons near exon boundaries, but not exclusively.

bmmalone commented 7 years ago

An example of a bad ORF: ENSMUST00000159108_1:19063335-19064882

CDieterich commented 7 years ago

On Ensemble it says Gm15825-001 ENSMUST00000159108.2 918 No protein
Antisense

http://www.ensembl.org/Mus_musculus/Transcript/Sequence_cDNA?db=core;g=ENSMUSG00000089787;r=1:19063300-19104840;t=ENSMUST00000159108

bmmalone commented 7 years ago

The "No protein" part is okay (the ORF type would just be "noncoding"); however, the extracted ORF extends into the intron of that transcript (image below; top track is the annotation and the bottom track is the bad ORF); there is also nothing in the de novo assembly there. However, there is a start codon right at the end of the 3' exon in that image. I believe this is a corner case that is not handled correctly.

wrong-orf

wrong-orf-start

bmmalone commented 7 years ago

After looking more, it seems the problem for forward-strand ORFs is when an in-frame stop is the first codon for an exon.

wrong-orf-forward-end

For reverse-strand ORFs, the problem seems to be start codons at the 5' end of an exon.

wrong-orf-reverse-start

So this is a problem in misc.bio_utils.bed_utils.get_gen_pos

bmmalone commented 7 years ago

Upon even further inspection, the problem is not exactly with misc.bio_utils.bed_utils.get_gen_pos.

Consider a block structure like:

So, if we ask for the genomic coordinate of transcript coordinate 10, then the correct answer (which is returned by get_gen_pos) is 30.

However (following Ensemble conventions), stop codons are not included in ORFs. So, if a relevant (forward strand) stop codon begins at transcript coordinate 10, we really want to look at the genomic position which comes after transcript coordinate 9 (so, 19+1=20). N.B. This genomic coordinate need not actually be part of the transcript.

The problem is essentially the same for start codons on the reverse strand since the last base of the block structure is not included (so we want to point to one genomic position past the "A" in ATG, regardless of whether it is actually part of the transcript).

bmmalone commented 7 years ago

A bad ORF on the forward strand is: ENSMUST00000134384_1:4832348-4837000:+

bmmalone commented 7 years ago

This "subtract-one/add-one" fix breaks for start codons at the first position in the transcript.