same strand overlapping ORFs

dieterich-lab / rp-bp

Rp-Bp is a Bayesian approach to predict, at base-pair resolution, ribosome occupancy and translation.

MIT License

7 stars 5 forks source link

same strand overlapping ORFs #96

Closed eboileau closed 5 years ago

eboileau commented 6 years ago

In a few cases, we predict upstream ORFs in the 3'UTR of a transcript, or vice-versa, but there may be many more latent in the initial set of ORFs. The problem is not exactly when labeling of the ORFs, but also in extracting the coordinates, so this happens first in extract-orf-coordinates.

This could happen for same sense overlapping transcripts, e.g. if we happen to parse ENSMUST00000155405.7 (Nudt8) before XM_017318269.1 (Doc2g), then any duplicate ORFs appearing in the second will be removed after dropping duplicates. When labeling the ORFs (label-orfs), however, because of the order in which operations are performed, the little green ORF, which is associated with the first transcript (ENSMUST00000155405.7), is then labeled as upstream, because it overlaps with the 5'UTR of the second (XM_017318269.1).

Even if the labeling would be consistent, this would not solve the problem of assigning the right category to this kind of predictions (i.e. is it downstream of one, or upstream of the other? Given its position here, it is more likely to be an upstream of the second CDS.) In fact, one could even argue that the yellow downstream ORF is problematic... It is not clear how to implement this kind of decision, or what to do exactly in such cases, but something should be done in the medium to longer term.

Also, I suspect that related cases may arise, even for transcript variants, when using more complex annotations.

overlap

eboileau commented 6 years ago

The most sensible way to deal with these particular cases would probably be to remove all duplicate ORFs completely, and label them separately. A given ORF could then be matched to a given transcript, and thus labeled unambiguously wrt to this transcript, based on its distance to the annotated CDS.

eboileau commented 6 years ago

...ok, I found some other examples... this also happens a lot when an ORF is found to overlap with the CDS of one transcript variant, as well as with the UTR of another; because we only use the id of the ORF to select the matches, a lot of these are given the overlapping label when they should not.

There are also cases where the labeling is correct, but only by chance, e.g when the ORF is assigned an id matching a transcript for which he is upstream (because of the order in which operations are performed, as explained above.)

bmmalone commented 6 years ago

Hi Etienne,

The ORF id's are not used for annotating the types (because of the issues you mentioned). Bed operations are used to find regions which overlap annotations. However, there is a relatively arbitrary precedence order about how the ORF types are assigned. I believe that roughly "explains" the annotations you are seeing.

For the first problem (i.e., both upstream and downstream), I am not sure there is a "non-arbitrary" approach. Right now (not really be design), preference is given to downstream. The logic is in lines 300-350 of label-orfs.

For the 'overlapping' ORFs, you can try the --filter flag for label-orfs. That should at least take care of the ORFs which are completely covered by the annotations. However, the various upstream and downstream types have precedence over the "suspicious" type (i.e., overlapping annotated CDS regions in some weird way; they have the lowest precedence and are assigned at the end of the script).

I'm not sure that's very satisfying, but it is at least why the annotations are the way they are. You can update label-orfs to affect the precedence order.

eboileau commented 6 years ago

Hi Brandon, thanks for your advice.

The types are determined using coordinates, yes, wrt to annotations, but the filtering is actually done using the b_info, which is the ORF id. This is problematic in cases where we have, e.g 2 transcript variants in the annotations, and an ORF that overlaps with the CDS of one transcript variant (and not with the UTR), as well as being entirely in the UTR of another. It would be wrongly labeled as five_prime_overlap or three_prime_overlap. This is a genuine problem.

For the first problem, yes, I agree that we need to make a choice. However the issue is that currently an ORF could be assigned one ID (in extract-orf-coordinates), then labeled wrt another transcript variant (in label-orfs), and this is why we observed these discrepancies, e.g. above, the green ORF is assigned an id associated with the transcript ENSMUST00000155405, which is the leftmost CDS, but labeled as upstream, which clearly wrong wrt to this transcript

I think I understand what you mean by the ORF id's are not used for annotating the types, but I think that an ORF should nevertheless be labeled wrt to this id (i.e. the transcript part of the id), otherwise the predictions are too difficult to interpret!

Yes, I have a plan to address these issues over the next few weeks/months.

eboileau commented 6 years ago

This is one example of wrongly labeled as five_prime_overlap. This shows the overlap with the CDS of one transcript

transcript_overlap(a_info='NM_001277101.1', b_info='NM_013715.2_1:10037876-10037975:-', overlap=57, a_fraction=0.05993690851735016, b_fraction=0.5757575757575758)

and the UTR of another

transcript_overlap(a_info='NM_013715.2', b_info='NM_013715.2_1:10037876-10037975:-, overlap=99, a_fraction=0.28530259365994237, b_fraction=1.0)

Because we only use b_info (appearing in both) to select the mask, then we say it's five_prime_overlap. However, it is not overlapping the UTR of NM_001277101.1 (it can be seen on IGV/GB), and not the CDS of NM_013715.2, i.e. it is entirely in the UTR of NM_013715.2 (b_fraction=1.0), so really it should be labeled as five_prime.

bmmalone commented 6 years ago

I see what you mean about these types of ORFs. The problem is on Lines 316 and 317 of label-orfs. Basically, an ORF is assigned the "overlap" types if it both overlaps any UTR and any CDS region. It should check that the overlaps are for the same transcript. You might still be able to build a set of tuples, or a data frame, and use efficient intersections or joins for that.

While this example probably wouldn't have this issue, one thing to keep in mind is that if transcript 1 is translated, then this ORF will (or at least can) show translational activity, too, simply because it is a part of transcript 1. Thus, we have to be careful, or we may think there is something interesting upstream of transcript 2, when it's really just a part of normal, boring ole transcript 1. We did run into these kinds of issues in previous iterations of the labeling.

Good luck :)

eboileau commented 6 years ago

There are actually more interesting cases due to the way overlap types were assigned in the UTRs. For instance, many ORFs were assigned the label five_prime_overlap or three_prime_overlap, whereas they are in fact noncoding! The reason for this is as follows: because we would not assign them the five/three_prime_overlap label anymore, these ORFs/exons would stay in the game, but there is no guarantee that they actually would receive the five/three_prime label, since they might not be entirely covered by the matching UTR (when looking at overlaps, we do not specify any restrictions, but when looking at the UTRs, we required that min_b_overlap=1), so they might as well stay in the game... and it looks like that they are actually and correctly mapped to a non-coding transcript!

bmmalone commented 6 years ago

Ah, I see what you mean.

My main motivation was to avoid calling something five_prime or three_prime unless it really only, and completely, occurred in the respective UTR region since "we" were most excited by those label types. (I'm not sure that was accomplished, but that was at least the main focus.)

However, even when using only ensembl annotations (no additional refseq, de novo assembly, etc.), there were still a ton of corner cases. Thanks, diversity-of-life... ;)

P.S. FWIW, we also checked prediction annotations from ribotaper, etc., and they exhibit similar problems.

eboileau commented 6 years ago

Ok... credit to Brandon... it's actually more complex than I thought...

After relabeling, I still found some mislabeled ORFs! The problem is still how to treat the upstream/downstream overlap labels five/three_prime_overlap. Although we now require that the overlap is within the same transcript, we should in fact check that the match occurs entirely within the union of both canonical_orf_exons and three/five_prime_exons, i.e. it must at least respect the transcript structure.

This is one example of a predicted upstream ORF (green), wrt the transcript shown by the red arrow. It does not even have the same structure... but because it overlaps both its CDS and UTR, it was labeled as five_prime_overlap. Note that it cannot either be five_prime_overlap wrt any of the transcript variants. Given the annotations, it should have mapped to the non-coding transcript shown by the black arrow.

overlap_problem

I think I can look for all transcript overlaps that are fully contained, and add an extra condition. This would require creating the annotated_exons in all cases (not only with de-novo), and then an extra call to get_bed_overlaps.

eboileau commented 6 years ago

This is a clear example of mislabeled five_prime_overlap which should map to the retained intron (red) annotated as non-coding.

overlap_problem2

In all cases, I have addressed these issues and remapped the labels for the data I'm currently analysing; but I do not yet address cases such as the first example (i.e. genuine cases where we need to make a choice as to whether the ORF is upstream or downstream, etc.). Besides, this still needs to be implemented in Rp-Bp, so I will leave this issue open meanwhile.

Note This works for 'annotated ORFs', but will not work for a de-novo assembly, since all five/three_prime_overlap candidates would be filtered out if we require that they are contained within the transcript structure. I guess we'll need to make a design choice here, justified by biology. Another thing I've noticed is that we also get a few, not many, novel_within, or novel_five_prime, but this should not be the case in principle, since these ORFs should have been filtered out using the --filter option...

bmmalone commented 6 years ago

This is not really the type of thing I like getting credit for... ;)

eboileau commented 6 years ago

The code should now deal with all these cases.

The label overlap was added to avoid assigning a preference to these ORFs that overlap both the 5'UTR and 3'UTR of the same transcript. All other ambiguous cases described above are dealt with in a manner that is respecting the initial design choice, except for a few things, in particular the out-of-frame within ORFs. I don't think there are many such cases (generally if using a standard set of annotations), but this is one example.

within

The within ORFs are selected from truncated matches, if they don't match a stop codon from an annotated ORF. However this results in a few corner cases (such as the cyan ORF, associated with ENSMUST00000161892) labeled as within, whereas they are clearly not. I've tried labeling within after overlap ORFs, but this only shifts the problem to overlap. One way to avoid this would be to take into account the intron structure, using bed overlaps...

For all other ambiguous overlap, in a standard assembly, we also require an extra call to get_bed_overlaps, to make sure all overlap matches are fully contained within the transcript structure. For de novo, we cannot enforce this. Now, all novel overlapping ORFs (including within, i.e. cases where the ORF extends inside, but is contained within the ends of the transcript) are now labeled as overlap, without any distinction whether they are upstream or downstream, overlaping the annotated coding sequence or just the UTRs, it suffices that they overlap an annotated region. I think this makes more sense, since anyway for these novel ORFs many would extend for large portion of the annotated transcripts.

eboileau commented 5 years ago

The last commit should solve critical labelling issues, but a proper test suite should be implemented, if we want to keep proceeding that way. Taking into account the intron structure, etc. would make more sense.

The labels are currently written to two files, in the original list of ORFs, and in a new orfs-labels file, which also contains duplicate ORF ids for reference (I've decided not to adjust the transcript part of the ORF id, this is a waste of time, and does not solve all mismatches anyway). In the next major release, only the latter will have the labels, keeping somewhat separate the ORF prediction pipeline and this whole labelling business.

The labels have also been simplified to be less restrictive, e.g we do not differentiate between truncated and extended (sometimes can be both, thus in general we label them as canonical variant). Besides, many may just be artifacts (not enough footprint in some part of the ORF, etc.).