lucventurini commented 6 years ago

At the moment Mikado performs a relatively stupid expansion: it checks whether the end of the other transcript is downstream of the one we are observing and, if it is, it simply expands the transcript until that end. However, this means crossing all the potential introns downstream, and therefore results in a transcript with one or more retained introns (by definition).

There are four broad cases that I can think of, all of which we currently treat in the same way. Notation: “A” is the transcript to be expanded, “B” the template. For simplicity, they are both on the “+” strand, and we are expanding their 3’ end. The reasoning would be analogous in any of the other three potential orientations (“+” on the 5’ end, etc.).

1. Exon vs. exon. The last exon of transcript A ends within the last exon of transcript B.
   - Current algorithm: expand until end of B.
   - Proposed change: None.
2. Terminal exon vs. terminal intron. The last exon of transcript A ends within the last intron of transcript B.
   - Current algorithm: expand until end of B.
   - Proposed change: None.
3. Terminal exon vs. non-terminal exon. The last exon of transcript A ends within an internal exon of B.
   - Current algorithm: expand the last exon until end of B.
   - Proposed change: expand the last exon of A until its end is the same as the one of the overlapping exon. Add all the remaining exons of B.
4. Terminal exon vs. non-terminal intron. The last exon of transcript A ends within an internal intron of B.
   - Current algorithm: expand the last exon until end of B.
   - Proposed change: expand the last exon of A until the end of the exon following the intron of B. Then, add the remaining exons to A.
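
The four cases above can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not Mikado's actual code: the function name `expand_three_prime` and the representation of exons as inclusive `(start, end)` tuples on the "+" strand are assumptions made for the example.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # inclusive (start, end), "+" strand (assumed representation)


def expand_three_prime(a_exons: List[Exon], b_exons: List[Exon]) -> List[Exon]:
    """Return the exons of A after expanding its 3' end with B as the template."""
    a_exons, b_exons = sorted(a_exons), sorted(b_exons)
    a_end = a_exons[-1][1]
    if b_exons[-1][1] <= a_end:
        return a_exons  # B does not reach past the end of A: nothing to expand

    # First exon of B that ends at or after the end of A. If A ends inside an
    # exon of B this is that exon; if A ends inside an intron of B this is the
    # exon that follows the intron.
    idx = next(i for i, (_, end) in enumerate(b_exons) if end >= a_end)

    # Stretch the last exon of A up to the end of that exon of B ...
    expanded = a_exons[:-1] + [(a_exons[-1][0], b_exons[idx][1])]
    # ... and append whatever exons of B are left downstream (none in cases 1-2).
    return expanded + b_exons[idx + 1:]


template = [(100, 200), (300, 500), (600, 700), (800, 900)]
```

With this template, a transcript ending at position 450 (inside an internal exon) would inherit the intron chain of B downstream of that exon, instead of becoming one long exon reaching position 900.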

lucventurini commented 6 years ago

The functionality is here; however, I should add some tests to make sure that we have covered at least all the basic cases.

lucventurini commented 6 years ago

Mikado now uses the internal interval tree of exon/intron segments to find all the overlapping segments. This should ensure that we can deal with all cases.
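
To make the idea of the overlap query concrete, here is a minimal sketch. It deliberately uses a plain linear scan as a stand-in for the interval tree, and the helper names `segments` and `overlapping` are hypothetical, not part of Mikado.

```python
from typing import Iterator, List, Tuple

Segment = Tuple[str, int, int]  # ("exon" | "intron", start, end), inclusive


def segments(exons: List[Tuple[int, int]]) -> Iterator[Segment]:
    """Yield the exons of a transcript interleaved with the introns between them."""
    exons = sorted(exons)
    for i, (start, end) in enumerate(exons):
        yield ("exon", start, end)
        if i + 1 < len(exons):
            yield ("intron", end + 1, exons[i + 1][0] - 1)


def overlapping(exons: List[Tuple[int, int]], start: int, end: int) -> List[Segment]:
    """Return every exon/intron segment that overlaps the interval [start, end]."""
    return [seg for seg in segments(exons) if seg[1] <= end and seg[2] >= start]
```

An interval tree answers the same query without scanning every segment, which matters when a locus contains many transcripts.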

lucventurini commented 6 years ago

The procedure is now as follows:

1. After adding all the putative transcript events, check which ones have a score over the threshold and keep only the first N transcripts that pass this requirement (N=max_isoforms).
2. If we had to discard transcripts, recalculate metrics and scores, and recheck.
3. Check for retained introns. Depending on the settings, any transcript with retained introns will either be flagged or removed.
   - If they are removed, recalculate metrics and scores, restart from 1.
4. If we have to execute the padding:
   a. Create a copy of the transcripts in the locus.
   b. Calculate the padded transcripts, keeping track of those we used as templates for the padding.
   c. Recalculate metrics and scores.
   d. Check if we have created non-valid transcripts by padding:
      - If the invalid transcripts have not been used as templates, discard them and continue the procedure.
      - If they have, or we have made the primary transcript invalid: discard the invalid templates and restart.
   e. Using the score to derive the insertion order, check whether the modified transcripts are still valid ASEs.
   f. Check if any of the retained transcripts now has a retained intron.
   g. If step e or f finds invalid transcripts, repeat point d, i.e.:
      - If the invalid transcripts have not been used as templates, discard them and continue the procedure.
      - If they have, or we have made the primary transcript invalid: discard the invalid templates and restart.

This involved procedure has multiple fail-checks and should ensure that:

- the primary transcript will not be made invalid;
- no transcript will stay as an ASE if it becomes an invalid ASE after padding;
- no transcript will be padded according to the structure of a transcript we ended up discarding.
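
The discard-and-restart logic of the padding step above can be sketched as a loop. Everything here is hypothetical: `pad_locus`, the `pad` callback (which returns, for each transcript ID, the ID of the template used to pad it, or `None`) and the `is_valid` callback are placeholders for the real metrics, scoring, ASE and retained-intron checks. The handling of an invalid primary transcript (drop the template it was padded with) is my reading of the procedure, not something confirmed by the code.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Templates = Dict[str, Optional[str]]  # transcript ID -> template ID (None = not padded)


def pad_locus(
    transcripts: Set[str],
    primary: str,
    pad: Callable[[Set[str]], Templates],
    is_valid: Callable[[str, Templates], bool],
) -> Tuple[Set[str], Templates]:
    """Pad a locus, discarding invalid transcripts until the result is stable.

    Assumes that `pad` only uses transcripts in the set it receives as templates.
    """
    kept = set(transcripts)
    while True:
        templates = pad(kept)  # conceptually works on a copy of the transcripts
        invalid = {tid for tid in kept if not is_valid(tid, templates)}
        used = {tpl for tpl in templates.values() if tpl is not None}

        # Invalid transcripts that served as templates must go, and so must
        # the template that made the primary transcript invalid (assumption).
        bad_templates = (invalid & used) - {primary}
        if primary in invalid:
            culprit = templates.get(primary)
            if culprit is None:
                raise ValueError("primary transcript is invalid without padding")
            bad_templates.add(culprit)

        if bad_templates:
            kept -= bad_templates
            continue  # restart the padding from the unpadded transcripts

        # Whatever is still invalid was never used as a template: drop and stop.
        kept -= invalid
        return kept, {tid: templates[tid] for tid in kept}
```

Each restart removes at least one transcript from the locus, so the loop terminates as long as `pad` behaves as assumed.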

lucventurini commented 5 years ago

Solved after confirmation by @swarbred

lucventurini commented 5 years ago

As noted by @gemygk and @swarbred:

"ts_max_distance" should refer to the cDNA distance, not the genomic distance.
"Reference" transcripts are not considered in any particular way for the AS machinery. That means that they will always pass the requirements checks, but they might not be passing the requirements for ASEs. This should be made clear in the documentation.

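The difference between the two distances can be large when the template has long introns downstream. The sketch below is only meant to illustrate the distinction; the function names and the inclusive `(start, end)` exon tuples are assumptions, and this is not how Mikado computes the value internally.

```python
from typing import List, Tuple


def genomic_distance(a_end: int, b_exons: List[Tuple[int, int]]) -> int:
    """Bases between the end of A and the end of template B, introns included."""
    return max(0, max(end for _, end in b_exons) - a_end)


def cdna_distance(a_end: int, b_exons: List[Tuple[int, int]]) -> int:
    """Exonic bases of template B that lie downstream of the end of A."""
    return sum(max(0, end - max(start, a_end + 1) + 1) for start, end in b_exons)


template = [(100, 200), (300, 500), (600, 700), (1800, 1900)]
```

For a transcript ending at position 450, `genomic_distance(450, template)` counts the long final intron of the template, whereas `cdna_distance(450, template)` counts only the bases that would actually be added to the transcript.
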
lucventurini commented 5 years ago

As an addendum, transcripts should not be expanded if the boundary of the expandable transcript ends within an intron. In these cases both expansion options (i.e. creating a false intron or creating a massive exon) are undesirable. So we should disable this.
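
A check of this kind could look like the following sketch, assuming inclusive `(start, end)` exon tuples; `ends_within_intron` is a hypothetical helper name, not an existing Mikado function.

```python
from typing import List, Tuple


def ends_within_intron(a_end: int, b_exons: List[Tuple[int, int]]) -> bool:
    """True if position a_end falls strictly inside an intron of template B."""
    b_exons = sorted(b_exons)
    return any(
        prev_end < a_end < next_start
        for (_, prev_end), (next_start, _) in zip(b_exons, b_exons[1:])
    )
```

A transcript for which this returns `True` would simply not be expanded with that template.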

lucventurini commented 5 years ago

@swarbred @gemygk

Refining the padding: a complex case

I am revising the algorithm for the padding. I have already added the part that will make Mikado aware of where a transcript ends (see 0b64818f615efe5562f93dac6cda31962c9298c1). The problem is that there are ambiguous cases that need to be handled in a deterministic manner. Specifically:

t1:  |===|-----|====|--|====|----|====|
t2:  |===|-------------|====|----|=====|--------|==|
t3:  |===|-------------|====|----|=========|---------|====|
t4:  |===|-------------|====|----|=======|
t5:  |===|-------------|====|----|==|--------|===|
t6:  |===|--------|=======|----------------|====|

In this case:

T1 is the only expandable transcript: all the others are mutually incompatible
T1 is not compatible with T6 (it would mean extending an exon which is completely internal to an intron of the template)
T1 is fully compatible with T2, T3, T4
T1 might be compatible with T5. This depends on how we feel about adding an intron retention event (the last exon of T1 starts within the second-to-last exon of T5 and terminates within the last intron). How do we feel about this?
T2, T3, T4, T5 and T6 are all mutually incompatible in terms of expansion. T1 can be expanded according to the template of one and only one of the other transcripts.

Shifting to directional graphs

The way to break the conundrum:

store the relationship between the paddable transcripts in a directional graph.
store not only the direction (e.g. T1 could be expanded to T2) but also the distance that would need to be filled. This should take into account both the number of introns and the genomic distance.

So in our example the best choices would probably be, in order:

T4 (long exon elongation, no additional splicing)
T2 (short exon elongation, additional splicing event)
T3 (long exon elongation, additional splicing event)
T5 (?; long exon/intron extension).

The final algorithm should therefore:

link together T1 to all the valid alternatives
recognise T2, T3, T4 (,T5?) as multiple and incompatible "end points" of the path
prioritise each of the links according to the distance metric
choose one of the extensions, discard the rest
potentially we might want to backtrack if the extension becomes invalid. This however would further complicate the algorithm and require more development time.
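
As a rough sketch of the idea, the directed links can be stored in a plain dictionary, with each edge carrying the number of additional introns and the genomic distance to fill. The edge weights below are invented purely for illustration (they are not measured from the diagram above), `rank_templates` is a hypothetical helper, and T5 is left out because its compatibility is still an open question.

```python
from typing import Dict, List, Tuple

# graph[source][template] = (additional introns, genomic distance to fill)
Graph = Dict[str, Dict[str, Tuple[int, int]]]


def rank_templates(graph: Graph, source: str) -> List[str]:
    """Order the templates reachable from `source`, best candidate first.

    Fewer additional introns wins; the genomic distance breaks ties.
    """
    edges = graph.get(source, {})
    return sorted(edges, key=lambda template: edges[template])


# Directed edges only: t1 can be expanded to t2/t3/t4, never the reverse.
# t6 is not linked at all because it is incompatible with t1.
graph: Graph = {"t1": {"t2": (1, 1100), "t3": (1, 1900), "t4": (0, 300)}}

best = rank_templates(graph, "t1")[0]  # the single extension we would keep
```

With these made-up weights the ranking reproduces the order suggested above (T4, then T2, then T3); a different distance metric would only change the sort key.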

swarbred commented 5 years ago

@lucventurini the alternative would be that, where you have multiple compatible transcripts which could be used for extension, you first check and eliminate the options that would not meet the ts_max_distance and ts_max_splices requirements and then, of the remaining ones, take the highest scoring transcript. If there is a tie I would be fine with any way of splitting this.

If I was manually annotating your example I would merge t1 into the "best" of the alternative compatible models, i.e. the one which gave the longest CDS or had the most support from evidence. Merging into the highest scoring compatible transcript would probably most closely reproduce my choice.
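
A minimal sketch of this selection rule follows. The `Candidate` fields, the `pick_template` name and the tie-break on transcript ID are assumptions made for the example; only the two parameter names, ts_max_distance and ts_max_splices, come from the discussion.

```python
from typing import Dict, NamedTuple, Optional


class Candidate(NamedTuple):
    distance: int  # bases that would be added to the expanded transcript
    splices: int   # splice sites that would be added
    score: float   # score of the template transcript


def pick_template(
    candidates: Dict[str, Candidate],
    ts_max_distance: int,
    ts_max_splices: int,
) -> Optional[str]:
    """Drop templates exceeding either limit, then take the highest scoring one."""
    allowed = {
        tid: cand
        for tid, cand in candidates.items()
        if cand.distance <= ts_max_distance and cand.splices <= ts_max_splices
    }
    if not allowed:
        return None
    # Highest score first; ties are split by transcript ID only for determinism.
    return min(allowed, key=lambda tid: (-allowed[tid].score, tid))


candidates = {
    "t2": Candidate(distance=400, splices=2, score=15.0),
    "t3": Candidate(distance=900, splices=2, score=18.0),
    "t4": Candidate(distance=150, splices=0, score=12.0),
}
```

Any deterministic tie-break would do here; the point is only that the same input always yields the same template.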

How are we currently dealing with this? What you are suggesting sounds like a substantial change.

lucventurini commented 5 years ago

Hi @swarbred, currently we are dealing with this in a suboptimal way, which basically ends up making a random choice. Moreover, as I was storing only the connection between two transcripts (so t1 <=> t4) and not its direction (e.g. t1 => t4), I ended up with a hodgepodge. This was fine when the relationship was very linear (i.e. only expanding based on genomic coordinates) but it is inefficient and breaks when shifting to the more sophisticated version of padding we are trying to implement.

Your suggestion of using the score of the transcript as our metric is extremely sensible, though. I will implement it as soon as I can.

lucventurini commented 5 years ago

Hi @swarbred, @gemygk, after e1b204d the padding should now be fixed. As written above, ties will now be decided by the scoring. Although I have tried to test this properly within the test suite, the best way will be to try it out on real data.

lucventurini commented 5 years ago

Currently the CDS padding is broken. To be fixed ASAP.

lucventurini commented 5 years ago

Hi @gemygk, @swarbred, @cschu, am I correct in saying that you have not found any new errors in the latest runs? If that is the case, we might close this issue.

lucventurini commented 5 years ago

Fixed as of the current status.

EI-CoreBioinformatics / mikado

Smarter transcript padding #142
