EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data * and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0
98 stars 18 forks source link

Smarter transcript padding #142

Closed lucventurini closed 5 years ago

lucventurini commented 6 years ago

At the moment Mikado performs a relatively stupid expansion: it checks whether the end of the other transcript is downstream of the one we are observing, and if it is, it will just expand the transcript until the end. However, this means crossing all the potential introns downstream, and therefore, result in a transcript with one or more retained introns (by definition).

There are four broad cases, that I can think of, and that we are treating in the same way. Notation: “A” is the transcript to be expanded, “B” the template. For simplicity, they are both on the “+” strand, and we are expanding their 3’ end. The reasoning would be analogous if we were reasoning in any of the other three potential orientations (“+” on the 5’ end, etc.)

lucventurini commented 6 years ago

The functionality is here, however, I should add some tests to make sure that we have covered at least all the basic cases.

lucventurini commented 6 years ago

Mikado now uses the internal interval tree of exon/intron segments to find all the overlapping segments. This should ensure that we can deal with all cases.

lucventurini commented 6 years ago

The procedure is now as follows:

This involved procedure has multiple fail-checks and should ensure that no transcript is modified in a way that:

lucventurini commented 5 years ago

Solved after confirmation by @swarbred

lucventurini commented 5 years ago

As noted by @gemygk and @swarbred:

lucventurini commented 5 years ago

As an addendum, transcripts should not be expanded if the boundary of the expandable transcript ends within a intron. In these cases both expansion options (ie creating a false intron or creating a massive exon) are non-desirable. So we should disable this.

lucventurini commented 5 years ago

@swarbred @gemygk

Refining the padding: a complex case

I am revising the algorithm for the padding. I have already added the part that will make aware Mikado of where a transcript ends (see 0b64818f615efe5562f93dac6cda31962c9298c1). The problem is that there ambiguous cases that need to be handled in a deterministic manner. Specifically:

t1:  |===|-----|====|--|====|----|====|
t2:  |===|-------------|====|----|=====|--------|==|
t3:  |===|-------------|====|----|=========|---------|====|
t4:  |===|-------------|====|----|=======|
t5:  |===|-------------|====|----|==|--------|===|
t6:  |===|--------|=======|----------------|====|

In this case:

Shifting to directional graphs

The way to break the conundrum:

So in our example the best choices would probably be, in order:

The final algorithm should therefore:

swarbred commented 5 years ago

@lucventurini the alternative would be that where you have multiple compatible transcripts which could be used for extension that first you check and eliminate options that would not meet ts_max_distance and ts_max_splices requirements and then of the remaining take the highest scoring transcript, if there is a tie I would be fine with any way of splitting this.

If I was manually annotating your example I would merge t1 into the "best" of the alternative compatible models i.e. which gave the longest CDS or had the most support from evidence. merging into the highest scoring compatible transcript would probably most closely reproduce my choice.

How are we currently dealing with this ? It sounds like a substantial change what you are suggesting.

lucventurini commented 5 years ago

Hi @swarbred, currently we are dealing with this in a way which is suboptimal, which basically ended up having a random choice. Moreover, as I was storing only the connection between two transcripts (so t1 <=> t4, not the direction, e.g. t1 => t4) I ended up having a hodgepodge. This was fine when the relationship was very linear (ie only expanding based on genomic coordinates) but was inefficient and breaks when shifting to the more sophisticated version of padding we are trying to implement.

Your suggestion of using the score of the transcript as our metric is extremely sensible, though, I will implement it as soon as I can.

lucventurini commented 5 years ago

Hi @swarbred , @gemygk , after e1b204d, now the padding should be fixed. As written above, now ties will be decided by the scoring. Although I have tried to test properly within the test suite, the best way will be to try it out on real data.

lucventurini commented 5 years ago

Currently the CDS padding is broken. To be fixed ASAP.

lucventurini commented 5 years ago

Hi @gemygk, @swarbred, @cschu, am I correct in saying that you have not found any new errors in the latest runs? if that is the case, we might close this issue.

lucventurini commented 5 years ago

Fixed as the current status.