Closed lucventurini closed 5 years ago
The functionality is here, however, I should add some tests to make sure that we have covered at least all the basic cases.
Mikado now uses the internal interval tree of exon/intron segments to find all the overlapping segments. This should ensure that we can deal with all cases.
The procedure is now as follows:
This involved procedure has multiple fail-checks and should ensure that no transcript is modified in a way that:
Solved after confirmation by @swarbred
As noted by @gemygk and @swarbred:
As an addendum, transcripts should not be expanded if the boundary of the expandable transcript ends within a intron. In these cases both expansion options (ie creating a false intron or creating a massive exon) are non-desirable. So we should disable this.
@swarbred @gemygk
I am revising the algorithm for the padding. I have already added the part that will make aware Mikado of where a transcript ends (see 0b64818f615efe5562f93dac6cda31962c9298c1). The problem is that there ambiguous cases that need to be handled in a deterministic manner. Specifically:
t1: |===|-----|====|--|====|----|====|
t2: |===|-------------|====|----|=====|--------|==|
t3: |===|-------------|====|----|=========|---------|====|
t4: |===|-------------|====|----|=======|
t5: |===|-------------|====|----|==|--------|===|
t6: |===|--------|=======|----------------|====|
In this case:
The way to break the conundrum:
So in our example the best choices would probably be, in order:
The final algorithm should therefore:
@lucventurini the alternative would be that where you have multiple compatible transcripts which could be used for extension that first you check and eliminate options that would not meet ts_max_distance and ts_max_splices requirements and then of the remaining take the highest scoring transcript, if there is a tie I would be fine with any way of splitting this.
If I was manually annotating your example I would merge t1 into the "best" of the alternative compatible models i.e. which gave the longest CDS or had the most support from evidence. merging into the highest scoring compatible transcript would probably most closely reproduce my choice.
How are we currently dealing with this ? It sounds like a substantial change what you are suggesting.
Hi @swarbred, currently we are dealing with this in a way which is suboptimal, which basically ended up having a random choice. Moreover, as I was storing only the connection between two transcripts (so t1 <=> t4, not the direction, e.g. t1 => t4) I ended up having a hodgepodge. This was fine when the relationship was very linear (ie only expanding based on genomic coordinates) but was inefficient and breaks when shifting to the more sophisticated version of padding we are trying to implement.
Your suggestion of using the score of the transcript as our metric is extremely sensible, though, I will implement it as soon as I can.
Hi @swarbred , @gemygk , after e1b204d, now the padding should be fixed. As written above, now ties will be decided by the scoring. Although I have tried to test properly within the test suite, the best way will be to try it out on real data.
Currently the CDS padding is broken. To be fixed ASAP.
Hi @gemygk, @swarbred, @cschu, am I correct in saying that you have not found any new errors in the latest runs? if that is the case, we might close this issue.
Fixed as the current status.
At the moment Mikado performs a relatively stupid expansion: it checks whether the end of the other transcript is downstream of the one we are observing, and if it is, it will just expand the transcript until the end. However, this means crossing all the potential introns downstream, and therefore, result in a transcript with one or more retained introns (by definition).
There are four broad cases, that I can think of, and that we are treating in the same way. Notation: “A” is the transcript to be expanded, “B” the template. For simplicity, they are both on the “+” strand, and we are expanding their 3’ end. The reasoning would be analogous if we were reasoning in any of the other three potential orientations (“+” on the 5’ end, etc.)