Potential retained intron bug RC6

swarbred commented 4 years ago

I'm seeing a difference between mikado-2.0rc4 and mikado-2.0rc6_* versions in relation to calling and exclusion of transcripts with retained introns.

I'm attaching a screenshot of this region http://apollo.tgac.ac.uk/Myzus_persicae_O_v2_genome_browser/jbrowse/?loc=scaffold_1%3A58658101..58667080&tracks=DNA%2CAnnotations%2CMikado_annotation_run6_classification%2CMikado_integration_run4%2CScallop_lncRNA%2CStringtie_lncRNA%2CYa_locus&highlight=

The input models are shown in track ..run6_classification the output models of running mikado version mikado-2.0rc6_3f62484_CBG are shown in track mikado integration run 4

Two models are excluded from the original input mikado.scaffold_1G6704.4 (correctly as a retained intron transcript) and mikado.scaffold_1G6704.3

This mikado.scaffold_1G6704.3 model I dont think should be viewed as a retained intron transcript and running previous versions mikado-2.0_rc1 and mikado-2.0rc4 on the same input models and config gives mikado.scaffold_1G6704.3 in the output.

For my own knowledge Luca can you confirm that for the retained intron check the order (i.e. the relative scoring) of the transcripts matters i.e. each potential alt splice model is assessed against the primary model and the other models currently added to the locus. So if a transcript is the second highest scoring it might not be regarded as having a retained intron relative to the primary model but if the same transcript scored lower i.e. other transcripts were added before the retained intron check was made then potentially against these it now may have a retained intron and be excluded.

Correct version mikado-2.0rc4

sbatch -p ei-cb -c 1 --mem 20G -o out_mikado.serialise-and-pick.run4.%j.log -J Ov2_Mikado_SP --wrap "source mikado-2.0rc4 && /usr/bin/time -v mikado pick --mode nosplit --seed 10 --procs 1 --json-conf mikado.configuration.integration.run4_scaffold_1_58658101_58667080.yaml --subloci_out mikado.subloci.gff3 --monoloci_out mikado.monoloci.gff3 --output-dir ./integration_run4_dstest10 -lv DEBUG"

output directory

/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest10

Incorrect? version

sbatch -p ei-cb -c 1 --mem 20G -o out_mikado.serialise-and-pick.run4.%j.log -J Ov2_Mikado_SP --wrap "source mikado-2.0rc6_3f62484_CBG && /usr/bin/time -v mikado pick --mode nosplit --seed 10 --procs 1 --json-conf mikado.configuration.integration.run4_scaffold_1_58658101_58667080.yaml --subloci-out mikado.subloci.gff3 --monoloci-out mikado.monoloci.gff3 --output-dir ./integration_run4_dstest8 -lv DEBUG"

output directory /tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest8

lucventurini commented 4 years ago

Hi David, the "find_retained_introns" method has not changed between the two releases. Another issue: the wrong model should be mikado.scaffold_1G6704.4|mikado_all_mikado.scaffold_1G8446.3.

However, I cannot find it among in the mikado.loci.gff3 in neither folder you indicated:

Supposedly mikado v. 2.0rc4: /tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest10

This should have 3 output models, but I actually see 4. Extra model: Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.5

Supposedly mikado v. 2.0rc6:

/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest8

This should have 5 output models, but I see 3. That is, I see the three models that in Apollo are assigned to 2.0rc4.

Would you please be able to confirm the folder paths?

swarbred commented 4 years ago

hi @lucventurini

Paths and versions used look correct to me.

directory integration_run4_dstest10 - mikado-2.0rc4

correctly outputs 4 models: alias=Mikado_annotation_run6_ptn_coding_HC_repTrue_mikado.scaffold_1G6704.1 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.2 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.3 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.5

with Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.4 correctly excluded as it contains a retained intron

directory integration_run4_dstest8 - mikado-2.0rc6_3f62484_CBG outputs alias=Mikado_annotation_run6_ptn_coding_HC_repTrue_mikado.scaffold_1G6704.1 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.2 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.5

so the difference is alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.3 being excluded in the mikado-2.0rc6_3f62484_CBG run.

This model being excluded seems to be an error, if you change to keep_retained_introns: true then this model will be returned and you can see in the log that it's being flagged as a retained intron transcript.

Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.3 has 1 retained introns ([(58664468, 58665080)])

lucventurini commented 4 years ago

Hi @swarbred , thank you for the clarification, I had misunderstood the problem.

I will have a look at what is causing the change in behaviour and report back.

lucventurini commented 4 years ago

Hi @swarbred, @gemygk,

I traced the cause of the disparity. It is explained explicitly here:

Fixed in 443e9e2. Now Mikado will do the following:

if a transcript is non-coding, consider all its exons.

If a transcript is coding: -- if the exon is a complete (ie fully UTR) 3' UTR exon, do not consider it further. -- If the exon is completely coding, do not consinder it further. -- if a transcript is partially coding: --- if it is in the 5' UTR, consider all the 5'UTR (to detect a delayed Met due to an upstream retained intron) --- else, only consider the boundary between the coding and non-coding areas. These changes ensure that only transcripts where there is a reasonable suspicion that either the Met or the stop codon are misplaced will be marked as having a retained intron.

Please note the bold part.

I do think that these changes are correct, and I remember us discussing them at some point. However, I also understand that in this specific case, where the Met start did not change due to the retained intron, you might see this result as incorrect.

What should the behaviour be in instances like this?

swarbred commented 4 years ago

HI @lucventurini @gemygk

Strictly I wouldn't view the original transcript i queried as having a retained intron (green box below)

Screen Shot 2020-01-06 at 14.38.21.pdf

In this case the model contains a unique splice site, a retained intron should extend through the entire intron sequence of another model. We might have loosened this definition in the original logic as in RNA-seq assemblies we will have models where the terminal extends into the intron and we might just view the model as being a fragment and still want to flag these as retained (see diagram below).

I'm not sure if I follow the logic you have described above, it would seem to suggest that the example in the green box (i.e. a complete UTR exon) isn't a retained intron but in this case the model was identified as such by mikado-2.0rc6_3f62484_CBG

I've since run mikado-2.0rc1 and mikado-2.0rc6_3f62484_CBG on the same data and have this example

Screen Shot 2020-01-14 at 17 08 39

The model selected in red was marked as NOT having a retained intron in mikado-2.0rc6_3f62484_CBG but was in mikado-2.0rc1 . I would view mikado-2.0rc1 as correct here.

How would the current logic deal with the following examples, which would be flagged as having a retained intron.

20200114_165214

I would say #4 and #6 have a retained intron and potentially #2 and #3 if you make a special case for terminal exons (not sure how either the new or old logic would handle these).

lucventurini commented 4 years ago

Hi @swarbred , yes I see what you are saying. Regarding the example we are discussing (hence the green box in the PDF) the exon gets flagged as retained because:

1- it is completely non-coding and in the 5' UTR 2- it has at least some part which falls within another intron (again in the 5' UTR)

In the second example you are flagging, the exon does not get remarked as a retained intron because: 1- the exon is only partially non-coding 2- the CDS start is outside of the intron itself.

The aim, which is proving quite hard to implement, is straight-forward: mark as retained introns only those retentions that could affect the CDS for coding transcripts. This is because Mikado should consider retained introns as legitimate ASs unless they disrupt the coding region.

Hence why probably in both cases above Mikado should not consider either, I think, as retained introns (ie cause for exclusion).

Coming to your diagram (but we would have to verify the behaviour of Mikado with a mock-up):

2 is a retained intron (new stop within the CDS intron of #1)
3 is a retained intron (new start within the CDS intron of #1)
4 is a retained intron (new stop within the intron of #1 )
5 is not clear to me, is its 5'most splicing site meant to be within the first intron of #1?
6 is not a retained intron, the new stop exists because of a novel exon - not because the pre-existing exon "bleeds" into the intron.

lucventurini commented 4 years ago

PS: the function is defined here. It has two steps:

determine whether the exon is a candidate as retained intron. This is implemented here. Steps:
1. Check if the exon is coding and whether it is at the 3' end.
  - if the exon is a completely non-coding 3'UTR, exclude from consideration.
  - if it is a completely coding non-terminal exon, exclude from consideration.
2. Extract the fragments of the exon to be checked:
  - if the exon is a terminal completely coding exon, return it wholesale
  - if the exon is completely non-coding, return the whole exon
  - otherwise, one or more fragments of the exon will be taken into consideration
if the exon is a possible retained intron, analyse the exon and the fragments determined above, with a separate function.
1. find all introns that are overlapped by the candidate exon. If none is found, exclude from consideration and continue.
2. for each intron:
  1. determine whether the candidate exon overlaps either exon of the intron. If the candidate exon is completely contained within the intron, exclude from consideration and continue.
  2. if the candidate intersects only one of the two boundaring exons, mark it as retained only if it is a terminal exon.
  3. else, consider it as a retained intron, unless the CDS completely spans the intron.

The logic, again, is to mark as a "retained intron" any exon for which we have evidence that the CDS got interrupted because of not having respected the intron boundaries - or, alternatively, because it is a retained intron in the 5' UTR.

lucventurini commented 4 years ago

Hi @swabred, following the discussion today, it might be a good idea to find a good set of papers laying down widely accepted definitions of "retained intron". I have found this preliminary panel of articles to read:

Please add any other you think relevant!

swarbred commented 4 years ago

HI @lucventurini

There is 1) what's the definition of intron retention event and 2) how we want to apply this in Mikado the two probably are not the same.

On my hand diagram I think only 4 and 6 represent what would be regarded as strictly intron retention. Number 5 would be an alternative donor splice site, number 2 could be an alternative Polyadenylation site (if you take the transcript as being full-length), number 3 would be an alternative transcription start site (again if you take the transcript as full-length).

So if we apply what I believe is the general use of retained intron we would filter 4 and 6 (point me to a source if you disagree).

However the original intention was for this to filter out transcripts that were common to RNA-seq assemblies which we viewed as undesirable, this would have included 2 and 3 as we viewed these as fragments rather than full-length transcripts i.e. we have only a partial assembly and the true start and end would give the transcript a structure with a full intron retention event. If the transcript was known to be a full-length transcript then I would view these two examples differently.

I think this also ties in with the CDS /cDNA overlap check we apply, I believe we may have changed this (but correct me if i'm wrong), lets assume the cds overlap is applied and set to 80% I think originally examples like number 2 would have met this threshold as greater than 80% of the shorter transcript (2) is contained in the larger number 1 model, I think we switched how this is handled and now number 2 would be filtered out as the cds overlap is relative to the longer transcript number 1.

If this overlap logic is indeed changed then it might mean that the "keep retained introns" flag isn't as relevant for identifying and filtering some types of transcripts.

The original purpose was (from my perspective) was to filter out undesirable transcripts that we were not pulling out through other filters we apply as part of the alt splice checks i.e like cases where the penultimate exon is retained (2 and 3 below) as such transcript will be fairly common in RNA-Seq assemblies and are likely not filtered out by cds/cdna overlap or UTR requirements. Hence the focus on intron retention in CDS exons.

[]--[]--[]--[] []--[]--[_] []--[]--[____]

I'm not clear why we make a distinction between 5' and 3' UTR exons for purposes of retained introns other than you have more 5' UTR exons than 3' UTR and they wont affect the CDS, the green box example is not something that I think we intended to exclude (in that same snapshot the red box transcript is a genuine retained intron). This part

"if the candidate intersects only one of the two boundaring exons, mark it as retained only if it is a terminal exon."

is to catch number 2 in my example not the green box example where the exon while terminal splices on to another exon. Also if I read the logic correct if in the green box example we added another 5' exon (the same as the top transcript in the pic) then this would then NOT be classed as retained as it wouldn't be a terminal exon.

If the logic was to ignore entirely coding exons, and then check partial and UTR exons to see if they overlap introns, for those that do then for the:

internal exons they would need to span an entire intron and share the same exon1 start and exon2 end of the introns two flanking exons

for terminal exon they would need to only share the same internal exon start coordinate with the exon of the overlapping intron to be regarded as meeting the retained intron requirement.

Another set of examples as this is not easy to follow, those marked with a cross is what I would want this filter to exclude.

What do you think?

20200115_174504

lucventurini commented 4 years ago

I think my only disagreement with your figure is case 9: I would exclude it, as the stop is clearly an NMD due to intron retention, and seems to be the non-fragmentary version of 8. Thoughts?

In practice I was trying to use the retained intron flag in Mikado to do something different from what you were describing - ie identify and filter out NMD induced by the retained introns, as well as transcripts where the atg was mangled by a retained intron in the 5' UTR. Importantly, this means also excluding full length transcripts that are actually real (if transient), not just artefacts coming from the sequencing or assembly protocols.

lucventurini commented 4 years ago

@swarbred Rereading your comment above I can see why you would class no. 9 as a non retained intron. From what I understand your logic could be formalised as follows:

if an exon is not terminal, class it as a retained intron if and only if it starts at an exon A of another transcript and ends at the successive exon B of the same transcript. Exception if the CDS completely spans the intron.
if an exon is terminal, consider it as a retained intron if and only if it starts at the exon of another transcript, it bleeds into the intron, and the CDS is interrupted within the intron.

So in practice we would be adding the additional constraint that a retained intron event must have at the 5' the same exon start as the template, correct transcript. I'm not clear on the 3' end. My questions on this would be:

what happens if the exon is not terminal but ends up spanning the whole intron, ending in exon B of the template, but not at the boundary of B? Or even spans more than one intron/exon? I'm thinking about the completely immature RNA here, with most of the introns still unspliced.
The 5' end is always harder to call precisely. What happens with case no. 6 or a hypothetical no. 6" where the starting e on begins within the first exon of the template but not at the exact boundary?

swarbred commented 4 years ago

if and only if it starts at an exon A of another transcript and ends at the successive exon B of the same transcript

or exon C, D in the case it retains multiple introns

So in practice we would be adding the additional constraint that a retained intron event must have at the 5' the same exon start as the template, correct transcript. I'm not clear on the 3' end. My questions on this would be:

for the terminal exons what matters is that the internal splice site of the exon matches the reference

what happens if the exon is not terminal but ends up spanning the whole intron, ending in exon B of the template, but not at the boundary of B? Or even spans more than one intron/exon? I'm thinking about the completely immature RNA here, with most of the introns still unspliced.

If it spans the whole intron but ends within exon B then this is an alternative splice site and not a retained intron. If it spans multiple introns but still ends on an exon boundary then it's a retained intron.

The 5' end is always harder to call precisely. What happens with case no. 6 or a hypothetical no. 6" where the starting e on begins within the first exon of the template but not at the exact boundary?

Based on what i described number 6 should have been marked as a cross as it's a terminal exon and it's internal splice site is shared with the reference transcript and it's internal exon is within the intron (exon would also meet the criteria( of the reference transcript.

where the starting e on begins within the first exon of the template but not at the exact boundary?

Based on what i wrote even landing in the first exon would meet the criteria. However in practice from RNA-Seq assemblies you will likely see less examples of these most will end in the intron as the coverage drops.

Number 5,6 and 8 are challenging as if these are full length transcripts then they are not retained intron events.

I change my mind on this the more i think about it. To be honest how this filter is named is partly the issue as it suggests the intention is to filter out all retained introns which it doesn't do and wasn't the original intention. I think the original intention was to mainly catch number 5 and 8 as given the data is from RNA assemblies we take these as being fragments and they are both cases where the final CDS exon bleeds into the intron, when this occures towards the 3' end of the transcript none of the other filters we regularly use will catch this.

I guess what I don't like about the current logic is the 5' and 3' UTR distinction, i.e. that we are removing transcripts with 5' UTR retained introns but allowing in 3' UTR retained introns, that seems like an inconsistency.

I think we can go two ways either the logic as above which catches 3,5,6 and 8 (basically catching all retained introns except those that code through an intron)

or

a logic which only catches 5 and 8 i.e. cares only about exons that are partial CDS and UTR

In both cases the original green box example is NOT a retained intron transcript

What do you think?

The second option would be to limit the number of models with CDS that originate from intron sequence, simply because from RNA-seq data you can potentially construct many models like this and most (though not all) are likely artefacts or not functional. We would be viewing these as problematic simply because the intron retention affects the CDS and we don't want to pollute the protein sequence repositories.

The first option provides a specific way of removing transcripts with intron retentions.

Their is an argument for both approaches. I'm going to annoy you :-) but my preference (as of now) would be to have two distinct options

Keep_retained_partial_coding default false Keep_retained_exons default true

But i'm open to suggestions.

lucventurini commented 4 years ago

Hi @swarbred , admittedly with this filter I would be mainly interested in catching cases such as no 5 and 8, where the retained intron event is clearly a probable cause of NMD.

I would also be interested in catching no.3, for a simple reason - it might still be non-functional. Imagine there is an ATG in the retained intron at the 5'UTR, leading to a short protein (let's say ~30AA). By default, Mikado would not load this secondary ORF onto the transcript (to avoid splitting it). However, the ribosome is not as smart, and might end up starting to create this protein, only to end up releasing the transcript soon after ... and therefore not translating the correct ORF. If we therefore kept this example in, it would be an immature splicing event that we should have excluded.

So I guess that my hunch would be for a logic that excludes 3, 5, 8 for sure, but might (or might not) exclude no. 6.

Admittedly I would keep this simple and with a single flag :-P But I agree that we should change its name to make sure that it is clear to users (including us!) exactly what we are trying to catch with this control.

lucventurini commented 4 years ago

PS: I will create a series of unit tests replicating the figure above, in order to make sure that the modifications will do what we want.

swarbred commented 4 years ago

I would also be interested in catching no.3, for a simple reason - it might still be non-functional. Imagine there is an ATG in the retained intron at the 5'UTR, leading to a short protein

If there was an upstream ORF then this could make the transcript a target for NMD but I don't think that something we want to get into here, you could make a similar argument for any transcript e.g. number 2. I wouldn't make a special case for 5UTR retained introns e.g. number 3.

Number 6 is only different to 8 if you take both as full-length transcripts, in all likelihood if you looked across RNA-Seq assemblies generated from multiple samples you will find many transcripts with the same structure but alternative 5' starts i.e. potentially positions intermediate to 6 and 8, so it might seem a bit strange to filter out just those where the first exon is longer (i.e. removing 8 not 6). That said while pacbio and nanopore don't necessarily give full length transcripts the exact same transcript 6 from this data and I'm more inclined to retain it as an alternative TSS. So I could accept 6 (i.e. non coding exons with the outer edge ending in the intron) being treated differently to 8.

Though would the padding just extend 6 to be like 8 anyway?

lucventurini commented 4 years ago

If there was an upstream ORF then this could make the transcript a target for NMD but I don't think that something we want to get into here, you could make a similar argument for any transcript e.g. number 2. I wouldn't make a special case for 5UTR retained introns e.g. number 3.

OK, I see your case. I think it might let through some subtly erroneous transcripts, but I can live with that.

Number 6 is only different to 8 if you take both as full-length transcripts, in all likelihood if you looked across RNA-Seq assemblies generated from multiple samples you will find many transcripts with the same structure but alternative 5' starts i.e. potentially positions intermediate to 6 and 8, so it might seem a bit strange to filter out just those where the first exon is longer (i.e. removing 8 not 6). That said while pacbio and nanopore don't necessarily give full length transcripts the exact same transcript 6 from this data and I'm more inclined to retain it as an alternative TSS. So I could accept 6 (i.e. non coding exons with the outer edge ending in the intron) being treated differently to 8.

I agree, I would keep no. 6 in, and you answered my doubts about the slight modification.

Though would the padding just extend 6 to be like 8 anyway?

No, we explicitly disallow the padding when the transcript ends within an intron (see here). This is exactly to avoid introducing mistakes like turning case 6 into case 8.

See next comment for summary

lucventurini commented 4 years ago

Their is an argument for both approaches. I'm going to annoy you :-) but my preference (as of now) would be to have two distinct options Keep_retained_partial_coding default false Keep_retained_exons default true

I think that basically this is right, although it does require some coding and the addition of another transcript metric ...

So we would have the following:

"retained_intron" metric: this is strictly about the definition of retained introns. So no. 3, 5, 6, and 8 would all fall into this. This is the current metric, and it would strictly remain numeric.
"cds_disrupted_by_ri" metric (or something similar): this is a boolean metric that would check whether the CDS has been disrupted by the retained intron. This would apply both at the 3' and at the 5' - ie premature starts would count as well as premature stops (e.g. in case no. 3 had an ATG inside the intron that caused a completely different, and wrong, protein to be loaded).

Retained introns would be defined as:

(non-coding or partially-coding?) internal exons that completely span one (or more) introns of another transcript and that
- have their splice sites congruent with the exons bordering the candidate intron, if those are internal
- end within the exons of the other transcript, if they are terminal
terminal exons that end within the intron of another transcript and have their internal splice site in common with the bordering exon.

This would classify all of no. 3, 5, 6 and 8 as having retained introns (as this is a structural assessment). Only no. 5 and 8 would be classified as having a CDS disrupted by the RI.

If we are going to code things like this, it might also make sense to flag no. 4 as having a retained intron, although the "cds_disrupted_by_ri" would of course be set to False.

Thoughts?

lucventurini commented 4 years ago

Hi @swarbred , while checking the code I did realise that there is an additional complexity. Let us say that we have the following situation: IMG_20200117_120721

Transcripts 1 and 2 do not have a retained intron. Transcript 3 has a retained intron, BUT its boundaries are one for transcript 2 (5') and the other on transcript 1 (3').

Currently the data structure I use to store the splicing graph in the locus does not remember the association. It would recognise that there are two possible exons as donors for the intron, and two alternatives for the acceptor. Unless I modify the data structure, then, transcript 3 would be classified as having a retained intron - even if that specific arrangement of exons had not been found in any of the reconstructed transcripts.

I think this is the correct behaviour (as there is evidence from other transcripts that both boundaries are correct), but I would like to hear your thoughts about it.

swarbred commented 4 years ago

@lucventurini yes your above example is fine, that's a retained intron.

swarbred commented 4 years ago

(non-coding or partially-coding?) internal exons that completely span one (or more) introns of another transcript and that

Yes both non-coding and partially coding

If we are going to code things like this, it might also make sense to flag no. 4 as having a retained intron, although the "cds_disrupted_by_ri" would of course be set to False.

number 4 is a retained intron but I think we should still exclude this from our definition as the use of this metric would be within the splicing requirements to exclude IR transcripts and dont think we really ever want to exclude 4.

The metric definition should make this clear then people are aware if they use this.

lucventurini commented 4 years ago

(non-coding or partially-coding?) internal exons that completely span one (or more) introns of another transcript and that

Yes both non-coding and partially coding

If we are going to code things like this, it might also make sense to flag no. 4 as having a retained intron, although the "cds_disrupted_by_ri" would of course be set to False.

number 4 is a retained intron but I think we should still exclude this from our definition as the use of this metric would be within the splicing requirements to exclude IR transcripts and dont think we really ever want to exclude 4.

The metric definition should make this clear then people are aware if they use this.

OK, I am starting to code things up. Not sure I will be able to finish the mods before leaving for annual leave on 23rd night, but I will try my best.

lucventurini commented 4 years ago

Hi @swarbred, I am almost finished with the algorithmic modifications. However, there is an important caveat: now we will have two metrics associated with this feature, not one. This will most probably require modifying the scoring files to penalise heavily transcripts with their CDS disrupted and less (or even not at all) retained introns per se. Thoughts and/or ideas on how to go about this?

lucventurini commented 4 years ago

Hi @swarbred , @cschu , @gemygk

As of 9088796, the code in issue-255 should now be functional. I have tested the behaviour of the new definition and it seems to do what we want, according to the definitions above. Specifically:

a retained intron event must either have "terminal" boundaries OR share the internal splices with other transcripts (the two internal boundaries can come from two different transcripts).
"CDS dirsuption" is now its own, boolean metric and it is separate from the simple retained intron
fully coding exons are never marked as retained introns
Removed the "consider_truncated_for_retained" boolean option, now the definition of RI is strict and cannot be changed from outside
Now Mikado has two flags regarding the behaviour in pick, ie, "keep_retained_introns" (default: True) and "keep_cds_disrupted_by_ri" (default: False).
mikado configure and mikado pick can now influence these behaviours through CLI options.

Coding-wise, this issue should be solved. If you could test and verify, we can then push these changes into master. However, importantly, I have not touched the scoring files. These will presumably have to be updated to reflect the changes.

swarbred commented 4 years ago

@gemygk @lucventurini I've installed this as mikado-2.0rc6_9088796_CBG

@gemygk I will test this by running on Uromyces beticola but it would be good if you could rerun the transcript run for Aphid as well.

lucventurini commented 4 years ago

Just a quick note that there was a small uncorrected mistake that I remedied to in 3dc2ed9.

This bug should have a negligible effect (basically it would mark a transcript as retained intron if one of its boundaries was exactly the boundary of an intron falling after it), so it should have very few differences with 9088796. But this one is the version that will end up being merged into master. Many apologies, I realised only when I was reviewing the final tally of the changes compared to master.

swarbred commented 4 years ago

@lucventurini i've looked at runs with 3dc2ed9 with keep_cds_disrupted_by_ri = true and false

This generally looks ok in that with keep_cds_disrupted_by_ri = true the extra models that are output make sense. It's harder to check to see if models not output were correctly rejected by these flags as there many reasons transcripts might be thrown out and without checking the scoring and turning off all the cds cdna length checks etc it's difficult to know the reason.

Having said that then I came across this region pickrun2 = keep_cds_disrupted_by_ri = false pickrun3 = keep_cds_disrupted_by_ri = true

My query is are re expecting CoDisc_scallop_CoDisc_sca.200.0.4 in the pickrun2? Given CoDisc_scallop_CoDisc_sca.200.0.3 is excluded from pickrun2

i.e. do we make a special case for CoDisc_scallop_CoDisc_sca.200.0.4 because the retained intron that causes the disrupted cds is the last intron in one of the other transcripts.

lucventurini commented 4 years ago

Hi @swarbred, that looks suspicious. I will have a look at the code on Monday, thanks for giving the example.

lucventurini commented 4 years ago

Hi @swarbred , actually I have a question: does the strange result have more than one ORF? Specifically a second ORF after the stop reported there? I am asking because the algorithm checks for the combined CDS end. If there was a shorter ORF on the end of the transcript, it could confuse the procedure.

If you could send me the GTF of these four transcripts, I will also devise and run tests locally.

swarbred commented 4 years ago

@lucventurini

No secondary orf in the two transcripts mentioned above

# Sequence Data: seqnum=2982;seqlen=1787;seqhdr="CoDisc_scallop_CoDisc_sca.200.0.3"
CoDisc_scallop_CoDisc_sca.200.0.3   Prodigal_v2.6.3 CDS 2   982 105.1   +   0   ID=2982_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.412;conf=100.00;score=105.09;cscore=101.87;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;

# Sequence Data: seqnum=2978;seqlen=1698;seqhdr="CoDisc_scallop_CoDisc_sca.200.0.4"
CoDisc_scallop_CoDisc_sca.200.0.4   Prodigal_v2.6.3 CDS 2   1216    152.5   +   0   ID=2978_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.405;conf=100.00;score=152.48;cscore=149.26;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;

Attached the gtf of the region extracted from the final pick file (extracted with gffread)

mikado.loci.Contig_0000010_80459_84569.gff3.txt

lucventurini commented 4 years ago

Hi @swarbred, Great, thank you. I will revise and hunt down the bug. I will also revise the logic Re combined vs selected CDS. Rereading the code I think mikado already checks for untranslated fragments between ORFs, but I will reanalyse and comment to make sure.

lucventurini commented 4 years ago

Hi @swarbred, Many thanks for the file, I have identified the source of the bug in this function. I am amending it presently.

lucventurini commented 4 years ago

Hi @swarbred , I think I have solved it. The only caveat is the following case:

IMG_20200218_115702

Previously, the third transcript was marked as NOT having a retained intron. This is incompatible with the case you highlighted above (which is now solved and unit-tested), and I also think it harkens back to older definitions of the method.

Please let me know if the updated version (1258c37) functions as needed, and whether you are OK with the change in the image above.

swarbred commented 4 years ago

@lucventurini

Just to clarify transcript 2 and 3 are marked as having a retained intron, correct?

and they are also excluded by keep_cds_disrupted_by_ri = false ?

lucventurini commented 4 years ago

Yes to both questions.

swarbred commented 4 years ago

@lucventurini testing with 1258c37

These are the different configurations

configuration_run6_test1.toml

keep_cds_disrupted_by_ri = true keep_retained_introns = true

configuration_run6_test2.toml

keep_cds_disrupted_by_ri = false keep_retained_introns = true

configuration_run6_test3.toml

keep_cds_disrupted_by_ri = true keep_retained_introns = false

configuration_run6_test4.toml

keep_cds_disrupted_by_ri = false keep_retained_introns = false

2nd transcript from the top mikado.Calendula_officinalis_EIV1.1_Contig_0000010G16.2 should not be in configuration_run6_test2 as keep_cds_disrupted_by_ri = false

Looks like keep_retained_introns = is functioning ok in this example but not keep_cds_disrupted_by_ri

lucventurini commented 4 years ago

This issue will make me start go bald out of tearing my own hair out ... OK, something's amiss.

Could you send me the mikado_loci.gff3 of configuration 1, please? I do not understand how this can be, I should have fixed this exact case on the latest commit.

swarbred commented 4 years ago

Arghhhh it's me

mikado-2.0rc6_1258c37_CBG.def git checkout 2c9746b

must not have correctly saved the def file after editing the commit

Well my error is quicker to redo

lucventurini commented 4 years ago

No problem, I will wait the results of the new run!

swarbred commented 4 years ago

so good news after running the correct version the above loci looks correct (test1 etc as described above)

swarbred commented 4 years ago

@lucventurini

the query is should we expect CoDisc_stringtie_CoDisc_str.184.2 in the test2 run

keep_cds_disrupted_by_ri = false keep_retained_introns = true

lucventurini commented 4 years ago

@swarbred

the query is should we expect CoDisc_stringtie_CoDisc_str.184.2 in the test2 run

Yes, for the current definition yes. I can add a further definition to catch it: ie if a retained intron upstream of the CDS start completely spans a CDS intron, then mark the transcript as having its CDS disrupted. Would that work?

swarbred commented 4 years ago

Hi @lucventurini

Yes that would work, with that incorporated could you summarise the logic for identifying if a retained intron disrupts the CDS as i'm not clear why in the logic we would need to make a distinction between if the retained intron causes either a premature stop or causes an internal ATG to be used.

Could the logic be that for any retained introns identified in a transcript (so ignoring retained introns that fully code, as they are not RI for our purpose) check if the intron spans a CDS intron in another transcript (I assume this is against all transcripts not just the primary), why wouldn't that work? How is that different to the current logic?

So in the example https://user-images.githubusercontent.com/8897821/74734614-cf759780-5246-11ea-82dd-eda5b8138c5b.jpg

when you indicated the third model was flagged as disrupted you were viewing this as a reverse strand model i.e. the diagram shows 3' to 5'?

lucventurini commented 4 years ago

Could the logic be that for any retained introns identified in a transcript (so ignoring retained introns that fully code, as they are not RI for our purpose) check if the intron spans a CDS intron in another transcript (I assume this is against all transcripts not just the primary), why wouldn't that work? How is that different to the current logic?

This could end up flagging as "CDS disrupted" transcripts with a retained intron in the 3' UTR, with the CDS ending though well within an exon. Hence why the logic is a little bit more complex.

Currently the logic for is as follows:

exclude from consideration (also for retained introns) any internal exon which is fully coding
mark as "retained intron" any remaining exon which fully spans two splice sites OR any terminal exon which ends within an intron
if an exon is partially coding, and its non-coding parts are within a CDS intron of another transcript, mark it as "CDS disrupted".
if an exon is terminal and coding, and it ends within a CDS intron of another transcript, mark it as "CDS disrupted".

The additional point would therefore be:

if an exon is 5' UTR and is a retained intron of a CDS intron, then mark the transcript as "CDS disrupted".

I will make the changes now, should be up in ~1 hour tops.

lucventurini commented 4 years ago

@swarbred

when you indicated the third model was flagged as disrupted you were viewing this as a reverse strand model i.e. the diagram shows 3' to 5'?

No, on both strands it would be marked as "CDS disrupted". The point being that the non-coding parts of the partially coding exon are intersecting a CDS intron.

lucventurini commented 4 years ago

@swarbred

Update: please see b8e1ca84 as it fixes the issue and the compatibility, below.

Please see 9db1ab89. PS: the checks fail because of an update in msgpack. I am now working to understand how to fix the problem.

swarbred commented 4 years ago

Hi @lucventurini

CoDisc_stringtie_CoDisc_str.184.2 (above example) is still returned with keep_cds_disrupted_by_ri = false

out_swarbred.mikado-2.0rc6_9db1ab8_CBG.configuration_run6_test2.25812894.err.txt mikado.loci.gff3.txt

lucventurini commented 4 years ago

@swarbred

OK, having a look now.

lucventurini commented 4 years ago

@swarbred

Solved, I mistakenly had Mikado stop looking for breaks in the CDS after finding the first retained intron. This caused the failure when a 5' UTR exon retained both CDS and non-CDS introns.

Please use 01ef83f for testing (this commit also ensures that a non-coding transcript will not be marked as having a CDS broken by a retained intron - which would be non-sensical).

swarbred commented 4 years ago

@lucventurini yes agree this is correct for the test region

lucventurini commented 4 years ago

@lucventurini yes agree this is correct for the test region

Fantastic :-) Please let me know when your tests on this are complete, so that we can finally merge into master and close. I hope (and am cautiously confident?) that maybe this time we have it.

EI-CoreBioinformatics / mikado

Potential retained intron bug RC6 #255

2 is a retained intron (new stop within the CDS intron of #1)

3 is a retained intron (new start within the CDS intron of #1)

4 is a retained intron (new stop within the intron of #1 )

5 is not clear to me, is its 5'most splicing site meant to be within the first intron of #1?

6 is not a retained intron, the new stop exists because of a novel exon - not because the pre-existing exon "bleeds" into the intron.