Closed swarbred closed 4 years ago
Hi David,
the "find_retained_introns" method has not changed between the two releases.
Another issue: the wrong model should be mikado.scaffold_1G6704.4|mikado_all_mikado.scaffold_1G8446.3
.
However, I cannot find it among in the mikado.loci.gff3
in neither folder you indicated:
Supposedly mikado v. 2.0rc4:
/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest10
This should have 3 output models, but I actually see 4. Extra model: Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.5
Supposedly mikado v. 2.0rc6:
/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest8
This should have 5 output models, but I see 3. That is, I see the three models that in Apollo are assigned to 2.0rc4.
Would you please be able to confirm the folder paths?
hi @lucventurini
Paths and versions used look correct to me.
directory integration_run4_dstest10 - mikado-2.0rc4
correctly outputs 4 models: alias=Mikado_annotation_run6_ptn_coding_HC_repTrue_mikado.scaffold_1G6704.1 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.2 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.3 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.5
with Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.4 correctly excluded as it contains a retained intron
directory integration_run4_dstest8 - mikado-2.0rc6_3f62484_CBG outputs alias=Mikado_annotation_run6_ptn_coding_HC_repTrue_mikado.scaffold_1G6704.1 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.2 alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.5
so the difference is alias=Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.3 being excluded in the mikado-2.0rc6_3f62484_CBG run.
This model being excluded seems to be an error, if you change to keep_retained_introns: true then this model will be returned and you can see in the log that it's being flagged as a retained intron transcript.
Mikado_annotation_run6_ptn_coding_HC_repFalse_mikado.scaffold_1G6704.3 has 1 retained introns ([(58664468, 58665080)])
Hi @swarbred , thank you for the clarification, I had misunderstood the problem.
I will have a look at what is causing the change in behaviour and report back.
Hi @swarbred, @gemygk,
I traced the cause of the disparity. It is explained explicitly here:
Fixed in 443e9e2. Now Mikado will do the following:
- if a transcript is non-coding, consider all its exons.
- If a transcript is coding: -- if the exon is a complete (ie fully UTR) 3' UTR exon, do not consider it further. -- If the exon is completely coding, do not consinder it further. -- if a transcript is partially coding: --- if it is in the 5' UTR, consider all the 5'UTR (to detect a delayed Met due to an upstream retained intron) --- else, only consider the boundary between the coding and non-coding areas. These changes ensure that only transcripts where there is a reasonable suspicion that either the Met or the stop codon are misplaced will be marked as having a retained intron.
Please note the bold part.
I do think that these changes are correct, and I remember us discussing them at some point. However, I also understand that in this specific case, where the Met start did not change due to the retained intron, you might see this result as incorrect.
What should the behaviour be in instances like this?
HI @lucventurini @gemygk
Strictly I wouldn't view the original transcript i queried as having a retained intron (green box below)
Screen Shot 2020-01-06 at 14.38.21.pdf
In this case the model contains a unique splice site, a retained intron should extend through the entire intron sequence of another model. We might have loosened this definition in the original logic as in RNA-seq assemblies we will have models where the terminal extends into the intron and we might just view the model as being a fragment and still want to flag these as retained (see diagram below).
I'm not sure if I follow the logic you have described above, it would seem to suggest that the example in the green box (i.e. a complete UTR exon) isn't a retained intron but in this case the model was identified as such by mikado-2.0rc6_3f62484_CBG
I've since run mikado-2.0rc1 and mikado-2.0rc6_3f62484_CBG on the same data and have this example
The model selected in red was marked as NOT having a retained intron in mikado-2.0rc6_3f62484_CBG but was in mikado-2.0rc1 . I would view mikado-2.0rc1 as correct here.
How would the current logic deal with the following examples, which would be flagged as having a retained intron.
I would say #4 and #6 have a retained intron and potentially #2 and #3 if you make a special case for terminal exons (not sure how either the new or old logic would handle these).
Hi @swarbred , yes I see what you are saying. Regarding the example we are discussing (hence the green box in the PDF) the exon gets flagged as retained because:
1- it is completely non-coding and in the 5' UTR 2- it has at least some part which falls within another intron (again in the 5' UTR)
In the second example you are flagging, the exon does not get remarked as a retained intron because: 1- the exon is only partially non-coding 2- the CDS start is outside of the intron itself.
The aim, which is proving quite hard to implement, is straight-forward: mark as retained introns only those retentions that could affect the CDS for coding transcripts. This is because Mikado should consider retained introns as legitimate ASs unless they disrupt the coding region.
Hence why probably in both cases above Mikado should not consider either, I think, as retained introns (ie cause for exclusion).
Coming to your diagram (but we would have to verify the behaviour of Mikado with a mock-up):
PS: the function is defined here. It has two steps:
The logic, again, is to mark as a "retained intron" any exon for which we have evidence that the CDS got interrupted because of not having respected the intron boundaries - or, alternatively, because it is a retained intron in the 5' UTR.
Hi @swabred, following the discussion today, it might be a good idea to find a good set of papers laying down widely accepted definitions of "retained intron". I have found this preliminary panel of articles to read:
Please add any other you think relevant!
HI @lucventurini
There is 1) what's the definition of intron retention event and 2) how we want to apply this in Mikado the two probably are not the same.
On my hand diagram I think only 4 and 6 represent what would be regarded as strictly intron retention. Number 5 would be an alternative donor splice site, number 2 could be an alternative Polyadenylation site (if you take the transcript as being full-length), number 3 would be an alternative transcription start site (again if you take the transcript as full-length).
So if we apply what I believe is the general use of retained intron we would filter 4 and 6 (point me to a source if you disagree).
However the original intention was for this to filter out transcripts that were common to RNA-seq assemblies which we viewed as undesirable, this would have included 2 and 3 as we viewed these as fragments rather than full-length transcripts i.e. we have only a partial assembly and the true start and end would give the transcript a structure with a full intron retention event. If the transcript was known to be a full-length transcript then I would view these two examples differently.
I think this also ties in with the CDS /cDNA overlap check we apply, I believe we may have changed this (but correct me if i'm wrong), lets assume the cds overlap is applied and set to 80% I think originally examples like number 2 would have met this threshold as greater than 80% of the shorter transcript (2) is contained in the larger number 1 model, I think we switched how this is handled and now number 2 would be filtered out as the cds overlap is relative to the longer transcript number 1.
If this overlap logic is indeed changed then it might mean that the "keep retained introns" flag isn't as relevant for identifying and filtering some types of transcripts.
The original purpose was (from my perspective) was to filter out undesirable transcripts that we were not pulling out through other filters we apply as part of the alt splice checks i.e like cases where the penultimate exon is retained (2 and 3 below) as such transcript will be fairly common in RNA-Seq assemblies and are likely not filtered out by cds/cdna overlap or UTR requirements. Hence the focus on intron retention in CDS exons.
[]--[]--[]--[] []--[]--[_] []--[]--[____]
I'm not clear why we make a distinction between 5' and 3' UTR exons for purposes of retained introns other than you have more 5' UTR exons than 3' UTR and they wont affect the CDS, the green box example is not something that I think we intended to exclude (in that same snapshot the red box transcript is a genuine retained intron). This part
"if the candidate intersects only one of the two boundaring exons, mark it as retained only if it is a terminal exon."
is to catch number 2 in my example not the green box example where the exon while terminal splices on to another exon. Also if I read the logic correct if in the green box example we added another 5' exon (the same as the top transcript in the pic) then this would then NOT be classed as retained as it wouldn't be a terminal exon.
If the logic was to ignore entirely coding exons, and then check partial and UTR exons to see if they overlap introns, for those that do then for the:
internal exons they would need to span an entire intron and share the same exon1 start and exon2 end of the introns two flanking exons
for terminal exon they would need to only share the same internal exon start coordinate with the exon of the overlapping intron to be regarded as meeting the retained intron requirement.
Another set of examples as this is not easy to follow, those marked with a cross is what I would want this filter to exclude.
What do you think?
I think my only disagreement with your figure is case 9: I would exclude it, as the stop is clearly an NMD due to intron retention, and seems to be the non-fragmentary version of 8. Thoughts?
In practice I was trying to use the retained intron flag in Mikado to do something different from what you were describing - ie identify and filter out NMD induced by the retained introns, as well as transcripts where the atg was mangled by a retained intron in the 5' UTR. Importantly, this means also excluding full length transcripts that are actually real (if transient), not just artefacts coming from the sequencing or assembly protocols.
@swarbred Rereading your comment above I can see why you would class no. 9 as a non retained intron. From what I understand your logic could be formalised as follows:
So in practice we would be adding the additional constraint that a retained intron event must have at the 5' the same exon start as the template, correct transcript. I'm not clear on the 3' end. My questions on this would be:
if and only if it starts at an exon A of another transcript and ends at the successive exon B of the same transcript
or exon C, D in the case it retains multiple introns
So in practice we would be adding the additional constraint that a retained intron event must have at the 5' the same exon start as the template, correct transcript. I'm not clear on the 3' end. My questions on this would be:
for the terminal exons what matters is that the internal splice site of the exon matches the reference
what happens if the exon is not terminal but ends up spanning the whole intron, ending in exon B of the template, but not at the boundary of B? Or even spans more than one intron/exon? I'm thinking about the completely immature RNA here, with most of the introns still unspliced.
If it spans the whole intron but ends within exon B then this is an alternative splice site and not a retained intron. If it spans multiple introns but still ends on an exon boundary then it's a retained intron.
The 5' end is always harder to call precisely. What happens with case no. 6 or a hypothetical no. 6" where the starting e on begins within the first exon of the template but not at the exact boundary?
Based on what i described number 6 should have been marked as a cross as it's a terminal exon and it's internal splice site is shared with the reference transcript and it's internal exon is within the intron (exon would also meet the criteria( of the reference transcript.
where the starting e on begins within the first exon of the template but not at the exact boundary?
Based on what i wrote even landing in the first exon would meet the criteria. However in practice from RNA-Seq assemblies you will likely see less examples of these most will end in the intron as the coverage drops.
Number 5,6 and 8 are challenging as if these are full length transcripts then they are not retained intron events.
I change my mind on this the more i think about it. To be honest how this filter is named is partly the issue as it suggests the intention is to filter out all retained introns which it doesn't do and wasn't the original intention. I think the original intention was to mainly catch number 5 and 8 as given the data is from RNA assemblies we take these as being fragments and they are both cases where the final CDS exon bleeds into the intron, when this occures towards the 3' end of the transcript none of the other filters we regularly use will catch this.
I guess what I don't like about the current logic is the 5' and 3' UTR distinction, i.e. that we are removing transcripts with 5' UTR retained introns but allowing in 3' UTR retained introns, that seems like an inconsistency.
I think we can go two ways either the logic as above which catches 3,5,6 and 8 (basically catching all retained introns except those that code through an intron)
or
a logic which only catches 5 and 8 i.e. cares only about exons that are partial CDS and UTR
In both cases the original green box example is NOT a retained intron transcript
What do you think?
The second option would be to limit the number of models with CDS that originate from intron sequence, simply because from RNA-seq data you can potentially construct many models like this and most (though not all) are likely artefacts or not functional. We would be viewing these as problematic simply because the intron retention affects the CDS and we don't want to pollute the protein sequence repositories.
The first option provides a specific way of removing transcripts with intron retentions.
Their is an argument for both approaches. I'm going to annoy you :-) but my preference (as of now) would be to have two distinct options
Keep_retained_partial_coding default false Keep_retained_exons default true
But i'm open to suggestions.
Hi @swarbred , admittedly with this filter I would be mainly interested in catching cases such as no 5 and 8, where the retained intron event is clearly a probable cause of NMD.
I would also be interested in catching no.3, for a simple reason - it might still be non-functional. Imagine there is an ATG in the retained intron at the 5'UTR, leading to a short protein (let's say ~30AA). By default, Mikado would not load this secondary ORF onto the transcript (to avoid splitting it). However, the ribosome is not as smart, and might end up starting to create this protein, only to end up releasing the transcript soon after ... and therefore not translating the correct ORF. If we therefore kept this example in, it would be an immature splicing event that we should have excluded.
So I guess that my hunch would be for a logic that excludes 3, 5, 8 for sure, but might (or might not) exclude no. 6.
Admittedly I would keep this simple and with a single flag :-P But I agree that we should change its name to make sure that it is clear to users (including us!) exactly what we are trying to catch with this control.
PS: I will create a series of unit tests replicating the figure above, in order to make sure that the modifications will do what we want.
I would also be interested in catching no.3, for a simple reason - it might still be non-functional. Imagine there is an ATG in the retained intron at the 5'UTR, leading to a short protein
If there was an upstream ORF then this could make the transcript a target for NMD but I don't think that something we want to get into here, you could make a similar argument for any transcript e.g. number 2. I wouldn't make a special case for 5UTR retained introns e.g. number 3.
Number 6 is only different to 8 if you take both as full-length transcripts, in all likelihood if you looked across RNA-Seq assemblies generated from multiple samples you will find many transcripts with the same structure but alternative 5' starts i.e. potentially positions intermediate to 6 and 8, so it might seem a bit strange to filter out just those where the first exon is longer (i.e. removing 8 not 6). That said while pacbio and nanopore don't necessarily give full length transcripts the exact same transcript 6 from this data and I'm more inclined to retain it as an alternative TSS. So I could accept 6 (i.e. non coding exons with the outer edge ending in the intron) being treated differently to 8.
Though would the padding just extend 6 to be like 8 anyway?
If there was an upstream ORF then this could make the transcript a target for NMD but I don't think that something we want to get into here, you could make a similar argument for any transcript e.g. number 2. I wouldn't make a special case for 5UTR retained introns e.g. number 3.
OK, I see your case. I think it might let through some subtly erroneous transcripts, but I can live with that.
Number 6 is only different to 8 if you take both as full-length transcripts, in all likelihood if you looked across RNA-Seq assemblies generated from multiple samples you will find many transcripts with the same structure but alternative 5' starts i.e. potentially positions intermediate to 6 and 8, so it might seem a bit strange to filter out just those where the first exon is longer (i.e. removing 8 not 6). That said while pacbio and nanopore don't necessarily give full length transcripts the exact same transcript 6 from this data and I'm more inclined to retain it as an alternative TSS. So I could accept 6 (i.e. non coding exons with the outer edge ending in the intron) being treated differently to 8.
I agree, I would keep no. 6 in, and you answered my doubts about the slight modification.
Though would the padding just extend 6 to be like 8 anyway?
No, we explicitly disallow the padding when the transcript ends within an intron (see here). This is exactly to avoid introducing mistakes like turning case 6 into case 8.
See next comment for summary
Their is an argument for both approaches. I'm going to annoy you :-) but my preference (as of now) would be to have two distinct options Keep_retained_partial_coding default false Keep_retained_exons default true
I think that basically this is right, although it does require some coding and the addition of another transcript metric ...
So we would have the following:
Retained introns would be defined as:
This would classify all of no. 3, 5, 6 and 8 as having retained introns (as this is a structural assessment). Only no. 5 and 8 would be classified as having a CDS disrupted by the RI.
If we are going to code things like this, it might also make sense to flag no. 4 as having a retained intron, although the "cds_disrupted_by_ri" would of course be set to False.
Thoughts?
Hi @swarbred , while checking the code I did realise that there is an additional complexity. Let us say that we have the following situation:
Transcripts 1 and 2 do not have a retained intron. Transcript 3 has a retained intron, BUT its boundaries are one for transcript 2 (5') and the other on transcript 1 (3').
Currently the data structure I use to store the splicing graph in the locus does not remember the association. It would recognise that there are two possible exons as donors for the intron, and two alternatives for the acceptor. Unless I modify the data structure, then, transcript 3 would be classified as having a retained intron - even if that specific arrangement of exons had not been found in any of the reconstructed transcripts.
I think this is the correct behaviour (as there is evidence from other transcripts that both boundaries are correct), but I would like to hear your thoughts about it.
@lucventurini yes your above example is fine, that's a retained intron.
(non-coding or partially-coding?) internal exons that completely span one (or more) introns of another transcript and that
Yes both non-coding and partially coding
If we are going to code things like this, it might also make sense to flag no. 4 as having a retained intron, although the "cds_disrupted_by_ri" would of course be set to False.
number 4 is a retained intron but I think we should still exclude this from our definition as the use of this metric would be within the splicing requirements to exclude IR transcripts and dont think we really ever want to exclude 4.
The metric definition should make this clear then people are aware if they use this.
(non-coding or partially-coding?) internal exons that completely span one (or more) introns of another transcript and that
Yes both non-coding and partially coding
If we are going to code things like this, it might also make sense to flag no. 4 as having a retained intron, although the "cds_disrupted_by_ri" would of course be set to False.
number 4 is a retained intron but I think we should still exclude this from our definition as the use of this metric would be within the splicing requirements to exclude IR transcripts and dont think we really ever want to exclude 4.
The metric definition should make this clear then people are aware if they use this.
OK, I am starting to code things up. Not sure I will be able to finish the mods before leaving for annual leave on 23rd night, but I will try my best.
Hi @swarbred, I am almost finished with the algorithmic modifications. However, there is an important caveat: now we will have two metrics associated with this feature, not one. This will most probably require modifying the scoring files to penalise heavily transcripts with their CDS disrupted and less (or even not at all) retained introns per se. Thoughts and/or ideas on how to go about this?
Hi @swarbred , @cschu , @gemygk
As of 9088796, the code in issue-255 should now be functional. I have tested the behaviour of the new definition and it seems to do what we want, according to the definitions above. Specifically:
mikado configure
and mikado pick
can now influence these behaviours through CLI options.Coding-wise, this issue should be solved. If you could test and verify, we can then push these changes into master
.
However, importantly, I have not touched the scoring files. These will presumably have to be updated to reflect the changes.
@gemygk @lucventurini I've installed this as mikado-2.0rc6_9088796_CBG
@gemygk I will test this by running on Uromyces beticola but it would be good if you could rerun the transcript run for Aphid as well.
Just a quick note that there was a small uncorrected mistake that I remedied to in 3dc2ed9.
This bug should have a negligible effect (basically it would mark a transcript as retained intron if one of its boundaries was exactly the boundary of an intron falling after it), so it should have very few differences with 9088796.
But this one is the version that will end up being merged into master
.
Many apologies, I realised only when I was reviewing the final tally of the changes compared to master
.
@lucventurini i've looked at runs with 3dc2ed9 with keep_cds_disrupted_by_ri = true and false
This generally looks ok in that with keep_cds_disrupted_by_ri = true the extra models that are output make sense. It's harder to check to see if models not output were correctly rejected by these flags as there many reasons transcripts might be thrown out and without checking the scoring and turning off all the cds cdna length checks etc it's difficult to know the reason.
Having said that then I came across this region pickrun2 = keep_cds_disrupted_by_ri = false pickrun3 = keep_cds_disrupted_by_ri = true
My query is are re expecting CoDisc_scallop_CoDisc_sca.200.0.4 in the pickrun2? Given CoDisc_scallop_CoDisc_sca.200.0.3 is excluded from pickrun2
i.e. do we make a special case for CoDisc_scallop_CoDisc_sca.200.0.4 because the retained intron that causes the disrupted cds is the last intron in one of the other transcripts.
Hi @swarbred, that looks suspicious. I will have a look at the code on Monday, thanks for giving the example.
Hi @swarbred , actually I have a question: does the strange result have more than one ORF? Specifically a second ORF after the stop reported there? I am asking because the algorithm checks for the combined CDS end. If there was a shorter ORF on the end of the transcript, it could confuse the procedure.
If you could send me the GTF of these four transcripts, I will also devise and run tests locally.
@lucventurini
No secondary orf in the two transcripts mentioned above
# Sequence Data: seqnum=2982;seqlen=1787;seqhdr="CoDisc_scallop_CoDisc_sca.200.0.3"
CoDisc_scallop_CoDisc_sca.200.0.3 Prodigal_v2.6.3 CDS 2 982 105.1 + 0 ID=2982_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.412;conf=100.00;score=105.09;cscore=101.87;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
# Sequence Data: seqnum=2978;seqlen=1698;seqhdr="CoDisc_scallop_CoDisc_sca.200.0.4"
CoDisc_scallop_CoDisc_sca.200.0.4 Prodigal_v2.6.3 CDS 2 1216 152.5 + 0 ID=2978_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.405;conf=100.00;score=152.48;cscore=149.26;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
Attached the gtf of the region extracted from the final pick file (extracted with gffread)
Hi @swarbred, Great, thank you. I will revise and hunt down the bug. I will also revise the logic Re combined vs selected CDS. Rereading the code I think mikado already checks for untranslated fragments between ORFs, but I will reanalyse and comment to make sure.
Hi @swarbred, Many thanks for the file, I have identified the source of the bug in this function. I am amending it presently.
Hi @swarbred , I think I have solved it. The only caveat is the following case:
Previously, the third transcript was marked as NOT having a retained intron. This is incompatible with the case you highlighted above (which is now solved and unit-tested), and I also think it harkens back to older definitions of the method.
Please let me know if the updated version (1258c37) functions as needed, and whether you are OK with the change in the image above.
@lucventurini
Just to clarify transcript 2 and 3 are marked as having a retained intron, correct?
and they are also excluded by keep_cds_disrupted_by_ri = false ?
Yes to both questions.
@lucventurini testing with 1258c37
These are the different configurations
configuration_run6_test1.toml
keep_cds_disrupted_by_ri = true keep_retained_introns = true
configuration_run6_test2.toml
keep_cds_disrupted_by_ri = false keep_retained_introns = true
configuration_run6_test3.toml
keep_cds_disrupted_by_ri = true keep_retained_introns = false
configuration_run6_test4.toml
keep_cds_disrupted_by_ri = false keep_retained_introns = false
2nd transcript from the top mikado.Calendula_officinalis_EIV1.1_Contig_0000010G16.2 should not be in configuration_run6_test2 as keep_cds_disrupted_by_ri = false
Looks like keep_retained_introns = is functioning ok in this example but not keep_cds_disrupted_by_ri
This issue will make me start go bald out of tearing my own hair out ... OK, something's amiss.
Could you send me the mikado_loci.gff3
of configuration 1, please? I do not understand how this can be, I should have fixed this exact case on the latest commit.
Arghhhh it's me
mikado-2.0rc6_1258c37_CBG.def git checkout 2c9746b
must not have correctly saved the def file after editing the commit
Well my error is quicker to redo
No problem, I will wait the results of the new run!
so good news after running the correct version the above loci looks correct (test1 etc as described above)
@lucventurini
the query is should we expect CoDisc_stringtie_CoDisc_str.184.2 in the test2 run
keep_cds_disrupted_by_ri = false keep_retained_introns = true
@swarbred
the query is should we expect CoDisc_stringtie_CoDisc_str.184.2 in the test2 run
Yes, for the current definition yes. I can add a further definition to catch it: ie if a retained intron upstream of the CDS start completely spans a CDS intron, then mark the transcript as having its CDS disrupted. Would that work?
Hi @lucventurini
Yes that would work, with that incorporated could you summarise the logic for identifying if a retained intron disrupts the CDS as i'm not clear why in the logic we would need to make a distinction between if the retained intron causes either a premature stop or causes an internal ATG to be used.
Could the logic be that for any retained introns identified in a transcript (so ignoring retained introns that fully code, as they are not RI for our purpose) check if the intron spans a CDS intron in another transcript (I assume this is against all transcripts not just the primary), why wouldn't that work? How is that different to the current logic?
So in the example https://user-images.githubusercontent.com/8897821/74734614-cf759780-5246-11ea-82dd-eda5b8138c5b.jpg
when you indicated the third model was flagged as disrupted you were viewing this as a reverse strand model i.e. the diagram shows 3' to 5'?
Could the logic be that for any retained introns identified in a transcript (so ignoring retained introns that fully code, as they are not RI for our purpose) check if the intron spans a CDS intron in another transcript (I assume this is against all transcripts not just the primary), why wouldn't that work? How is that different to the current logic?
This could end up flagging as "CDS disrupted" transcripts with a retained intron in the 3' UTR, with the CDS ending though well within an exon. Hence why the logic is a little bit more complex.
Currently the logic for is as follows:
The additional point would therefore be:
I will make the changes now, should be up in ~1 hour tops.
@swarbred
when you indicated the third model was flagged as disrupted you were viewing this as a reverse strand model i.e. the diagram shows 3' to 5'?
No, on both strands it would be marked as "CDS disrupted". The point being that the non-coding parts of the partially coding exon are intersecting a CDS intron.
@swarbred
Update: please see b8e1ca84 as it fixes the issue and the compatibility, below.
Please see 9db1ab89. PS: the checks fail because of an update in
msgpack
. I am now working to understand how to fix the problem.
Hi @lucventurini
CoDisc_stringtie_CoDisc_str.184.2 (above example) is still returned with keep_cds_disrupted_by_ri = false
out_swarbred.mikado-2.0rc6_9db1ab8_CBG.configuration_run6_test2.25812894.err.txt mikado.loci.gff3.txt
@swarbred
OK, having a look now.
@swarbred
Solved, I mistakenly had Mikado stop looking for breaks in the CDS after finding the first retained intron. This caused the failure when a 5' UTR exon retained both CDS and non-CDS introns.
Please use 01ef83f for testing (this commit also ensures that a non-coding transcript will not be marked as having a CDS broken by a retained intron - which would be non-sensical).
@lucventurini yes agree this is correct for the test region
@lucventurini yes agree this is correct for the test region
Fantastic :-) Please let me know when your tests on this are complete, so that we can finally merge into master and close. I hope (and am cautiously confident?) that maybe this time we have it.
I'm seeing a difference between mikado-2.0rc4 and mikado-2.0rc6_* versions in relation to calling and exclusion of transcripts with retained introns.
I'm attaching a screenshot of this region http://apollo.tgac.ac.uk/Myzus_persicae_O_v2_genome_browser/jbrowse/?loc=scaffold_1%3A58658101..58667080&tracks=DNA%2CAnnotations%2CMikado_annotation_run6_classification%2CMikado_integration_run4%2CScallop_lncRNA%2CStringtie_lncRNA%2CYa_locus&highlight=
The input models are shown in track ..run6_classification the output models of running mikado version mikado-2.0rc6_3f62484_CBG are shown in track mikado integration run 4
Two models are excluded from the original input mikado.scaffold_1G6704.4 (correctly as a retained intron transcript) and mikado.scaffold_1G6704.3
This mikado.scaffold_1G6704.3 model I dont think should be viewed as a retained intron transcript and running previous versions mikado-2.0_rc1 and mikado-2.0rc4 on the same input models and config gives mikado.scaffold_1G6704.3 in the output.
For my own knowledge Luca can you confirm that for the retained intron check the order (i.e. the relative scoring) of the transcripts matters i.e. each potential alt splice model is assessed against the primary model and the other models currently added to the locus. So if a transcript is the second highest scoring it might not be regarded as having a retained intron relative to the primary model but if the same transcript scored lower i.e. other transcripts were added before the retained intron check was made then potentially against these it now may have a retained intron and be excluded.
Correct version mikado-2.0rc4
output directory
Incorrect? version
output directory /tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest8