EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Potential retained intron bug RC6 #255

Closed swarbred closed 4 years ago

swarbred commented 4 years ago

I'm seeing a difference between mikado-2.0rc4 and mikado-2.0rc6_* versions in relation to calling and exclusion of transcripts with retained introns.

I'm attaching a screenshot of this region http://apollo.tgac.ac.uk/Myzus_persicae_O_v2_genome_browser/jbrowse/?loc=scaffold_1%3A58658101..58667080&tracks=DNA%2CAnnotations%2CMikado_annotation_run6_classification%2CMikado_integration_run4%2CScallop_lncRNA%2CStringtie_lncRNA%2CYa_locus&highlight=

The input models are shown in track ..run6_classification; the output models from running mikado version mikado-2.0rc6_3f62484_CBG are shown in track mikado integration run 4.

Two models from the original input are excluded: mikado.scaffold_1G6704.4 (correctly, as a retained intron transcript) and mikado.scaffold_1G6704.3.

I don't think the mikado.scaffold_1G6704.3 model should be viewed as a retained intron transcript. Running the previous versions mikado-2.0_rc1 and mikado-2.0rc4 on the same input models and config gives mikado.scaffold_1G6704.3 in the output.

For my own knowledge, Luca, can you confirm that the order (i.e. the relative scoring) of the transcripts matters for the retained intron check? That is, each potential alt splice model is assessed against the primary model and the other models already added to the locus. So if a transcript is the second highest scoring, it might not be regarded as having a retained intron relative to the primary model; but if the same transcript scored lower, so that other transcripts were added before its retained intron check was made, it might have a retained intron relative to those and be excluded.
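The order dependence asked about here can be illustrated with a toy sketch. This is not Mikado's implementation: the function names (`introns`, `retains_intron`, `build_locus`), the coordinates, and the simplified definition of a retained intron (an exon of the candidate fully spans an intron of a transcript already in the locus) are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates


def introns(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])]


def retains_intron(candidate: List[Interval], reference: List[Interval]) -> bool:
    """Simplified (assumed) test: some candidate exon fully spans a reference intron."""
    return any(ex_start <= in_start and in_end <= ex_end
               for ex_start, ex_end in candidate
               for in_start, in_end in introns(reference))


def build_locus(ranked: List[Tuple[str, List[Interval]]]) -> List[str]:
    """Add transcripts best-first; skip any that retain an intron of a kept one."""
    kept: List[Tuple[str, List[Interval]]] = []
    for name, exons in ranked:
        if any(retains_intron(exons, other) for _, other in kept):
            continue  # excluded as a retained intron transcript
        kept.append((name, exons))
    return [name for name, _ in kept]


# Hypothetical models, not taken from the locus in this issue.
primary = ("primary", [(100, 200), (300, 400), (700, 800)])
x = ("X", [(100, 200), (300, 400), (500, 800)])
y = ("Y", [(100, 200), (300, 400), (450, 520), (600, 800)])

print(build_locus([primary, x, y]))  # X is checked against the primary only
print(build_locus([primary, y, x]))  # X is checked against the primary and Y
```

In this toy example X passes when it is ranked second, but it is excluded when Y is ranked above it, because X's last exon spans an intron of Y. Whether the real retained intron check behaves the same way is exactly the question posed above.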

Correct version mikado-2.0rc4

sbatch -p ei-cb -c 1 --mem 20G -o out_mikado.serialise-and-pick.run4.%j.log -J Ov2_Mikado_SP --wrap "source mikado-2.0rc4 && /usr/bin/time -v mikado pick --mode nosplit --seed 10 --procs 1 --json-conf mikado.configuration.integration.run4_scaffold_1_58658101_58667080.yaml --subloci_out mikado.subloci.gff3 --monoloci_out mikado.monoloci.gff3 --output-dir ./integration_run4_dstest10 -lv DEBUG"

output directory

/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest10

Incorrect? version

sbatch -p ei-cb -c 1 --mem 20G -o out_mikado.serialise-and-pick.run4.%j.log -J Ov2_Mikado_SP --wrap "source mikado-2.0rc6_3f62484_CBG && /usr/bin/time -v mikado pick --mode nosplit --seed 10 --procs 1 --json-conf mikado.configuration.integration.run4_scaffold_1_58658101_58667080.yaml --subloci-out mikado.subloci.gff3 --monoloci-out mikado.monoloci.gff3 --output-dir ./integration_run4_dstest8 -lv DEBUG"

output directory

/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0_rc1_run6/integration/integration_run4_dstest8

[Screenshot: Screen Shot 2020-01-06 at 14 38 21]
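To make the comparison of the two runs easier to check, here is a small sketch that tokenises the two `mikado pick` invocations quoted above and lists the positions where they differ. The helper name `differing_tokens` is hypothetical, and only the `mikado pick ...` part of each `sbatch --wrap` string is compared.

```python
import shlex
from typing import List, Tuple

# The "mikado pick" portion of each command, copied from the two runs above.
RC4_CMD = (
    "mikado pick --mode nosplit --seed 10 --procs 1 "
    "--json-conf mikado.configuration.integration.run4_scaffold_1_58658101_58667080.yaml "
    "--subloci_out mikado.subloci.gff3 --monoloci_out mikado.monoloci.gff3 "
    "--output-dir ./integration_run4_dstest10 -lv DEBUG"
)
RC6_CMD = (
    "mikado pick --mode nosplit --seed 10 --procs 1 "
    "--json-conf mikado.configuration.integration.run4_scaffold_1_58658101_58667080.yaml "
    "--subloci-out mikado.subloci.gff3 --monoloci-out mikado.monoloci.gff3 "
    "--output-dir ./integration_run4_dstest8 -lv DEBUG"
)


def differing_tokens(first: str, second: str) -> List[Tuple[str, str]]:
    """Return the (first, second) token pairs that differ, position by position."""
    a, b = shlex.split(first), shlex.split(second)
    if len(a) != len(b):
        raise ValueError("commands have a different number of tokens")
    return [(x, y) for x, y in zip(a, b) if x != y]


for old, new in differing_tokens(RC4_CMD, RC6_CMD):
    print(f"{old}  ->  {new}")
```

Apart from the sourced environment, the only differences are the spelling of the two `*loci` output options and the output directory, so the seed, mode, and configuration file are identical between the runs. That the underscore and hyphen spellings reflect an option rename between the two versions is an assumption; the issue does not say so.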

swarbred commented 4 years ago

@lucventurini I haven't checked beyond specific known examples, but I would like to bring together all the recent changes for the issues we believe are fixed, including the serialise memory issue, so that I can then use this version for a larger run.

It's up to you whether, at this point, you merge all these changes to master or to another branch.

lucventurini commented 4 years ago

@swarbred I will merge to master; I think it is time to bring everything together.

I will close #255, #263, #266 and #267 at the same time.

lucventurini commented 4 years ago

@swarbred

Could you please test 0c57a76 (new develop branch), which squashes together the edits coming from four different branches? There were no conflicts in merging, which is good (i.e. the branches were really working on completely different parts of the code).

After testing, we can put these changes in master and close four issues.

swarbred commented 4 years ago

crossed-fingers

lucventurini commented 4 years ago

@swarbred

I am really sorry to say that unfortunately I woke up and realised that there was a bug in the retained intron procedure :-(

In the following figure:

[Image: IMG_20200221_103027]

0c57a76 would not mark any of these as having a retained intron, let alone having their CDS mangled by having one. This is clearly wrong, I think.

I fixed the situation in d094f995 (still on the develop branch). Many apologies for this (I did say that this part of the code has me tearing my hair out!)

swarbred commented 4 years ago

@lucventurini :-( OK, I will install it and rerun my runs later today. Can I clarify that the issue is specific to single exon models, as shown above? If so, then I agree with your fix, but it's less of an issue for my data :-) as we will be excluding these transcripts for other reasons.

swarbred commented 4 years ago

Also, I assume that, as you indicated this is on the develop branch, it includes all the recent changes in 0c57a76.

lucventurini commented 4 years ago

> @lucventurini :-( OK, I will install it and rerun my runs later today. Can I clarify that the issue is specific to single exon models, as shown above? If so, then I agree with your fix, but it's less of an issue for my data :-) as we will be excluding these transcripts for other reasons.

Yes, it's specific to single exon models only.

> Also, I assume that, as you indicated this is on the develop branch, it includes all the recent changes in 0c57a76.

Yes, correct.

swarbred commented 4 years ago

@lucventurini Based on my full runs, I consider this resolved in d094f99