EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data * and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0
94 stars 18 forks source link

Overlapping superloci in my resultant mikado.loci.gff3 file #128

Closed anishdattani closed 5 years ago

anishdattani commented 5 years ago

Upon execution of the Mikado Pick, I get some unexpected results.

In the resultant mikado.loci.gff file, I seem to be getting 'superloci' that overlap in genomic coordinates. Surely these should be treated as isoforms not new loci?

I have attached an example of chromosome from this gff file; note that mikado.dd_Smes_g4_386G2 and mikado.dd_Smes_g4_386G4 overlap.

How can I avoid this? This will be troublesome for when I want to make a protein fasta - i.e. I will have two copies of the same locus.

I hope this makes sense, and thank you in advance for your help.

example.txt

lucventurini commented 5 years ago

Dear Anish, thank you for reporting your issue. When clustering transcripts into loci, Mikado takes into account the strand of the models to create the superlocus clustering. This is why in your file you have superloci with overlapping coordinates: the models lie on different strands. More specifically, some of the superloci are on the negative strand while the others have no strand assigned to them.

Normally Mikado would clear the latter ones as putative fragments; however, in your file, the non-stranded superloci have models with a defined CDS. This is the origin of the problem - and indeed, having a CDS with no strand defined is a mistake: the models should either be unstranded and without CDS, or be stranded with one. May I ask you how you generated the input for Mikado pick?

Luca Venturini

anishdattani commented 5 years ago

Dear Luca,

Thank you for your reply -

I have attached a notebook which explains the steps I took for executing Mikado and well as the "configuration.yaml’ file used for mikado pick.

In the mikado serialise step - the database was generated using ORF files generated from Transdecoder of my various assemblies that were mapped to my genome; as well as BLAST data and splicing information.

I then execute using:

mikado pick --json-conf configuration.yaml --subloci_out mikado.subloci.gff3

Thank you again for your time and help in this matter; I really appreciate it.

Best, Anish

On 1 Oct 2018, at 10:21, Luca Venturini notifications@github.com<mailto:notifications@github.com> wrote:

Dear Anish, thank you for reporting your issue. When clustering transcripts into loci, Mikado takes into account the strand of the models to create the superlocus clustering. This is why in your file you have superloci with overlapping coordinates: the models lie on different strands. More specifically, some of the superloci are on the negative strand while the others have no strand assigned to them.

Normally Mikado would clear the latter ones as putative fragments; however, in your file, the non-stranded superloci have models with a defined CDS. This is the origin of the problem - and indeed, having a CDS with no strand defined is a mistake: the models should either be unstranded and without CDS, or be stranded with one. May I ask you how you generated the input for Mikado pick?

Luca Venturini

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/lucventurini/mikado/issues/128#issuecomment-425842287, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVEqR08ww-uX31IWYqHW8w3da0XH2r0Xks5ugd61gaJpZM4XA8dD.

anishdattani commented 5 years ago

Apologies - I cannot post .yaml and the notebook.html files here. I have emailed them instead.
Thank you

anishdattani commented 5 years ago

I also have found another probable error in my mikado.loci.gff3 output - I seem to be lacking entire annotations for entire regions of chromosomes, despite these being annotated in the various gff files that are used as the input. Upon closer inspection, these missed annotations seem to be ones with large introns? Is there a parameter in the pick step that I need to modify. Moreover, as I am working with a flatworm, what scoring .yaml file would be best to use?

lucventurini commented 5 years ago

Dear @anishdattani , sorry for the late reply. I was inspecting your guide and I am not understanding an important point - did you derive the TransDecoder ORFs from the mikado_prepared.fasta file? Looking at the mikado serialise line, it would look like you used the input files to Mikado for that step. However, that would break the link with the transcripts, and would result in no transcript having a proper ORF annotated.

The best scoring file available for your project is probably celegans_scoring.yaml. It is the scoring file we used for our analyses on the paper.

In all of our scoring files, in the requirements section, we have put a "max_intron" parameter which has the effect of discarding completely any transcript that has an intron over a certain size (please refer to the documentation for more details on scoring files). That parameter is set so that any transcript with intron sizes greater than 150,000 or shorter than 20 bps (for C. elegans) will be excluded a priori. Modifying that parameter will leave transcripts with such unusual introns in your annotation.

I would not necessarily recommend such a step, however, if this is the only way of retaining loci in those regions, I would not see many alternatives.

lucventurini commented 5 years ago

Closing for inactivity. Please reply to this for further reopening.