EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0
97 stars 18 forks

Question about using Mikado with lncRNA annotations. #394

Closed cc-prolix closed 3 years ago

cc-prolix commented 3 years ago

Hello!

I have a question regarding Mikado: I have a set of novel lncRNAs generated through StringTie and want to merge them with the current NONCODE annotation. However, the workflow in the Mikado tutorial uses ORF information for the transcripts from TransDecoder. Could I still use Mikado to merge my annotations, even though lncRNAs lack a significant ORF?

Thanks for your help!

swarbred commented 3 years ago

Sorry, I'm not clear exactly what you mean by NONCODE annotation.

This is what I'm assuming

You have a subset of StringTie models that you have classified as lncRNA, plus some existing ncRNAs from another source, and you want to simply bring these together into a single annotation.

If that is the case, yes, you could run Mikado with no ORF information. Mikado would group transcripts from both sets into subloci, select a primary transcript from across all models at each sublocus, and then return valid alternative splice variants of the primary transcript. You would want to check that the scoring file and the splicing part of the config score transcripts in a way that makes sense for your project: without ORF/BLAST/Portcullis data you are basically scoring models on cDNA length and exon attributes (unless you provide additional external metrics).
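To make "scoring on cDNA length and exon attributes" a bit more concrete, here is a toy sketch in Python. It is not Mikado's actual scoring, which is defined by the scoring file and is more involved; the `Model` class, the `pick_primary` helper, the transcript names and the coordinates are all hypothetical and only illustrate the idea of ranking models within one sublocus.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Model:
    """Hypothetical stand-in for a transcript model (not a Mikado class)."""
    name: str
    exons: List[Tuple[int, int]]  # 1-based, inclusive coordinates (GTF-style)

    @property
    def cdna_length(self) -> int:
        # Sum of exon lengths; introns are not counted.
        return sum(end - start + 1 for start, end in self.exons)

    @property
    def exon_num(self) -> int:
        return len(self.exons)


def pick_primary(models: List[Model]) -> Model:
    # Toy rule (an assumption): prefer the longest cDNA, then more exons.
    return max(models, key=lambda m: (m.cdna_length, m.exon_num))


# Two made-up models falling into the same sublocus.
sublocus = [
    Model("stringtie.1", [(100, 300), (500, 900)]),
    Model("noncode.1", [(100, 300), (500, 700)]),
]

primary = pick_primary(sublocus)
print(primary.name, primary.cdna_length)
```

With only these two attributes, the longer StringTie model wins, which is why it is worth checking that the scoring file reflects what you actually want to prioritise.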

If you have a mix of coding and non-coding models as input to Mikado, i.e. you have StringTie models that you have called ORFs for, then you may want to split these into separate input files and prioritise them by applying a base score in the list.txt file that defines the input transcripts.

We used Mikado to integrate potential lncRNAs we had called with existing protein-coding genes. We had two sets of lncRNAs. The first was high confidence, so we gave it a score of e.g. 1000 so that these models would be selected over any protein-coding models in the same sublocus. The second we prioritised below the protein-coding models: we gave the coding models a score of 100, effectively preventing these "lncRNAs" from being selected over a protein-coding model if they fall into the same sublocus.
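As a rough sketch of that setup, the Python snippet below builds the rows of a tab-separated list.txt. It assumes the column order is input file, label, strandedness flag, then base score; please check this against the Mikado documentation for your version. The file names and labels are hypothetical, the `True` strandedness values are an assumption about the libraries, and the score of 0 for the low-confidence set is only a placeholder (the 1000 and 100 come from the description above).

```python
import csv
import io

# Hypothetical inputs: (path, label, stranded, base score).
# 1000 and 100 follow the example above; 0 is an assumed placeholder.
inputs = [
    ("high_conf_lncRNA.gtf", "hc_lnc", True, 1000),
    ("coding_models.gtf", "coding", True, 100),
    ("low_conf_lncRNA.gtf", "lc_lnc", True, 0),
]


def build_list_file(rows):
    """Return the list file content as one tab-separated string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for path, label, stranded, score in rows:
        writer.writerow([path, label, stranded, score])
    return buffer.getvalue()


list_txt = build_list_file(inputs)
print(list_txt, end="")

# To save it for use as the input list:
# with open("list.txt", "w") as handle:
#     handle.write(list_txt)
```

Keeping each confidence class in its own input file is what makes it possible to give each one a different base score.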

cc-prolix commented 3 years ago

Hey, sorry, I should have provided a little bit more information: NONCODE is an integrated database dedicated to lncRNAs, and they provide annotation files for their collected data... Yes, I want to merge my StringTie transcripts, which I classified as lncRNAs, with the existing NONCODE annotation.

Thank you very much for the info, that helped a lot =)