EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

question about transcript split #302

Closed lijing28101 closed 4 years ago

lijing28101 commented 4 years ago

Hi, I'm running Mikado version 2 (branch 270) and have found a problem with my output. Since I got many (about 30%) incomplete CDSs from Mikado, I checked the TransDecoder output and kept only transcripts with complete CDSs for mikado serialise and pick. But I still get 25% of transcripts with an incomplete CDS, and over 50% for mono-exonic transcripts. I compared the structures from Mikado with my original structures from StringTie and TransDecoder, and found that Mikado trims some exons, which causes the incomplete CDSs.

For example, for a 3prime_partial case: [image]

I'm not sure whether this is caused by transcript splitting during pick. Do you have any idea how to avoid that?

Thanks, Jing

lucventurini commented 4 years ago

Hi @lijing28101

Some possibilities:

I hope this helps. It is a bit puzzling. I would go for the last solution (mode stringent) as a first port of call.

lijing28101 commented 4 years ago

Hi @lucventurini ,

Thanks for your suggestion. I want to identify orphan genes, which may be very short. Can I change the cutoff for loading ORFs (from 250 nt to 150 nt)? If two ORFs overlap, how does Mikado determine which ORF will be loaded? By length or something else? I've tried nosplit mode, and most of the CDSs are complete. But if a transcript has several non-overlapping ORFs, how does Mikado determine the CDS?

Thanks, Jing

lucventurini commented 4 years ago

Hi @lijing28101

> I want to identify orphan genes, which may be very short. Can I change the cutoff for loading ORFs (from 250 nt to 150 nt)?

Yes, it is possible. In the configuration file, under pick.orf_loading, you can find and modify the following:

minimal_orf_length = 50
minimal_secondary_orf_length = 200  # Apologies, this (200) is the real default value, not 250

> If two ORFs overlap, how does Mikado determine which ORF will be loaded? By length or something else?

The default is by length: the longest ORF is kept. Ties are broken by checking whether the ORF is complete. Admittedly this is not the most refined method. We do not look at completeness first because, during the original development, we found many cases where ORF finders would report, for incomplete transcripts, a spurious short internal ORF in one of the incorrect frames.
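The rule described above can be sketched in a few lines of Python. This is only an illustration of "longest first, completeness as tie-breaker", not Mikado's actual code; the `Orf` class and `pick_among_overlapping` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical ORF record; coordinates are transcript-relative and inclusive."""
    start: int
    end: int
    has_start_codon: bool
    has_stop_codon: bool

    @property
    def length(self) -> int:
        return self.end - self.start + 1

    @property
    def is_complete(self) -> bool:
        return self.has_start_codon and self.has_stop_codon


def pick_among_overlapping(orfs):
    """Prefer the longest ORF; completeness is only consulted when lengths tie."""
    return max(orfs, key=lambda orf: (orf.length, orf.is_complete))


long_partial = Orf(1, 900, has_start_codon=True, has_stop_codon=False)
short_complete = Orf(301, 600, has_start_codon=True, has_stop_codon=True)
tied_complete = Orf(4, 903, has_start_codon=True, has_stop_codon=True)

# Length wins over completeness: the 900 nt partial ORF beats the 300 nt complete one.
print(pick_among_overlapping([long_partial, short_complete]))
# Equal lengths (900 nt each): the complete ORF is preferred.
print(pick_among_overlapping([long_partial, tied_complete]))
```

The first call shows why filtering for complete CDSs beforehand does not guarantee complete CDSs afterwards whenever a longer incomplete ORF is also loaded.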

> I've tried nosplit mode, and most of the CDSs are complete. But if a transcript has several non-overlapping ORFs, how does Mikado determine the CDS?

Mikado will keep all non-overlapping CDSs. However, it will only report the longest in the final output. This behaviour can be changed in pick.output_format:

report_all_orfs = false  # Switch this to true
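As a rough illustration of that behaviour (keep every non-overlapping ORF, report only the longest unless `report_all_orfs` is switched on), here is a small Python sketch. The greedy longest-first strategy and the function names are assumptions made for the example; Mikado's internal logic may differ.

```python
def keep_non_overlapping(orfs):
    """Walk the ORFs longest-first and keep each one that overlaps none already kept.

    ORFs are (start, end) tuples with inclusive coordinates (hypothetical format).
    """
    kept = []
    for start, end in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if all(end < k_start or start > k_end for k_start, k_end in kept):
            kept.append((start, end))
    return kept


def orfs_to_report(orfs, report_all_orfs=False):
    """Mimic the effect of the report_all_orfs switch on what gets written out."""
    kept = keep_non_overlapping(orfs)
    # kept is ordered longest-first, so kept[:1] is the longest ORF (or empty).
    return sorted(kept) if report_all_orfs else kept[:1]


orfs = [(1, 300), (250, 1000), (1200, 1500)]
print(orfs_to_report(orfs))                        # only the longest ORF
print(orfs_to_report(orfs, report_all_orfs=True))  # every retained ORF
```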

I have to say that this is indeed a weakness of Mikado - the tool kinda relies on the ORFs provided being correct. We do not do anything clever internally to validate and choose amongst the different options.

So a better way of going about this might be to aid TransDecoder by giving it BLASTP data for the ORFs it finds in its LongOrfs step. If you are not aware of how to do it, the TransDecoder wiki has detailed instructions.
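For orientation, the workflow on the TransDecoder wiki boils down to three commands: find long ORFs, search them against a protein database, then predict while retaining ORFs with hits. The Python sketch below only builds and prints those command lines; it does not run them. The file names, the database name, the thread count and the `transdecoder_with_blastp` helper are placeholders, and the name of the intermediate directory can vary between TransDecoder versions, so please check it against your installation.

```python
import shlex


def transdecoder_with_blastp(transcripts, protein_db, threads=4):
    """Build (not run) the three command lines for a BLASTP-assisted TransDecoder run."""
    # Assumed location of the LongOrfs peptide output; verify for your version.
    longest_orfs = f"{transcripts}.transdecoder_dir/longest_orfs.pep"
    hits = "blastp.outfmt6"  # placeholder name for the tabular BLASTP output
    return [
        ["TransDecoder.LongOrfs", "-t", transcripts],
        ["blastp", "-query", longest_orfs, "-db", protein_db,
         "-max_target_seqs", "1", "-outfmt", "6", "-evalue", "1e-5",
         "-num_threads", str(threads), "-out", hits],
        ["TransDecoder.Predict", "-t", transcripts, "--retain_blastp_hits", hits],
    ]


for command in transdecoder_with_blastp("transcripts.fasta", "uniprot_sprot.fasta"):
    print(shlex.join(command))
```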

I hope this helps.

lijing28101 commented 4 years ago

Hi @lucventurini, I didn't see `pick.orf_loading` and `pick.orf_format` in either the configuration file or the scoring file. Do I need to add them myself?

lucventurini commented 4 years ago

> Hi @lucventurini, I didn't see `pick.orf_loading` and `pick.orf_format` in either the configuration file or the scoring file. Do I need to add them myself?

Apologies, I was not very clear on my part. First off: all the fields I mentioned are in the configuration file, not the scoring file. The fields you are looking for are:

The . above (e.g. pick.orf_loading) was to indicate the hierarchical location. In case the fields are not present, please insert them in the correct location in the configuration file. Again apologies, I understand this is not as user-friendly as a command line switch. I will consider adding them to the interface of pick and/or configure.
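To make the dotted notation concrete, here is a small Python sketch that treats the configuration as nested dictionaries and inserts a value at a dotted path, creating missing sections on the way. The `config` dictionary is a stand-in for a parsed configuration file and `set_by_dotted_path` is a hypothetical helper, not part of Mikado; the field names are the ones mentioned in this thread.

```python
def set_by_dotted_path(config, dotted_key, value):
    """Set config[a][b][c] = value for a key written as 'a.b.c'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create the intermediate section if the file does not have it yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Stand-in for a configuration that lacks the orf_loading section.
config = {"pick": {"output_format": {"report_all_orfs": False}}}

set_by_dotted_path(config, "pick.orf_loading.minimal_orf_length", 50)
set_by_dotted_path(config, "pick.output_format.report_all_orfs", True)
print(config)
```

In other words, `pick.orf_loading.minimal_orf_length` means the `minimal_orf_length` field inside the `orf_loading` section, which itself sits inside the `pick` section.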

lucventurini commented 4 years ago

Closing for now.