question about transcript split

lijing28101 commented 4 years ago

Hi, I'm running Mikado version 2 (branch 270) and found a problem with my output. Since about 30% of the CDSs I got from Mikado were incomplete, I checked the TransDecoder output and kept only transcripts with complete CDSs for mikado serialise and pick. But I still get 25% of transcripts with incomplete CDSs, and over 50% for mono-exonic transcripts. I compared the structures from Mikado with my original structures from StringTie and TransDecoder, and found that Mikado trims some exons, which causes the incomplete CDSs.

For example, for 3 prime_partial:

I'm not sure whether this is caused by transcript splitting during pick. Do you have any idea how to avoid that?

Thanks, Jing

lucventurini commented 4 years ago

Hi @lijing28101

Some possibilities:

- An important note: Mikado would never split a transcript that has only one ORF. Moreover, Mikado will only load non-overlapping ORFs onto a transcript (i.e. if two possible ORFs share the same portion of the transcript, Mikado will only load one). By default, any ORF after the primary must be at least 250 bp long. So I am wondering why there are so many transcripts with more than one ORF in your dataset.
- As for splitting: you can disable the splitting completely using the flag --mode nosplit when launching mikado pick (and/or change the value in the configuration file).
- For a less draconian solution, you can instruct mikado pick to use a stringent algorithm for splitting transcripts (--mode stringent). In this mode, Mikado will split transcripts if and only if the two (or more) sides of the transcript match different proteins. The other modes (permissive, lenient) are more aggressive.
- You can revise the BLAST dataset. It might be the case that the proteins you are using are too divergent or fragmented (this can happen with non-curated genome annotations), leading Mikado to overestimate the number of cases where two pieces of the same gene are taken to come from two "different" genes.

I hope this helps. It is a bit puzzling. I would go for the stringent mode (--mode stringent) as a first port of call.
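
To make the difference between the nosplit and stringent modes concrete, here is a minimal sketch of the decision as described above. It is not Mikado's actual code: the function name `should_split`, the input format (one set of BLAST target IDs per ORF, in transcript order) and the choice to compare neighbouring ORFs are all assumptions made for illustration, and the more aggressive modes are deliberately left out.

```python
from typing import List, Set


def should_split(orf_hits: List[Set[str]], mode: str = "stringent") -> bool:
    """Illustrative only: would a transcript with these ORFs be split?

    orf_hits holds, for each ORF in transcript order, the set of protein IDs
    it matched in BLAST (a hypothetical input format, not Mikado's own).
    """
    if mode == "nosplit" or len(orf_hits) < 2:
        # nosplit never splits; a transcript with one ORF is never split.
        return False
    if mode == "stringent":
        # Split only if two neighbouring ORFs both have hits
        # and those hits share no target protein.
        return any(
            bool(left) and bool(right) and not (left & right)
            for left, right in zip(orf_hits, orf_hits[1:])
        )
    raise ValueError(f"mode not covered by this sketch: {mode}")


# Two ORFs hitting different proteins: split in stringent mode.
print(should_split([{"P1"}, {"P2"}]))
# Two ORFs sharing a target: kept together.
print(should_split([{"P1"}, {"P1", "P3"}]))
```

Under this reading, an ORF without any BLAST hit never triggers a split in stringent mode, which is why a divergent or fragmented protein set matters so much for the outcome.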

lijing28101 commented 4 years ago

Hi @lucventurini ,

Thanks for your suggestion. I want to identify orphan genes, which may be very short. Can I change the cutoff for loading ORFs (from 250 nt to 150 nt)? If two ORFs overlap, how is it determined which ORF will be loaded? By length or something else? I've tried nosplit mode, and most of the CDSs are complete. But if a transcript has several non-overlapping ORFs, how does Mikado determine the CDS?

Thanks, Jing

lucventurini commented 4 years ago

Hi @lijing28101

> Since I want to identify orphan gene, which may very short. Can I change the cutoff for loading ORFs? (change 250 nt to 150 nt).

Yes, it is possible. In the configuration file, under pick.orf_loading, you can find and modify the following:

minimal_orf_length = 50
minimal_secondary_orf_length = 200  # Apologies, this (200) is the real default value, not 250

> If two ORFs overlap, how to determine which ORFs will be load? By length or something else?

The default is by length: the longest ORF will be kept. Ties are broken by looking at whether the ORF is complete or not. Admittedly this is not the most refined method. We do not look at completeness first because, during the original development, we found many cases where, for incomplete transcripts, ORF finders would locate a spurious, short internal ORF in one of the incorrect frames.
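
The loading rules described in this thread (longest ORF first, completeness only as a tie-breaker, no overlapping ORFs, and a minimum length for secondary ORFs) can be sketched roughly as follows. This is an illustration of the stated rules, not Mikado's implementation: the `Orf` class and `select_orfs` function are hypothetical names, and only the two length parameters and their defaults come from the configuration shown above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Orf:
    start: int      # 1-based start on the transcript
    end: int        # inclusive end on the transcript
    complete: bool  # True if the ORF has both a start and a stop codon

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def select_orfs(
    orfs: List[Orf],
    minimal_orf_length: int = 50,
    minimal_secondary_orf_length: int = 200,
) -> List[Orf]:
    """Keep non-overlapping ORFs, longest first (illustrative sketch)."""
    # Sort by length; completeness only breaks ties between equal lengths.
    ranked = sorted(orfs, key=lambda o: (o.length, o.complete), reverse=True)
    kept: List[Orf] = []
    for orf in ranked:
        # The first ORF kept is the primary; later ones are secondary.
        min_len = minimal_secondary_orf_length if kept else minimal_orf_length
        if orf.length < min_len:
            continue
        if any(orf.start <= k.end and k.start <= orf.end for k in kept):
            continue  # overlaps an ORF that was already loaded
        kept.append(orf)
    return kept


orfs = [Orf(1, 300, False), Orf(100, 399, True),
        Orf(500, 650, True), Orf(700, 1000, True)]
print([(o.start, o.end) for o in select_orfs(orfs)])
print(len(select_orfs(orfs, minimal_secondary_orf_length=150)))
```

In this toy input the 151 nt ORF at 500-650 is dropped with the default secondary cutoff and only loaded once the cutoff is lowered to 150, which is the situation that matters when looking for short orphan genes.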

> I've tried nosplit mode, most of CDS are complete. But if a transcript have several non-overlap ORFs, how does mikado to determine the CDS?

Mikado will keep all non-overlapping CDSs. However, it will only report the longest in the final output. This behaviour can be changed in pick.output_format:

report_all_orfs = false  # Switch this to true

I have to say that this is indeed a weakness of Mikado - the tool kinda relies on the ORFs provided being correct. We do not do anything clever internally to validate and choose amongst the different options.

So a better way of going about this might be to aid TransDecoder by giving it BLASTP data for the ORFs it finds in its LongOrfs step. If you are not aware of how to do it, the TransDecoder wiki has detailed instructions.

I hope this helps.

lijing28101 commented 4 years ago

Hi @lucventurini, I didn't see `pick.orf_loading` and `pick.orf_format` in either the configuration file or the scoring file. Do I need to add them myself?

lucventurini commented 4 years ago

> Hi @lucventurini , I didn't see pick.orf_loading and `pick.orf_format` in both configure file and score file. I need add them by myself?

Apologies, I was not very clear on my part. First off: all the fields I mentioned are in the configuration file, not the scoring file. The fields you are looking for are:

- section pick
  - section orf_loading under pick
    - minimal_orf_length
    - minimal_secondary_orf_length
  - section output_format under pick
    - report_all_orfs

The . above (e.g. pick.orf_loading) was to indicate the hierarchical location. In case the fields are not present, please insert them in the correct location in the configuration file. Again apologies, I understand this is not as user-friendly as a command line switch. I will consider adding them to the interface of pick and/or configure.
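
As an illustration of how the dotted names map onto nested sections, the sketch below builds the corresponding structure as a plain Python dictionary. The helper `set_dotted` is hypothetical and not part of Mikado; the sketch only shows where each field sits in the hierarchy, whatever file format the configuration itself is written in, and the value 150 is just the cutoff asked about earlier in this thread.

```python
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a value at a dotted path, creating missing sections on the way."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Descend into the section, creating it if it is not present yet.
        node = node.setdefault(name, {})
    node[leaf] = value


config: Dict[str, Any] = {"pick": {}}
set_dotted(config, "pick.orf_loading.minimal_orf_length", 50)
set_dotted(config, "pick.orf_loading.minimal_secondary_orf_length", 150)
set_dotted(config, "pick.output_format.report_all_orfs", True)
print(config)
```

The resulting nesting (pick, then orf_loading or output_format, then the field) is what needs to be reproduced by hand in the configuration file if the fields are missing.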

lucventurini commented 4 years ago

Closing for now.

EI-CoreBioinformatics / mikado

question about transcript split #302