EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Fast ORF caller for mikado #337

Closed urmi-21 closed 3 years ago

urmi-21 commented 3 years ago

Hi, I am opening this thread to discuss the role of ORF data in the mikado pipeline. The Mikado tutorial recommends transdecoder to find potential ORFs in the data. The first issue with transdecoder is that it is extremely slow. The second issue is that it may NOT be well suited for identifying transcripts coding for novel peptides, as it relies on a log-likelihood model tuned to well-conserved genes. The third issue is that transdecoder's documentation is sparse and its output is not always clear and intuitive (no flexibility or control over ORF searching).

To solve these issues I wrote my own ORF caller, orfipy, to find potential ORFs in transcriptome data. orfipy is extremely fast even on very big transcriptomes (500 times faster than transdecoder for the example below; 3-5 times faster than getorf) and can return many types of ORFs as desired by the user. It provides a number of options and hence more flexibility; e.g., the user can ask for just the longest ORF in each transcript.

orfipy can be used directly with mikado, as I have provided an option to output in bed12 format, the same format as TransDecoder.Predict (example below).

A small example using maize data shows that using just complete ORFs, i.e. ORFs in a transcript bounded by ATG and a stop codon (TAG, TAA, TGA), can be as good as using transdecoder. I think orfipy will be very helpful to mikado users and make their pipelines finish faster, especially when they are dealing with large amounts of data.
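For readers unfamiliar with the term, the "complete ORF" definition used above (an ATG through to an in-frame stop, stop codon included) can be written down in a few lines. This is only a sketch of the definition, not orfipy's actual implementation:

```python
# Minimal illustration of "complete" ORFs: ATG .. in-frame stop (TAA/TAG/TGA),
# scanned on the forward strand in all three frames.
# This is NOT orfipy's implementation, just a sketch of the definition.

STOPS = {"TAA", "TAG", "TGA"}

def complete_orfs(seq, min_len=0):
    """Yield (start, end) of complete ORFs; `end` is exclusive and
    includes the stop codon."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOPS:
                if i + 3 - start >= min_len:
                    yield (start, i + 3)
                start = None

print(list(complete_orfs("AAATGAAATAAGG")))  # → [(2, 11)], i.e. ATG AAA TAA
```

orfipy exposes this kind of choice (which start codons, whether to include the stop, minimum length) as command-line options, as in step 5a below.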

@lucventurini I would like to know your perspective on this. I will really appreciate it if you could also include orfipy in mikado guide to let users have a much faster ORF caller (if you agree with my point). Please let me know what you think and if you have any questions.

A small reproducible example

TL;DR: Using orfipy instead of transdecoder gave slightly better results (compared to the reference annotation). orfipy is ~500 times faster, and mikado retained more transcripts when provided with orfipy ORFs. Many of these transcripts were in intergenic regions and could be possible de-novo genes.

I used RNA-Seq from maize (SRR3098744, SRR3098746) and assembled the transcriptome. Steps to reproduce this are mentioned here. After getting the BAM and GTF files, perform the following steps:

  1. Merge the bam files: sambamba merge -t 27 maize_merged.bam SRR3098744.bam SRR3098746.bam
  2. Get junctions from portcullis: portcullis full --threads 28 --output portcullis_out Zea_mays.B73_RefGen_v4.dna.toplevel.1_10.fa maize_merged.bam
  3. Mikado configure: mikado configure --list list.txt --reference Zea_mays.B73_RefGen_v4.dna.toplevel.1_10.fa --mode nosplit --scoring plant.yaml --junctions /portcullis_filtered.pass.junctions.bed --seed 123 -t 25 --minimum-cdna-length 100 mkconf.yaml
  4. Mikado prepare: mikado prepare --json-conf mkconf.yaml -m 100 -p 1 --seed 123
  5a. Get ORFs from orfipy: orfipy mikado_prepared.fasta --min 100 --outdir orfipy_out --bed12 orfs.bed --include-stop --start ATG
  5b. Get ORFs from transdecoder: TransDecoder.LongOrfs -t mikado_prepared.fasta -m 33; TransDecoder.Predict -t mikado_prepared.fasta
  6a. Run serialise and pick using orfipy ORFs: mikado serialise --json-conf mkconf.yaml --orfs orfipy_out/orfs.bed -nsa -p 25 --max-objects 1000000 --seed 123; mikado pick --json-conf mkconf.yaml --procs 25 --shm --mode nosplit --seed 123 [move the serialise and pick output files to a different folder]
  6b. Run serialise and pick using transdecoder ORFs: mikado serialise --json-conf mkconf.yaml --orfs mikado_prepared.fasta.transdecoder.bed -nsa -p 25 --max-objects 1000000 --seed 123; mikado pick --json-conf mkconf.yaml --procs 25 --shm --mode nosplit --seed 123
  7. Run mikado compare for both versions

Number of ORFs

orfipy reported 1,151,687 ORFs. Transdecoder selected only 293,184 ORFs.

Runtimes for orfipy and transdecoder

Runtime for orfipy (step 5a) was 11 seconds. Runtime for transdecoder (step 5b) was 1 hour and 20 minutes. Runtime of steps 6a and 6b did not differ significantly, even though in 6a a lot more ORF data was loaded by mikado.

Comparison with Maize reference

The comparison stats generated by mikado compare were very similar. Examining the codes present in the .tmap files revealed that, using orfipy ORFs, mikado recovered more multi-exonic and mono-exonic transcripts with a complete match. Using orfipy ORFs also retained more transcripts labelled as u, some of which could be potentially novel peptide-coding transcripts.

orfipy.tmap top 5

21954 =
11590 j
5233 _
1367 J
1313 u

transdecoder.tmap top 5

21587 =
11434 j
5035 _
1326 J
1049 u
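Tallies like the ones above can be reproduced from a .tmap file with a short script. This is a generic sketch, assuming the .tmap header names the class-code column ccode (check the header of your own file):

```python
from collections import Counter
import csv, io

# Count class codes in a mikado compare .tmap file. Column positions can
# vary between versions, so locate the "ccode" column from the header line
# rather than hard-coding an index.
def ccode_counts(tmap_text):
    reader = csv.reader(io.StringIO(tmap_text), delimiter="\t")
    header = next(reader)
    idx = header.index("ccode")
    return Counter(row[idx] for row in reader if row)

# Tiny made-up example (real .tmap files have more columns):
example = "ref_id\tccode\ttid\ngene1.1\t=\ttx1\ngene2.1\tj\ttx2\ngene2.1\tj\ttx3\n"
print(ccode_counts(example).most_common())  # → [('j', 2), ('=', 1)]
```
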

lucventurini commented 3 years ago

Hi @urmi-21

I had a look at the project and it really looks like good, clean code. And thank you for making it compatible with mikado from the start!

Now, my main concern here is that the tools we use in our pipeline (prodigal or transdecoder) do not just call all possible ORFs, they also perform an additional scoring step to discard spurious ORFs.

A key detail that I'm not clear about here, in particular, is what happens in orfipy when there are multiple potential ORF translations on different frames. This matters because Mikado's internal algorithm, when there are competing available ORFs for a transcript, is to simply order by completeness then length. This is OK, I think, as long as the ORF caller does some filtering beforehand.
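That ordering can be sketched in a few lines; the tuple layout below is purely illustrative and is not Mikado's internal data structure:

```python
# Illustrative sketch of "order by completeness, then length" for competing
# ORFs on one transcript. The (start, end, has_start, has_stop) tuples are
# hypothetical, not Mikado's internal representation.
def best_orf(orfs):
    """orfs: list of (start, end, has_start_codon, has_stop_codon)."""
    def key(orf):
        start, end, has_start, has_stop = orf
        complete = has_start and has_stop
        return (complete, end - start)
    return max(orfs, key=key)

orfs = [
    (0, 300, False, True),   # longer, but 5' partial
    (50, 230, True, True),   # shorter, but complete
]
print(best_orf(orfs))  # → (50, 230, True, True): completeness wins over length
```

The point of the concern above: with no prior filtering, such a tie-break may promote a long spurious ORF over a shorter genuine one.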

As you note in your (very detailed, thank you!) walkthrough, mikado can use orfipy very well, so it might not matter in the end. However, I think it's a point that deserves consideration and maybe further study?

I will have a chat with @swarbred and @ljyanesm and we'll see whether and how to insert orfipy in our pipelines. For the time being, thank you!

lucventurini commented 3 years ago

PS: you might be interested in having a look at Prodigal. In our experience Prodigal performs worse than TransDecoder in terms of correctly identifying the proper ORFs (unsurprisingly, given that it was developed for Bacteria) but it is indeed orders of magnitude faster.

To be very clear: I am personally excited to see a tool that improves so much on the other "trivial" ORF callers out there, and if it were possible to enhance its performance by adding a self-training and selection module, à la transdecoder but faster, it would be a great enhancement for our pipelines!

evewurtele commented 3 years ago

@urmi-21 Thank you for developing orfipy. I think many define an ORF as running from a start to a stop codon, and (incorrectly) assume that is what the various ORF predictors are calling. orfipy makes calling ORFs a transparent process, clarifying what type of ORF is being identified and enabling flexibility in choosing how ORFs are called.

urmi-21 commented 3 years ago

@lucventurini You have raised valid concerns:

Now, my main concern here is that the tools we use in our pipeline (prodigal or transdecoder) do not just call all possible ORFs, they also perform an additional scoring step to discard spurious ORFs.

The main idea is that the available tools may discard novel ORFs such as de-novo ORFs or small ORFs, as these ORFs have different features from the known, well-conserved ones. One reason I don't want to filter my ORFs is that I am primarily interested in identifying de-novo genes. Besides, if you see expression of the same locus in multiple samples, there is a higher chance that it is truly expressed by the cell.

A key detail that I'm not clear about here, in particular, is what happens in orfipy when there are multiple potential ORF translations on different frames. This matters because Mikado's internal algorithm when there are competing available ORFs for a transcript is to simply order by completeness then length.

I also thought about this, as I have no knowledge of how mikado is choosing the best ORF internally. But what I see from transdecoder is that it also outputs multiple ORFs per transcript. The orfipy ORFs are a superset of the transdecoder ORFs. So for now I kind of rely on mikado to choose the best one (and it chooses the longest complete ORFs, which makes sense). orfipy can directly output the longest ORF per transcript if asked to do so.

PS: you might be interested in having a look at Prodigal. In our experience Prodigal performs worse than TransDecoder in terms of correctly identifying the proper ORFs (unsurprisingly, given that it was developed for Bacteria) but it is indeed orders of magnitude faster.

I think I used Prodigal before but it didn't work very well in terms of results.

To be very clear: I am personally excited to see a tool that improves so much on the other "trivial" ORF callers out there, and if it were possible to enhance its performance by adding a self-training and selection module, à la transdecoder but faster, it would be a great enhancement for our pipelines!

The real question is: do we really want to further judge ORFs that have evidence of expression? There are a number of papers now which have reported expression and translation of novel short peptides. All of these would be labeled as "non-coding noise" by conventional gene prediction software. I agree there may still be room for errors, spurious transcription, contamination etc., but since we can now deal with so much data, it is worth looking at all the possible expressed ORFs.

Some interesting papers regarding novel genes:
- Maximizing prediction of orphan genes in assembled genomes
- Genetic Novelty: How new genes are born
- De novo gene birth
- Long non-coding RNAs as a source of new peptides

Now, overall I agree that some people might want to be very stringent and select ORFs based on filtering criteria, but others looking for expression of novel peptide-coding genes may want to retain even those which might be discarded by transdecoder. I am looking forward to what you discuss with your team. BTW I think another great way to remove spurious transcripts could be to somehow weight transcripts based on their expression estimates across samples and get rid of the ones sparsely expressed, or as defined by the user.

lucventurini commented 3 years ago

I also thought about this, as I have no knowledge of how mikado is choosing the best ORF internally. But what I see from transdecoder is that it also outputs multiple ORFs per transcript. The orfipy ORFs are a superset of the transdecoder ORFs. So for now I kind of rely on mikado to choose the best one (and it chooses the longest complete ORFs, which makes sense). orfipy can directly output the longest ORF per transcript if asked to do so.

Mikado actually needs multiple ORFs per transcript. In many cases, especially with Illumina data, RNA-Seq can kick up chimeric transcripts or intron-retention events (which then have multiple ORFs due to stops within the retained intron). Mikado actively uses this information to break up chimeric transcripts and recover as many real transcripts as possible.
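As a toy illustration of the idea (not Mikado's actual algorithm, which also weighs homology evidence when deciding whether to split): a transcript whose ORFs fall into two or more non-overlapping groups is a candidate chimera.

```python
# Toy illustration: a transcript carrying two or more groups of mutually
# non-overlapping ORFs hints at a possible chimera. Mikado's real splitting
# logic is more involved (it can also use BLAST hits to decide).
def non_overlapping_groups(orfs):
    """orfs: list of (start, end) on the transcript; returns merged groups
    of overlapping ORFs, sorted by position."""
    groups = []
    for start, end in sorted(orfs):
        if groups and start < groups[-1][1]:
            # Overlaps the previous group: merge.
            groups[-1] = (groups[-1][0], max(groups[-1][1], end))
        else:
            groups.append((start, end))
    return groups

# Two ORFs far apart on one transcript -> two groups -> split candidate:
print(non_overlapping_groups([(10, 400), (600, 900)]))  # → [(10, 400), (600, 900)]
# Overlapping ORFs (e.g. different frames) collapse into one group:
print(non_overlapping_groups([(10, 400), (350, 700)]))  # → [(10, 700)]
```

This is why feeding Mikado only one ORF per transcript would throw away exactly the signal it uses for this step.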

So having orfipy report only one ORF per transcript would not be desirable in our view. On the other hand, selecting for the longest complete ORF on a transcript will probably NOT be correct in many cases.

I think I used Prodigal before but it didn't work very well in terms of results.

Agreed, we mostly use it because we need to have something faster than TransDecoder for large projects. Although @ljyanesm 's parallelisation work (unpublished) seems to have ameliorated the situation with TransDecoder.

Now, overall I agree that some people might want to be very stringent and select ORFs based on filtering criteria, but others looking for expression of novel peptide-coding genes may want to retain even those which might be discarded by transdecoder. I am looking forward to what you discuss with your team.

It depends on what you want to achieve. Mikado is a tool developed to automatise and formalise many of the processes that would be done manually by human annotators, and its purpose is not, and never has been, to provide a comprehensive catalogue of all possible transcripts in an organism. The purpose of the tool is to automatically create a semi-curated catalogue, which would exclude transcripts that are real (e.g., retained intron events) but are in general of little interest, as they are likely to be discarded by the cell itself through its clean-up mechanisms. This is also why Mikado tries to identify and mop up short transcripts around selected loci ... we are, in short, trying to identify UTR run-ons, genomic contamination, and the like, and remove them.

On the other hand, you are right, many experimenters are interested in getting these transcripts out.

BTW I think another great way to remove spurious transcripts could be to somehow weight transcripts based on their expression estimates across samples and get rid of the ones sparsely expressed, or as defined by the user.

We do this in minos, a pipeline which uses Mikado to this end under the hood.

urmi-21 commented 3 years ago

Hi @lucventurini, I wanted to get back to you sooner but I was really busy with wrapping up the semester.

Thanks for the explanation on mikado requiring multiple ORFs. I am guessing that, to identify chimeric or fusion transcripts, mikado relies on blast results to identify the best-suited ORFs. I am really interested to know the details in order to fully fine-tune my mikado pipeline. A few things are still not clear to me (sorry, I am still reading the documentation): if there are no ORFs, how will mikado score those transcripts? If there are multiple ORFs, how is the best one selected? If I don't provide blast XML, how does that affect the pipeline?

As I reported above and I have seen before, there is not a huge difference in terms of picked transcripts when using transdecoder or using orfipy (but there is a huge difference in runtime). One thing to note is that I ran the example in no-split mode, without blast hits. Maybe in another mode, there will be a difference?

It would be great if you can make the user aware of these choices so they can best choose what kind of transcripts they want while also saving computational time. In my experience, where my goal was to identify transcripts coding for potential novel peptides across thousands of samples, orfipy worked so much better than transdecoder (both in terms of speed and accuracy). I was in a position that I could compare results with the reference annotation and I found results (from orfipy and transdecoder) to be very similar. As the developer, you would know more about what differences to expect.

I understand that your intention with mikado was to provide a good set of transcripts to improve annotations. From a user's perspective, it would be great to be aware of the different choices one can make to fine-tune the predictions.

I will take a look at minos, thanks for sharing.

PS: I found this in the documentation (https://mikado.readthedocs.io/en/latest/Usage/Serialise.html, the yellow panel describing BED12 files): "ORF files, this means that the CDS should start at the 7th column and end at the 8th (stop-codon exclusive, like in TransDecoder)"

TransDecoder.Predict output BED files include the stop codon, so I guess it should say stop-codon inclusive?
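The inclusive/exclusive question is easy to check arithmetically from columns 7 and 8 of a BED12 line (thickStart/thickEnd, 0-based half-open per the BED spec). The example values below are made up:

```python
# BED12 thick coordinates are 0-based, half-open, so thickEnd - thickStart
# is the CDS length in bp. For a peptide of N amino acids, a stop-codon-
# INCLUSIVE ORF spans 3*(N + 1) bp; an exclusive one spans 3*N bp.
# Example line with made-up values (only the first 8 columns shown):
fields = "chr1\t100\t1000\tORF.1\t0\t+\t150\t450".split("\t")
thick_start, thick_end = int(fields[6]), int(fields[7])
cds_bp = thick_end - thick_start
print(cds_bp, "bp =", cds_bp // 3 - 1, "aa if the stop codon is included")
# → 300 bp = 99 aa if the stop codon is included (100 aa if excluded)
```

So comparing the peptide length reported by the ORF caller against (thickEnd - thickStart) / 3 tells you which convention a given BED file follows.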

xiekunwhy commented 2 years ago

Hi all,

Do you think transuite (https://github.com/anonconda/TranSuite) is a good alternative to transdecoder?

Best, Kun