EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data * and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0
94 stars 18 forks source link

Filtering excessive resulting transcripts #343

Closed 14zac2 closed 2 years ago

14zac2 commented 3 years ago

Hello!

Sorry for all the questions, I just really appreciate your pipeline, especially for annotating the genomes of poorly characterized species. It's great to be able to combine so much evidence!

I have just completed the Mikado pipeline using evidence form 6 annotation tools, Transdecoder, Portcullis, and BLAST. Unfortunately I ended up with about twice as many genes as I would have expected (~43,000). I used mostly default parameters and the mammalian scoring file. Now I am looking for the best way to filter lower confidence hits. Is there any recommended way of doing this? From reading the docs I can only see modifications to the scoring file as a Mikado-designed solution, but I'm not too confident in my tweaking.

Thanks so much for your time and advice - in the meantime, I'll certainly keep attempting to tweak the parameters.

Zoe

14zac2 commented 3 years ago

Hi - just following up, as I realize how naive my question sounded. Indeed when I posted it, I had not yet located these sections of your tutorial such as this which have helped immensely in filtering my genes during Mikado pick:

https://mikado.readthedocs.io/en/latest/Tutorial/Adapting/#case-study-2-noisy-rna-seq-data

However, after looking at the statistics from two closely related species and applying any possible filter to the scoring file to make gene detection more stringent during Mikado pick, I still find myself with 10,000 more genes than I would like in the resulting gff file. In my results, there are too many monoexonic genes and the UTRs are too short. I have attached my scoring file (test8.txt), because I feel like I have applied extremely stringent filters. In trying to get the general statistics that I am looking for (~20,000 genes, ~30,000 transcripts, and UTRs/exons per transcript/etc.) that make sense for my species, I am going to lose important, shorter genes.

I have determined that the pesky input source into Mikado that is adding a ton of genes to the gff output are Augustus gene predictions. This makes sense to me, because I'm sure both Augustus and Mikado priorities complete genes that make sense according to theory, but not necessarily from evidence. In trying to filter out more of these Augustus transcripts, I have lost nearly all of my input from StringTie, for instance.

At this point, I am going to run the Mikado pipeline without the ab inito models from Augustus. However, I noticed that you used Augustus in your paper, so perhaps you have experience in dealing with this issue, or have some other opinions on the influence of Augustus. I would really appreciate your insight.

Thanks again for your time, Zoe

lucventurini commented 3 years ago

Dear @14zac2

I wanted to reply this morning, and I fear that this answer will be a bit of a placeholder, for reasons I will make clear below.

You are right that using Mikado with ab initio models is liable to yield inflated results in terms of number of genes etc. This is because Mikado was created to analyse transcript assemblies that often do not look "nice" due to incomplete CDSs, retained introns, long UTRs etc. Ab initio programs such as Augustus are instead very good at returning transcripts that look "nice" for Mikado but that do not necessarily have great support in terms of evidence or biological plausibility.

For this reason, @cschu and @swarbred have developed a novel pipeline to integrate ab initio results with RNASeq analysis, using Mikado to perform the choice. The key of the pipeline is in augmenting the data available for Mikado with e.g. expression data, similarity to RNASeq models, etc.

The pipeline is available here: https://github.com/EI-CoreBioinformatics/minos

I have not yet used the pipeline myself, but @cschu and @swarbred might be able to help you out on this.

We do plan to make it into a paper and release it officially soon.

14zac2 commented 3 years ago

Hi @lucventurini - thank you so much for your honest response! I am satisfied, because you have confirmed my suspicions and I have sufficient evidence from other gtf/gff inputs to find a solid work-around. I'm very curious about Minos and appreciate your drawing my attention to it - I'll certainly look into it!

Thanks again for your feedback and advice which has been immensely helpful in my annotation project.

swarbred commented 3 years ago

Hi @14zac2

Sorry for the delayed reply, as @lucventurini has indicated Mikado's internal metrics will not on their own be suitable for predicted genes models. Mikado does allow you to provide external metrics which can be used for the scoring and filtering https://mikado.readthedocs.io/en/latest/Algorithms/?highlight=external#external-metrics . In my group we do successfully use external metrics and mikado to integrate / select gene models. We use mikado compare to generate F1 scores to show how well supported models are by assembled transcripts or cross species protein alignments and then use mikado with these scores (plus expression values and coding potential scores) and the normal internal metrics + portcullis junctions to select models.

As luca has indicated we have written this workflow into a pipeline (MINOS) that takes the models and the supporting evidence as GFF/GTF files it generates the external metrics (F1 scores, TPM, coding potential) for mikado and then runs the mikado steps + generates some stats, assigns a confidence label and qc related info. The pipeline is public but we haven't really fully released or tested it externally (the documentation is limited). If you were interested we could try to help you with getting this working.

As part of the process of testing/releasing MINOS potentially we could run your models through the pipeline here if you can provide the inputs, it would be helpful for us to get some feedback on the output (I wouldn't expect anything more than that) but I appreciate that might not work for you.

14zac2 commented 3 years ago

Hi @swarbred - not a problem, at all! Sure, I would be interested! Just as long as it doesn't add more than another week or two onto the annotation. But I am interested in this option because I am still having a bit of trouble with having too many monoCDS/monoexonic genes, which I think might be due to fragmented input from my RNA-seq data (98 bp reads). I was trying to use EvidenceModeler to parse through this info, but it seems to be a very memory and time intensive process. I have a gff file of ab initio predictions by Augustus, a blat file of spliced alignments from the RNA-seq data, and a Maker-generated file with evidence from ESTs, protein-homology and repeat masking. The RNA-seq gff file is quite massive, 101G.

How should I go about this? And thanks again for your time!

swarbred commented 3 years ago

Hi @14zac2

From your description we may be some distance apart from what you have generated and what would be a reasonable input into MINOS. MINOS-MIKADO will cherry pick from across the input models, it wont create novel models so if the input models have substantial issues the MINOS selected models will likely improve on the individual inputs but still not meet an acceptable quality.

We have in the past used EvidenceModeler if you have models from ab initio methods and more fragmented RNA-seq models it may be a better option than our pipeline. We found issues / bugs that led to whole genes being omitted and it didn't give us the control we wanted over what to include / exclude so we developed our own approaches.

Just as long as it doesn't add more than another week or two onto the annotation.

Realistically us running MINOs will take longer than this given we would need to get access to your files, assess what you have, add any additional inputs based on that assessment and run, tweek and rerun the pipeline with an xmas holiday on the horizon :-).

It sounds like you have:

RNA-Seq blat aligned transcripts (this is presumably de novo assembled RNA-Seq) Augustus ab initio prediction (i.e. without using evidence hints) Maker gene models (i.e. the final maker output) Maker aligned proteins Maker aligned ESTs Maker repeat annotation

Based on this and your comment about fragmented / monoexonic genes I wouldn't be confident that MINOS would generate a very high quality annotation and I suspect we would need to look at your RNA-Seq data and generate alternative transcript assemblies and create a MIkado run on these + align some suitable protein sets so that these could be used with your Maker and Augustus models.

You obviously have your own timeline you need to work to but my general advice is that it's worth spending the time getting the alternative gene sets and evidence alignments as good as you can.

One simple check you could do (if you haven't already done this) is to run BUSCO on the genome and on the proteins sequences derived from your Maker and Augustus models, the results from the protein sets should be better (i.e. more complete less missing genes) that the genome run. If they are not then that wouldn't reflect well on those gene sets.

Our basic :-) process is

1) align RNA-Seq reads per sample with Hisat2 2) Assemble RNA-seq per sample with stringtie2 and scallop 3) Run portcullis on the set of RNA-Seq bams 4) Run mikado with the stringtie, scallop and portcullis data (using a set of proteins for the homology scoring) 5) Align cross species proteins with spaln 6) Run mikado with the protein alignments 7) Use liftoff to transfer gene models from closely related species (i.e. same genus, family) 8) Multiple Augustus runs to generate evidence guided models (i.e. using the transcript and protein models) 9) Run MINOS to integrate and select from the models derived in 4, 6, 7, and 8

This is currently multiple pipelines/workflows and there are obviously steps/filtering around this but that's our approach.

Feel free to email me at david.swarbreck@earlham.ac.uk if you want to discuss further.