EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Mikado v2 does not perform as well on *H. sapiens* data #225

Closed: lucventurini closed this issue 5 years ago

lucventurini commented 5 years ago

Currently, rerunning Mikado v2 on the H. sapiens data prepared for the article yields results that are much worse than the published ones. This is in stark contrast to the results obtained on the other three test species.

Potential culprits:

A key step will be to re-inspect the files on the EI cluster, when possible.

lucventurini commented 5 years ago

Hi @swarbred , @gemygk , I redid the analysis using Mikado v1.0.2 and the results were still much worse than those published.

This categorically excludes the possibility that the problem stems from regressions in the Mikado code, which is a relief. I have also excluded that the problem was caused by using the alternative "human.yaml" rather than "hsapiens_scoring.yaml".

There are only five possible explanations at this point:

I am working on hypothesis no. 1, i.e. wrong Portcullis junctions, for a very simple reason: when I inspect the RefMap, I can see clearly that the main issue is that ~40% of = class codes have switched to n, i.e. a UTR extension (specifically in the UTR, as I do see much better results when considering only the coding part).
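For reference, the class codes can be tallied with something along these lines. This is only a minimal sketch, not the exact commands I ran: it assumes the compare output is the usual tab-separated `.refmap` file with a class-code column named `ccode` (worth double-checking against the actual header), and the file name is hypothetical.

```python
#!/usr/bin/env python3
"""Tally class codes from a Mikado compare .refmap file.

Assumptions: the refmap is tab-separated with a header, and the class
code per reference transcript is in a column named "ccode"; the file
name below is hypothetical.
"""

import csv
from collections import Counter

refmap = "compare.refmap"  # hypothetical path

counts = Counter()
with open(refmap, newline="") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        counts[row["ccode"]] += 1

total = sum(counts.values())
for code, count in counts.most_common():
    print(f"{code}\t{count}\t{100 * count / total:.1f}%")
```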

So I am now testing Mikado using only the Portcullis junctions that are actually present in the reference. If my hypothesis is right, I should see a massive increase in the number of correct gene models.
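The junction filtering itself can be sketched roughly as below. Again this is only an illustrative sketch, not the actual script: it assumes both the Portcullis junctions and the reference introns are available as plain BED intervals of intron coordinates (which is not how Portcullis encodes its junctions by default), and both file names are hypothetical.

```python
#!/usr/bin/env python3
"""Keep only junctions whose intron coordinates exactly match the reference.

Assumptions: both inputs are plain BED files of intron coordinates
(chrom, start, end, name, score, strand); file names are hypothetical.
"""

def load_introns(path):
    """Load (chrom, start, end, strand) tuples from a BED file."""
    introns = set()
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track")):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            strand = fields[5] if len(fields) > 5 else "."
            introns.add((chrom, start, end, strand))
    return introns

reference = load_introns("reference_introns.bed")  # hypothetical path

# Write out only the junctions that exactly match a reference intron.
with open("portcullis_junctions.bed") as handle, \
        open("portcullis.filtered.bed", "w") as out:
    for line in handle:
        if not line.strip() or line.startswith(("#", "track")):
            out.write(line)
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]), int(fields[2]),
               fields[5] if len(fields) > 5 else ".")
        if key in reference:
            out.write(line)
```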

lucventurini commented 5 years ago

Update: the Portcullis junctions were not the problem. The only thing left is to re-inspect the files on the EI cluster to pinpoint the origin of the discrepancy.

swarbred commented 5 years ago

@lucventurini

Can I clarify the issue?

You have tested v1.0.2 and v2 using what you believe are the same input and config files as the published results, and for human the new results are worse in both Mikado versions?

Firstly, how do these compare, i.e. how much worse are they than the published results?

What do the Arabidopsis results look like for the same comparison?

We did put some specific changes into the configuration that we did not document accurately and that are not reflected in the code or the article.

That's possible. While it would be good to be able to reproduce the human results in the paper, I'm not that concerned. We spent time adjusting the human run, as both the reference annotation and the input reads had characteristics that required us to adjust the config for human in a way that would not necessarily be suitable for a generic Mikado run. I was fine with that, as this is really how Mikado is designed to be run, i.e. we have some default configs that represent a good starting place, but for individual projects I expect to sometimes have to adjust these based on running pick and assessing the results.

My second query is on the new 1.0.2 and v2 runs: how do they compare with each other, i.e. is the v1.0.2 run giving "better" results (when the config is the same)? If so, can you share the mikado compare results?

Again, is this an issue only on human, or on Arabidopsis as well?

Loading the results and looking at them in a browser would probably allow me to make a judgement on this. Perhaps we can load the runs you have done.

A Mikado version being poorer on the human test data doesn't necessarily indicate that the improvements or the different config in the v2 release are not desirable; it is just not the best for this specific dataset.

lucventurini commented 5 years ago

Hi @swarbred,

You have tested v1.0.2 and v2 using what you believe are the same input and config files as the published results, and for human the new results are worse in both Mikado versions?

Not exactly. I generated the Daijin/Mikado configuration files through the command line (i.e. mikado configure --list star.txt [..] etc.), specifying as inputs:

Firstly, how do these compare, i.e. how much worse are they than the published results?

The comparison results have been downloaded from FigShare and double-checked with the paper. We go, in STAR, from a Transcript F1 of 18.82% (paper) to 8.36% (1.0.2 redone from scratch on my computer at NHM) to ~7% (2.0rc6, cannot completely confirm on this one as I have been trying lots of stuff, but it is about the right result).

So, less than half as accurate. Most of the problems stem from the fact that a lot of transcripts (~3000) end up with an extension (code n) rather than an equivalent intron chain (code =). So it is not a horrible result in itself, but it dramatically impacts the headline accuracy.
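As a reminder of why the headline figure moves so much: the transcript F1 is the harmonic mean of sensitivity and precision, so a drop in both compounds quickly. A toy illustration follows; the numbers in it are made up for the example, not taken from the actual compare output.

```python
# Toy illustration of how transcript-level F1 responds when both
# sensitivity and precision drop; the values are made up, not the
# real Mikado compare figures.

def f1(sensitivity, precision):
    """Harmonic mean of sensitivity and precision, in percent."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)

print(round(f1(20.0, 18.0), 2))  # ~18.95: both metrics around 20%
print(round(f1(9.0, 8.0), 2))    # ~8.47: both metrics roughly halved
```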

What do the Arabidopsis results look like for the same comparison?

We go from an F1 of 56.44% to an F1 of 59.12%. So we improve.

My second query is on the new 1.0.2 and v2 runs: how do they compare with each other, i.e. is the v1.0.2 run giving "better" results (when the config is the same)? If so, can you share the mikado compare results?

See above (we improve on A. thaliana). I will attach the mikado compare results shortly.

Loading the results and looking at them in a browser would probably allow me to make a judgement on this. Perhaps we can load the runs you have done.

Probably that would be for the best, yes.

A Mikado version being poorer on the human test data doesn't necessarily indicate that the improvements or the different config in the v2 release are not desirable; it is just not the best for this specific dataset.

The fact that Mikado markedly decreased its performance on one of the original datasets is a bit of a problem if we want to release an update note. We can argue our way around that, I guess, but it's not ideal.

swarbred commented 5 years ago

The comparison results have been downloaded from FigShare and double-checked with the paper. We go, in STAR, from a Transcript F1 of 18.82% (paper) to 8.36% (1.0.2 redone from scratch on my computer at NHM) to ~7% (2.0rc6, cannot completely confirm on this one as I have been trying lots of stuff, but it is about the right result).

Is the worse F1 due to lower precision? Do we have many more genes and/or transcripts in the new runs?

Are the new runs done without the padding?

The Arabidopsis results you mention seem positive, i.e. better in v2. The Arabidopsis set was always the better test dataset, as the human GENCODE annotation has been annotated in a way that is not typical of most gene sets (I did some of it); also, I always viewed the RNA-Seq data as less clean.

When assessing between versions, I would place more weight on the Arabidopsis dataset.

The fact that Mikado markedly decreased its performance on one of the original datasets is a bit of a problem if we want to release an update note. We can argue our way around that, I guess, but it's not ideal.

From your numbers, the difference between versions is 8.36% (1.0.2) to ~7% (2.0rc6), so not a big difference; the paper figures seem to reflect some differences in the config that we haven't captured. I imagine we could adjust the scoring in a way that improves the final F1, i.e. tailor it to this specific data, but that would probably result in a config that is really only useful/best for this data. If we want a human test dataset, then at some point it would make sense to put this together from RNA-Seq / long-read data that I/we have more confidence in. I've got data that we could use for this. We don't have to use the same dataset as the paper to compare versions.

lucventurini commented 5 years ago

Is the worse F1 due to lower precision? Do we have many more genes and/or transcripts in the new runs?

We have more genes but far fewer transcripts (about 30% fewer). I am attaching the comparisons so you can have a look. Both precision and sensitivity take a whopping hit.

Are the new runs done without the padding?

Both with and without padding (for A. thaliana, it is with). It does not make much of a difference in terms of accuracy.

From your numbers, the difference between versions is 8.36% (1.0.2) to ~7% (2.0rc6), so not a big difference; the paper figures seem to reflect some differences in the config that we haven't captured. I imagine we could adjust the scoring in a way that improves the final F1, i.e. tailor it to this specific data, but that would probably result in a config that is really only useful/best for this data. If we want a human test dataset, then at some point it would make sense to put this together from RNA-Seq / long-read data that I/we have more confidence in. I've got data that we could use for this. We don't have to use the same dataset as the paper to compare versions.

That would be for the best I think.

comparisons.zip

lucventurini commented 5 years ago

Update: I have completely rerun the data for human, i.e. I have redone the alignments, assemblies, etc. on our cluster. Nothing changed: Mikado still greatly underperforms compared to the paper, and the most common "mistake" consists in choosing models that have extra exons (n). I think I will leave it as it is for the time being.

@swarbred, if we could at some point run Mikado on the datasets you were hinting at, we would have a better case for the update.

lucventurini commented 5 years ago

@swarbred @gemygk

Please ignore this issue. I had another look and - very embarrassingly - the reason for the far lower performance was that I must have indexed the reference.gff3 and filtered.gff3 files while instructing Mikado to strip the UTRs. That is the reason for the predominant n class code.

Once I corrected this user error, the problem became much more subdued: there is still a discrepancy, but it is due to the fact that the default parameters are very conservative in calling alternative splicing events.
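A quick sanity check on the annotation actually being indexed would have caught this kind of mistake. Below is a minimal sketch (the file name is hypothetical, not necessarily the file used here) that simply reports whether a GFF3 still contains UTR features before it is handed to the compare step.

```python
#!/usr/bin/env python3
"""Report whether a GFF3 still contains UTR features.

A stripped-down sanity check for the kind of mistake described above;
the file name below is hypothetical.
"""

from collections import Counter

gff3 = "reference.gff3"  # hypothetical path: the annotation about to be indexed

utr_types = Counter()
with open(gff3) as handle:
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip non-feature lines (e.g. FASTA section)
        feature = fields[2].lower()
        if "utr" in feature:
            utr_types[feature] += 1

if utr_types:
    for feature, count in utr_types.most_common():
        print(f"{feature}\t{count}")
else:
    print("WARNING: no UTR features found - was the annotation stripped?")
```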

It might still be worthwhile to re-check the exact method used for the paper, but at least now I know that the performance has not completely sunk.

Closing.