EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Combining multi-sample GTF files #344

Closed mourisl closed 3 years ago

mourisl commented 3 years ago

Thank you for developing Mikado. After developing CLASS2, we also developed PsiCLASS (https://github.com/splicebox/PsiCLASS) to assemble transcripts for a single RNA-seq sample or for multiple RNA-seq samples simultaneously. When processing multiple RNA-seq samples, PsiCLASS also reports a consensus meta GTF file across all the samples; the current strategy is simply to vote based on the abundances. This voting strategy is powerful and outperformed other available mergers, but it is fairly naive, and we hope a more sophisticated approach could give better results. Though Mikado was not designed to merge multi-sample GTFs, it can take multiple GTF files as input. I just gave it a try on simulated data of 25 samples. Mikado reported decent results with minimal input (just the sample-wise GTF files, no BLAST, no ORF), and the results were already slightly better than TACO's. Therefore I plan to explore the potential of Mikado a bit more, and have several questions:

  1. Is it possible for Mikado to read in the transcript abundance information from the input GTF files, such as "TPM" or "FPKM", for scoring?
  2. Is there a limit on the number of GTFs as the input?
  3. Is the configuration file "mammalian.yaml" right for human transcript analysis?

I also found a typo in https://mikado.readthedocs.io/en/latest/Tutorial/#mikado-pick: Should it be "--subloci-out" instead of "--subloci_out"?

Thanks, Li

lucventurini commented 3 years ago

Dear @mourisl

Thank you for your interest in our tool, and for taking the time to thoroughly evaluate it against other approaches. We will try, in kind, to be as helpful as we can.

In order, for your questions:

  1. Not directly at the moment. Different tools report abundances under different names ("Abundance" for the original CLASS, "FPKM" for Cufflinks, etc.). It is also unclear whether the information would be informative at all when mixing different assemblies, although of course this problem does not arise when collapsing multiple-sample runs of the same tool. We do offer a way to take the abundance into account though, see below.
  2. No, there is no hard limit. We've used it with a fairly large number of GTFs, and recent improvements under the hood should have made Mikado resilient to a large increase in the number of input files.
  3. Yes, that scoring file is absolutely the most appropriate, but see below.
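
To illustrate the naming problem from point 1, here is a minimal Python sketch that pulls an abundance value out of a GTF transcript line. It is not part of Mikado: the helper names (`parse_attributes`, `abundance`) and the list of candidate keys are assumptions made for this example, and the parser expects every attribute to end with a semicolon.

```python
import re

# Attribute names that different assemblers use for abundance.
# Illustrative only; extend it for the tools you actually run.
ABUNDANCE_KEYS = ("TPM", "FPKM", "Abundance")

# One attribute: key, whitespace, optionally quoted value, semicolon.
_ATTR = re.compile(r'(\S+)\s+"?([^";]+)"?\s*;')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of key -> string value."""
    return {key: value for key, value in _ATTR.findall(field)}


def abundance(gtf_line):
    """Return (transcript_id, key, value) for the first abundance key found.

    Returns None for non-transcript lines, malformed lines, or lines
    without any of the known abundance keys.
    """
    cols = gtf_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "transcript":
        return None
    attrs = parse_attributes(cols[8])
    for key in ABUNDANCE_KEYS:
        if key in attrs:
            return attrs.get("transcript_id"), key, float(attrs[key])
    return None


line = ('chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
        'gene_id "g1"; transcript_id "g1.1"; cov "3.2"; FPKM "12.5"; TPM "20.25";')
print(abundance(line))
```

Because the key order in `ABUNDANCE_KEYS` decides which value wins when a line carries several, the same order should be used for every input file.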

With regard to transcript abundance, the method we use relies on either calculating the abundance post-prepare with e.g. kallisto, or gathering the abundance data manually. It can then be fed into mikado serialise as an "external scores" samplesheet. Please see our documentation for details on the process.

Please note that, as explained in that section of the documentation, in order for Mikado to actually use the abundances so provided, the scoring file needs to be amended appropriately.
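
As a rough sketch of the "gathering the abundance data" step, the Python below copies the `target_id` and `tpm` columns of a kallisto `abundance.tsv` into a two-column tab-separated table. The column names are those kallisto writes; the output header (`tid`) and score name (`tpm`) are assumptions for this example, so please check the Mikado documentation for the exact layout that mikado serialise expects.

```python
import csv
import io


def kallisto_to_scores(abundance_tsv, out, id_header="tid", score_name="tpm"):
    """Copy target_id and tpm from a kallisto table into a two-column TSV.

    `abundance_tsv` and `out` are open text file objects. The default
    `id_header` and `score_name` are assumptions, not a confirmed
    Mikado format. Returns the number of transcripts written.
    """
    reader = csv.DictReader(abundance_tsv, delimiter="\t")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_header, score_name])
    written = 0
    for row in reader:
        writer.writerow([row["target_id"], row["tpm"]])
        written += 1
    return written


# Tiny in-memory stand-in for a kallisto abundance.tsv (made-up values).
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "t1\t1500\t1320.5\t40\t12.5\n"
    "t2\t900\t720.5\t0\t0\n"
)
result = io.StringIO()
count = kallisto_to_scores(example, result)
print(count)
print(result.getvalue(), end="")
```

With real data the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`.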

> I also found a typo in https://mikado.readthedocs.io/en/latest/Tutorial/#mikado-pick: Should it be "--subloci-out" instead of "--subloci_out"?

Yes, absolutely, thank you for pointing it out!

swarbred commented 3 years ago

I would just add that while you can run Mikado without the additional Portcullis, BLAST or ORF files, this is not really advised and not really the point of the tool. At the very least, the ORF files should be used.

mourisl commented 3 years ago

@lucventurini Thank you for providing the details and documentation of the scoring file. I will modify the file accordingly and will let you know the results. @swarbred Thank you for the explanation. I actually tried a BLAST file, but it did not affect the result. I will try the ORF file.

lucventurini commented 3 years ago

Dear @mourisl

We should have solved it in #354. We will update the documentation accordingly, but in a nutshell, you can now add scoring parameters of the form "attributes.{MY_PARAMETER}".

So, for example, for TPMs stored in the attribute tpm (the name is case sensitive):

attributes.tpm:
  - default: 0
  - rescaling: max
  - rtype: float
  - use_raw: false
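
For intuition about what `rescaling: max` with `default: 0` does to such values, here is a minimal Python sketch. It assumes plain min-max rescaling within a group of transcripts, with missing values replaced by the default and ties scored as zero; the function name `rescale_max` is hypothetical and this is not Mikado's actual code.

```python
def rescale_max(values, default=0.0):
    """Min-max rescale so the highest value in the group scores 1.0.

    `values` maps transcript ID -> abundance, with None for transcripts
    that lack the attribute; those receive `default` before rescaling.
    """
    filled = {tid: (default if v is None else float(v))
              for tid, v in values.items()}
    lo, hi = min(filled.values()), max(filled.values())
    if hi == lo:
        # Assumption: when every transcript ties, nobody gains any score.
        return {tid: 0.0 for tid in filled}
    return {tid: (v - lo) / (hi - lo) for tid, v in filled.items()}


tpm_by_transcript = {"t1": 10, "t2": None, "t3": 5}
print(rescale_max(tpm_by_transcript))
```

In this sketch a transcript without a `tpm` attribute is treated as having zero abundance, so it ends up at the bottom of the ranking instead of being skipped.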

I hope this helps.