Harmonise output format to that of Salmon

gringer commented 8 months ago

Oarfish produces three files as output:

\<basename>.meta_info.json - metadata / mapping summary
\<basename>.quant - tab-separated gene / transcript counts [tname, len, num_reads]
\<basename>.infreps.pq - bootstrap replicates file

These files are similar to the output of Salmon, but not the same (e.g. it doesn't use quant.sf), so the .quant files will need to be manually converted into a count matrix for processing using DESeq2 (or a similar program).
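The manual conversion mentioned above could look roughly like the following. This is a minimal sketch, not part of oarfish or tximport: it assumes each `.quant` file is tab-separated with a header row naming the columns `tname`, `len` and `num_reads` (the header row is an assumption), and the sample names and values are made up for illustration.

```python
import csv
import io


def read_quant(handle):
    """Parse one .quant table into {transcript_name: num_reads}.

    Assumes a header row with the columns tname, len, num_reads.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["tname"]: float(row["num_reads"]) for row in reader}


def build_count_matrix(samples):
    """Combine per-sample counts into (transcripts, sample_names, rows).

    rows[i][j] is the count of transcripts[i] in sample_names[j];
    transcripts missing from a sample are filled with 0.0.
    """
    sample_names = sorted(samples)
    transcripts = sorted({t for counts in samples.values() for t in counts})
    rows = [[samples[s].get(t, 0.0) for s in sample_names] for t in transcripts]
    return transcripts, sample_names, rows


# In-memory stand-ins for two .quant files (hypothetical values).
sample_a = io.StringIO("tname\tlen\tnum_reads\ntx1\t1000\t12.5\ntx2\t500\t0\n")
sample_b = io.StringIO("tname\tlen\tnum_reads\ntx1\t1000\t7\ntx2\t500\t3.25\n")

samples = {"sampleA": read_quant(sample_a), "sampleB": read_quant(sample_b)}
transcripts, sample_names, matrix = build_count_matrix(samples)
```

With real data, each `io.StringIO` would be replaced by an opened `.quant` file, and the resulting matrix written out for import into R.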

It would be helpful, given that it's the same lab producing these files, if the output of these programs could be harmonised, so that it can be used directly by any program that can process Salmon output.

https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-tximeta

rob-p commented 8 months ago

Hi @gringer,

Thanks for the input and suggestion. I'm open to discussion, but also want to allow tools to evolve over time. For example, I think the infreps.pq is a strictly better solution than what salmon provides: it uses a standard, well-supported data format (Parquet) as opposed to a custom binary format. In fact, support for reading parquet inferential reps is already present in tximeta as part of its support for piscem-infer.

I'd already mentioned this to @mikelove, but I think we should definitely add native support for oarfish to tximeta. One other important thing about changes in piscem-infer and oarfish is that, based on previous user feedback (with respect to salmon), we've moved from having the output all live in a specific directory by necessity to simply having the output be a file stem which the multiple output files share. If the provided stem includes a new directory name, then that will be properly created. However, after several discussions with heavy salmon users, there was broad agreement for preference of the new idiom over the old one.

--Rob
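To illustrate the file-stem idiom described above: a minimal sketch of how a single stem could expand into the three output files, creating the directory part of the stem if there is one. This is not oarfish's actual implementation; `output_paths` is a hypothetical helper, and only the three suffixes come from this thread.

```python
import tempfile
from pathlib import Path

# Suffixes of the three output files listed earlier in this thread.
SUFFIXES = (".meta_info.json", ".quant", ".infreps.pq")


def output_paths(stem):
    """Return the output paths sharing `stem`, creating its parent directory."""
    stem = Path(stem)
    # If the stem names a directory that does not exist yet, create it.
    stem.parent.mkdir(parents=True, exist_ok=True)
    # Append rather than replace suffixes, so stems containing dots survive.
    return [stem.parent / (stem.name + suffix) for suffix in SUFFIXES]


with tempfile.TemporaryDirectory() as tmp:
    paths = output_paths(Path(tmp) / "run1" / "sampleA")
    created = (Path(tmp) / "run1").is_dir()

names = [p.name for p in paths]
```

Here the stem `run1/sampleA` yields `sampleA.meta_info.json`, `sampleA.quant` and `sampleA.infreps.pq` inside a newly created `run1` directory.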

gringer commented 8 months ago

Native support for oarfish being added to tximeta would be great.

Thanks for explaining this. I understand the issues with different user bases; I just had a faint hope that things might be more malleable with two similar tools being created by the same research lab.

mikelove commented 8 months ago

Added an issue:

https://github.com/thelovelab/tximeta/issues/81

Rob maybe you can throw up 1-2 quantified samples somewhere?

mikelove commented 8 months ago

Oh I noticed/remembered there is currently no digest of the reference sequence in these output files.

You can use this new cut of tximport and skipMeta=FALSE in tximeta to build an SE. tximeta just passes type to tximport so this all should work without any changes to tximeta. (changes will be needed in tximeta once we get to reference digests and identification)

rob-p commented 8 months ago

Thanks @mikelove,

So the issue here is that oarfish uses minimap2 alignments as input, so it may never even see the transcriptome. How do you suggest we handle this? We could have a signature based on transcript names and lengths (present in the BAM/SAM header), or we could add an oarfish command to allow the user to add a signature to a quantification result, but then that introduces a user-dependent step and is error-prone.

--Rob
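The first option above, a signature based on the transcript names and lengths in the SAM header, could be sketched as follows. This is only an illustration under stated assumptions: it hashes the `SN` and `LN` fields of the `@SQ` lines with SHA-256 in header order, which is neither the scheme oarfish uses nor the GA4GH digest; `header_signature` is a hypothetical name.

```python
import hashlib


def header_signature(sam_header_text):
    """SHA-256 over (name, length) pairs from @SQ lines, in header order."""
    digest = hashlib.sha256()
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD, @PG and other non-sequence header lines
        # Each field after the record type looks like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        digest.update(f"{fields['SN']}\t{fields['LN']}\n".encode("utf-8"))
    return digest.hexdigest()


# A made-up header with two transcripts.
header = "@HD\tVN:1.6\n@SQ\tSN:tx1\tLN:1000\n@SQ\tSN:tx2\tLN:500\n@PG\tID:minimap2\n"
signature = header_signature(header)

# The same transcripts without the @HD/@PG lines give the same signature.
same = header_signature("@SQ\tSN:tx1\tLN:1000\n@SQ\tSN:tx2\tLN:500\n")
# Changing one length gives a different signature.
different = header_signature("@SQ\tSN:tx1\tLN:1001\n@SQ\tSN:tx2\tLN:500\n")
```

Because the hash depends on header order, two references with the same transcripts in a different order would get different signatures; an order-independent variant would sort the pairs first.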

mikelove commented 8 months ago

We should definitely adopt the proposal ideas (so the former). We can even start to implement the GA4GH digest. Let's chat this week.

mikelove commented 8 months ago

And to be clear, latest GitHub of tximport will work with oarfish

COMBINE-lab / oarfish

Harmonise output format to that of Salmon #13