Open gringer opened 8 months ago
Hi @gringer,
Thanks for the input and suggestion. I'm open to discussion, but also want to allow tools to evolve over time. For example, I think the `infreps.pq` file is a strictly better solution than what salmon provides: it uses a standard, well-supported data format (Parquet) as opposed to a custom binary format. In fact, support for reading Parquet inferential replicates is already present in `tximeta` as part of its support for `piscem-infer`.
I'd already mentioned this to @mikelove, but I think we should definitely add native support for `oarfish` to `tximeta`. One other important thing about changes in piscem-infer and oarfish is that, based on previous user feedback (with respect to salmon), we've moved from requiring all of the output to live in a specific directory to simply taking a file stem that the multiple output files share. If the provided stem includes a new directory name, that directory will be created. After several discussions with heavy salmon users, there was broad agreement in preferring the new idiom over the old one.
--Rob
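As a rough illustration of the file-stem idiom described above, here is a minimal Python sketch. The three suffixes are the ones listed in this issue; the helper names `expected_outputs` and `ensure_parent_dir` are hypothetical and are not part of oarfish or piscem-infer.

```python
from pathlib import Path

# Suffixes as listed in this issue for oarfish output files.
OUTPUT_SUFFIXES = (".meta_info.json", ".quant", ".infreps.pq")


def expected_outputs(stem):
    """Return the output paths that share the given file stem (hypothetical helper)."""
    # Append to the full string instead of using Path.with_suffix, so that
    # stems which already contain dots (e.g. "sample.v2") are preserved.
    return [Path(str(stem) + suffix) for suffix in OUTPUT_SUFFIXES]


def ensure_parent_dir(stem):
    """Create the directory part of the stem if it does not exist yet (hypothetical helper)."""
    Path(stem).parent.mkdir(parents=True, exist_ok=True)
```

With a stem such as `results/sampleA`, this yields `results/sampleA.meta_info.json`, `results/sampleA.quant` and `results/sampleA.infreps.pq`, all living next to each other rather than in a per-sample directory.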
Native support for `oarfish` being added to `tximeta` would be great.
Thanks for explaining this. I understand the issues with different user bases; I just had a faint hope that things might be more malleable with two similar tools being created by the same research lab.
Added an issue:
https://github.com/thelovelab/tximeta/issues/81
Rob maybe you can throw up 1-2 quantified samples somewhere?
Oh I noticed/remembered there is currently no digest of the reference sequence in these output files.
You can use this new cut of tximport and `skipMeta=FALSE` in tximeta to build an SE. tximeta just passes `type` to tximport, so this should all work without any changes to tximeta. (Changes will be needed in tximeta once we get to reference digests and identification.)
Thanks @mikelove,
So the issue here is that `oarfish` uses minimap2 alignments as input, so it may never even see the transcriptome. How do you suggest we handle this? We could have a signature based on transcript names and lengths (present in the BAM/SAM header), or we could add an oarfish command to allow the user to add a signature to a quantification result, but that introduces a user-dependent step and is error-prone.
--Rob
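To make the first option concrete, here is a minimal Python sketch of a signature computed from the transcript names and lengths in a SAM header. It is only an illustration: it hashes the `SN` and `LN` fields of the `@SQ` lines with SHA-256, which is an assumed scheme and not the GA4GH digest or anything oarfish or tximeta currently implements. The function names are hypothetical.

```python
import hashlib


def parse_sq_lines(header_text):
    """Extract (name, length) pairs from the @SQ lines of a SAM header."""
    refs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        refs.append((fields["SN"], int(fields["LN"])))
    return refs


def name_length_signature(refs):
    """Hash names and lengths in header order (assumed scheme, for illustration only)."""
    h = hashlib.sha256()
    for name, length in refs:
        h.update(f"{name}\t{length}\n".encode("utf-8"))
    return h.hexdigest()
```

Note that this sketch is order-sensitive: the same transcripts listed in a different order give a different signature. Whether that is desirable is a design choice; sorting the pairs before hashing would make it order-independent.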
We should definitely adopt the proposal ideas (so the former). We can even start to implement the GA4GH digest. Let's chat this week.
And to be clear, the latest GitHub version of tximport will work with oarfish.
Oarfish produces three files as output:
- `<basename>.meta_info.json` - metadata / mapping summary
- `<basename>.quant` - tab-separated gene / transcript counts [tname, len, num_reads]
- `<basename>.infreps.pq` - bootstrap replicates file
These files are similar to the output of Salmon, but not the same (e.g. it doesn't use `quant.sf`), so the `.quant` files will need to be manually converted into a count matrix for processing using DESeq2 (or a similar program). It would be helpful, given that it's the same lab producing these files, if the output of these programs could be harmonised, so that it can be used directly by any program that can process Salmon output.
https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-tximeta
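For reference, the manual conversion mentioned above could look roughly like the following Python sketch. It assumes each `.quant` file is tab-separated with a header row naming the columns `tname`, `len` and `num_reads` (the layout listed in this issue); the function names are hypothetical, and transcripts missing from a sample are filled with 0.

```python
import csv


def read_quant(path):
    """Read one .quant file into a {transcript name: read count} dict."""
    counts = {}
    with open(path, newline="") as fh:
        # Assumes a header row containing the columns tname, len, num_reads.
        for row in csv.DictReader(fh, delimiter="\t"):
            # Parsed as float in case the counts are not whole numbers.
            counts[row["tname"]] = float(row["num_reads"])
    return counts


def build_count_matrix(sample_paths):
    """Combine several .quant files into a transcripts-by-samples matrix.

    sample_paths maps a sample name to the path of its .quant file.
    Returns (transcript names, sample names, rows of counts).
    """
    per_sample = {sample: read_quant(path) for sample, path in sample_paths.items()}
    samples = list(per_sample)
    transcripts = sorted(set().union(*per_sample.values()))
    matrix = [[per_sample[s].get(t, 0.0) for s in samples] for t in transcripts]
    return transcripts, samples, matrix
```

The returned rows can be written out as a TSV and loaded as a count matrix; any rounding needed by the downstream program is left to the user.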