quant.sf files from salmon don't contain the bootstrapped estimates. They are delimited text tables that contain only the averaged abundances.
Using the quant.sf file with RATs is very easy with the existing provisions:
# Load the required packages (data.table for the tables, rats for call_DTU).
library(data.table)
library(rats)
# Read the quant.sf of one replicate per condition.
sfA <- as.data.table(read.delim("pathA/quant.sf"))
sfB <- as.data.table(read.delim("pathB/quant.sf"))
# Subset each table to the transcript names and counts, then test for DTU.
mydtu <- call_DTU(annot= myannot, count_data_A= sfA[, .(Name, NumReads)], count_data_B= sfB[, .(Name, NumReads)])
With regard to reading the salmon and kallisto bootstraps directly: they use different binary formats that I know nothing about. Given that wasabi and sleuth already provide converters/parsers, I am not inclined to reinvent the wheel. Those parsers are written by the salmon and kallisto people respectively, so I trust them more than anything I'd write. At best, I could write a wrapper function that calls them for you...
I will think about whether it is worth writing a wrapper function and adding wasabi and sleuth as package dependencies.
In the meantime, you don't need to run the DTE step of sleuth for RATs to be able to extract data. Just do the data import with sleuth_prep(). That creates an object that has everything RATs needs, without the weighty model fitting of DTE. It is a bit of code overhead, but you can save that object for later re-use.
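Roughly, the import-only route looks like this. It is just a sketch: the sample sheet layout and the slo/name_A/name_B argument names are from memory, so check them against the sleuth and RATs documentation for your installed versions.

library(sleuth)
library(rats)

# One row per replicate: sample name, condition, and the kallisto output directory.
s2c <- data.frame(sample    = c("A1", "A2", "B1", "B2"),
                  condition = c("A", "A", "B", "B"),
                  path      = c("kallisto/A1", "kallisto/A2", "kallisto/B1", "kallisto/B2"),
                  stringsAsFactors = FALSE)

# Import the abundances and bootstraps; skipping sleuth_fit()/sleuth_lrt() avoids the DTE model fitting.
slo <- sleuth_prep(s2c, ~condition)

# Save the imported object so this step only has to run once.
saveRDS(slo, "my_sleuth_object.rds")

# Hand the sleuth object straight to RATs.
mydtu <- call_DTU(annot= myannot, slo= slo, name_A= "A", name_B= "B")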
Implementing a bespoke parser won't save you much typing. You'd still need to specify the multiple replicate paths for each condition.
Reading back, I realised the .sf solution is missing replicates. So maybe there is scope for a wrapper that reads and merges multiple tables.
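If you want to roll your own in the meantime, it could look something like the sketch below. The paths are placeholders, and it assumes call_DTU will accept a table of one transcript-ID column plus one count column per replicate.

library(data.table)
library(rats)

# Read several quant.sf files and return one table: Name plus one NumReads column per replicate.
merge_sf <- function(paths) {
  tabs <- lapply(paths, function(p) {
    sf <- as.data.table(read.delim(p))
    sf[, .(Name, NumReads)]
  })
  # Merge on the transcript name so the rows line up across replicates.
  merged <- Reduce(function(x, y) merge(x, y, by= "Name"), tabs)
  setnames(merged, c("Name", paste0("rep", seq_along(paths))))
  merged
}

countsA <- merge_sf(c("pathA1/quant.sf", "pathA2/quant.sf", "pathA3/quant.sf"))
countsB <- merge_sf(c("pathB1/quant.sf", "pathB2/quant.sf", "pathB3/quant.sf"))
mydtu <- call_DTU(annot= myannot, count_data_A= countsA, count_data_B= countsB)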
Another issue with using quant.sf directly is that TPMs are too small for the count-based tests in RATs. They'd need to be scaled up to the size of the sample, but the .sf file itself doesn't tell you what that size should be.
On the other hand, NumReads are presumably not normalised, so transcript length and sequence biases could skew the DTU results.
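For illustration, the rescaling would look something like this, assuming the library size can be supplied from elsewhere (it is not stored in the .sf table, which is exactly the problem):

library(data.table)

sfA <- as.data.table(read.delim("pathA/quant.sf"))

# TPMs sum to one million by construction, so rescale them to pseudo-counts using an
# externally supplied library size. The value below is a made-up placeholder.
lib_size_A <- 30e6
countsA <- sfA[, .(Name, NumReads= TPM * lib_size_A / 1e6)]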
Kallisto/wasabi/sleuth's est_counts are a better representation of the abundances.
Okay, so it sounds like going via the sleuth object is the best option for various reasons, then. Thanks for looking into this; I will continue to use sleuth objects as input to RATs, given how they handle biological replicates, normalization and bootstraps. Thanks!
Even without fitting models, the sleuth object is pretty huge. In a human dataset I tried, the vanilla sleuth object with "just" the data takes up 8GB, whereas the extracted bootstrapped abundances take up "only" 800MB. That is some very serious overhead! So at some point, when I have time to play, I should look into a solution that bypasses the sleuth loader altogether. Probably not in time for the next release, though.
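For reference, extracting just the bootstraps looks roughly like the sketch below. The internal layout it assumes (slo$kal[[i]]$bootstrap being a list of per-iteration tables with target_id and est_counts columns) is from memory of older sleuth versions, so inspect your own object with str(slo$kal[[1]]) before trusting it.

library(data.table)

# Pull only the bootstrapped est_counts out of a sleuth object, one table per sample.
extract_boots <- function(slo) {
  lapply(slo$kal, function(kal_one) {
    # Matrix with one row per transcript and one column per bootstrap iteration.
    mat <- sapply(kal_one$bootstrap, function(b) b$est_counts)
    data.table(target_id= kal_one$bootstrap[[1]]$target_id, mat)
  })
}

boot_data <- extract_boots(slo)
saveRDS(boot_data, "bootstraps_only.rds")  # far smaller than the full sleuth object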
Directly import transcript abundances (with bootstraps) from salmon output files (.sf), rather than having to import the counts into formatted data tables. This would be very useful, especially for avoiding having to load the very large sleuth object on a machine with very limited memory.
EDITED by developer to add action points: