COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Request: mem-mapped or shared-mem index for `salmon quant` #335

Open BenLangmead opened 5 years ago

BenLangmead commented 5 years ago

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)? Either

Describe the bug This is really a feature request -- apologies if it shouldn't go here.

Bowtie and similar tools (e.g. STAR) can use memory mapping or shared memory for the genome index. This has a nice benefit: when N Bowtie processes are simultaneously aligning to the same index on the same system, the index memory footprint is incurred once (not N times).

I may soon be running many simultaneous salmon quant processes on the same system, all quantifying with respect to the same (human) transcriptome index. The memory footprint is around 3GB, which adds up when there are many salmon quant processes. I don't expect to have lots of free RAM on this system, since other simultaneously-running processes will be aligning and incurring a much larger footprint (but using shared memory).

If salmon used memory mapping or shared memory for the index, I basically wouldn't have to worry about the peak memory footprint breaking the budget. Hence the request!
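The sharing mechanism being requested can be sketched with a read-only `mmap` of the on-disk index: the kernel backs every process's mapping with the same physical pages, so N readers pay for the index once. This is a minimal illustration of the principle only; the toy file stands in for the serialized salmon index, whose real on-disk layout is not modeled here.

```python
import mmap
import os
import tempfile

# Build a toy "index" file. In practice this would be the serialized
# salmon index; a flat 1 MiB blob is an illustrative assumption.
path = os.path.join(tempfile.mkdtemp(), "toy_index.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (1 << 20))

# Each quantification process maps the same file read-only. The kernel
# serves all mappings from one set of page-cache pages, so the memory
# cost is incurred once system-wide, not once per process.
with open(path, "rb") as f:
    idx = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mapped_size = len(idx)  # the whole index is addressable, no copy made
    idx.close()
```

Running several such processes against the same file is what would make the SHR column in `top` grow, since the mapped pages are counted as shared.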

To Reproduce N/A (feature request).

Expected behavior A good way to test whether it's working is to run simultaneous processes with the same index and check the SHR column in top. If it's working, the SHR value should be large.

Screenshots N/A

Desktop (please complete the following information): I'm mainly interested in seeing this feature for Linux, but it's great if it works elsewhere too. Some of these mechanisms are more portable than others.

Additional context N/A

rob-p commented 5 years ago

Hi @BenLangmead!

Thanks for the formal feature request. This is, indeed, a great idea, and something I've been interested in for quite a while. As far as I can tell, the main impediment to this is the hash table (https://github.com/greg7mdp/sparsepp) used in the index. The suffix array used by the mapping algorithm (by virtue of simply being a flat array of either 32 or 64-bit integers) is trivial to load via shared memory, as is the flat representation of the concatenated text itself. The bitvector and rank data structure that separate individual transcript sequences might be a bit trickier, but is also small enough to exist per-process. However, it's unclear to me if there is an easy or straightforward way to have the hash table reside in shared memory, and this is usually the single largest element of the index.
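The point about the suffix array being trivial to share follows from it being a flat run of fixed-width integers: it can be consumed straight out of a read-only mapping, with no deserialization into process-private memory. A small sketch, assuming an illustrative layout of little-endian 64-bit entries (not salmon's actual index format):

```python
import mmap
import os
import struct
import tempfile

# Serialize a toy suffix array as flat little-endian uint64s. This
# layout is an assumption for illustration, not salmon's real format.
path = os.path.join(tempfile.mkdtemp(), "sa.bin")
suffixes = [5, 2, 8, 0, 3]
with open(path, "wb") as f:
    f.write(struct.pack("<5Q", *suffixes))

with open(path, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access directly against the mapping: read entry 2
    # (byte offset 2 * 8) without loading the array into the heap.
    entry = struct.unpack_from("<Q", buf, 2 * 8)[0]  # -> 8
    buf.close()
```

A pointer-heavy hash table like sparsepp resists this treatment precisely because its nodes hold process-local pointers, which is why the hash table is the hard part here.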

As I mentioned, this is a feature that I've thought would be very useful for quite a while, and I'm interested in seeing it implemented. If you have any suggestions on what might be the best approach, I'm all 👂s.

jdidion commented 1 year ago

+1

Trying to quantify ~2000 Smart-Seq2 samples. Currently takes about 5 days on a single node doing 1 cell at a time.

Perhaps an easier way to implement this would be to provide a batch mode such that you load the index once and then serially quantify a batch of N samples within the same process. This would save the significant overhead of having to load the index for each sample (~50-75% of the total per-sample processing time). As a bonus, the batch mode could spit out a single transcript x sample matrix so you wouldn't have to run quantmerge separately.
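The proposed batch mode amounts to hoisting the index load out of the per-sample loop. A sketch of that control flow, where `load_index` and `quantify` are hypothetical stand-ins (salmon exposes no such API) and the returned transcript x sample structure mimics what a built-in merge could emit:

```python
# Hypothetical batch-mode control flow: pay the index load once, then
# quantify N samples serially in the same process.

def load_index(path):
    # Stand-in for the expensive one-time index load (~50-75% of
    # per-sample time in the scenario described above).
    return {"path": path, "transcripts": ["t1", "t2", "t3"]}

def quantify(index, sample):
    # Stand-in for quantifying one sample against the loaded index;
    # returns a per-transcript abundance estimate (zeros here).
    return {t: 0.0 for t in index["transcripts"]}

def quant_batch(index_path, samples):
    index = load_index(index_path)   # loaded once, reused for every sample
    matrix = {}                      # transcript x sample output
    for sample in samples:
        matrix[sample] = quantify(index, sample)
    return matrix                    # already merged; no separate quantmerge

result = quant_batch("txome_idx", ["s1", "s2"])
```

Compared to the shared-memory route, this needs no OS-level machinery, but it serializes samples within one process rather than letting independent jobs share the index.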