constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

In dropseq experiment, the droplet matrix is too large to be loaded into the memory #66

Closed Pentayouth closed 3 years ago

Pentayouth commented 3 years ago

I have suspected that my single-cell data has very high background contamination. Thank you for developing a promising tool for solving this problem. While it is straightforward to deal with 10X data, I'm not quite clear about how to deal with dropseq data. I found about 2,100,000 non-empty droplets (and 20,000,000 droplets in total) in a dropseq experiment. If I include all the non-empty barcodes, the output droplet matrix becomes too large (100 GB) to be loaded on my 64 GB machine (in other words, the tod, the table of droplets, is too large). I believe this is not the right way to do it, and you must have considered this situation, but I can't figure out how to do it correctly.

I also noticed that dropseq data was handled in the SoupX preprint, and in #11 you mentioned that using dropseq data is feasible. Would you mind giving a more detailed explanation?

Pentayouth commented 3 years ago

I found that in your preprint, droplets with 2 to 10 UMIs were used to calculate the background profile. I applied a >=3 threshold (since the dropseq pipeline doesn't support a maximum threshold), which reduced the number of droplets from 2,095,200 to 492,585. That is still a very big matrix, but it is at least readable by data.table::fread(). I wonder if there is a better solution. Any comments or suggestions are highly appreciated. Pentayouth.
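For anyone following along, here is a minimal sketch of selecting background droplets by UMI count once the matrix is already in R as a sparse matrix. It only uses the Matrix package; the toy matrix and the variable names (`tod`, `n_umis`, `background`) are made up for illustration, and the 2 to 10 window is simply the range quoted from the preprint above, not a recommended setting.

```r
library(Matrix)

# Toy table of droplets: 3 genes x 5 barcodes (all values are made up).
tod <- sparseMatrix(
  i = c(1, 2, 1, 3, 2, 3, 1),
  j = c(1, 1, 2, 3, 4, 4, 5),
  x = c(1, 1, 40, 1, 3, 4, 12),
  dims = c(3, 5),
  dimnames = list(paste0("Gene", 1:3), paste0("BC", 1:5))
)

# Total UMIs per droplet, computed without converting to a dense matrix.
n_umis <- Matrix::colSums(tod)

# Keep only low-count droplets as candidate background.
# The bounds are assumptions taken from the range mentioned in this comment.
min_umis <- 2
max_umis <- 10
is_background <- n_umis >= min_umis & n_umis <= max_umis
background <- tod[, is_background, drop = FALSE]
```

Applying an upper bound in R like this is one way around a pipeline that only offers a minimum threshold, provided the thresholded matrix fits in memory in sparse form.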

Pentayouth commented 3 years ago

I succeeded, and the result is very exciting! Thanks again for developing the tool!

constantAmateur commented 3 years ago

I'm glad you solved your problem. For the reference of anyone else with a similar problem, the expectation is that the input matrices will be in sparse matrix format. Loading them into memory any other way quickly becomes infeasible, especially for the large matrix of empty droplets.
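As a rough illustration of why the sparse format matters, the sketch below builds a sparse matrix from non-zero (gene, barcode, count) triplets with `Matrix::sparseMatrix()` and estimates the size of the equivalent dense matrix. The long-format `counts` table and the gene count of 20,000 are assumptions for the example; the thread does not say how the dropseq output was laid out or how many genes it contained.

```r
library(Matrix)

# Hypothetical long-format counts: one row per non-zero (gene, barcode) pair.
counts <- data.frame(
  gene    = c("GeneA", "GeneB", "GeneA", "GeneC"),
  barcode = c("BC1", "BC1", "BC2", "BC3"),
  umi     = c(3, 1, 2, 5)
)

genes <- unique(counts$gene)
barcodes <- unique(counts$barcode)

# Only the non-zero entries are stored (genes in rows, droplets in columns).
tod <- sparseMatrix(
  i = match(counts$gene, genes),
  j = match(counts$barcode, barcodes),
  x = counts$umi,
  dims = c(length(genes), length(barcodes)),
  dimnames = list(genes, barcodes)
)

# Back-of-the-envelope size of a dense double matrix at the scale
# discussed in this thread.
n_droplets <- 492585  # droplet count quoted earlier in this thread
n_genes <- 20000      # assumed gene count, not given in the thread
dense_gb <- n_droplets * n_genes * 8 / 1e9
```

Under these assumptions the dense matrix alone would need roughly 79 GB, whereas the sparse representation grows with the number of non-zero counts. A matrix built this way, together with a matching sparse table of counts for the cells (`toc`), is the kind of input that `SoupChannel(tod, toc)` expects.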