aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
420 stars 179 forks source link

Chunking cells for better memory usage in aucell #99

Open gokceneraslan opened 4 years ago

gokceneraslan commented 4 years ago

Hi, thanks for the great implementation! I'm trying to run aucell on 100k cells and 30k genes with some gene sets and always getting memory error (with num_workers=1).First, I need to convert the sparse expression matrix to a dense DataFrame, which increases memory usage a lot. Then, DataFrame of ranks, which is of the same size, is produced and it doubles the memory usage which crashes my VM.

I was wondering if it would make aucell more memory efficient to chunk cells and calculate AUCs independently in every chunk. I might be wrong, but as far as I understand calculation is embarrassingly parallel for both cells and gene sets (latter is already exploited in the code).

What do you think?

cflerin commented 4 years ago

Hi,

This is an interesting enhancement, thanks for the suggestion. In principle, yes it's possible to calculate the cellular activity in chunks, but only after the generation of the ranking matrix.

But in general, we've seen that AUCell has the lowest memory usage of the pySCENIC steps, so it's strange that you'd run into issues at this step after having completed GRNBoost2 and cisTarget successfully. Also, 30k genes seems like a lot -- have you applied some basic filters on the gene- and cell-level? This could cut your expression matrix down quite a bit.

Can you share the specs of your machine (specifically, how much memory)?

gokceneraslan commented 4 years ago

Hi,

This is an interesting enhancement, thanks for the suggestion. In principle, yes it's possible to calculate the cellular activity in chunks, but only after the generation of the ranking matrix.

But in general, we've seen that AUCell has the lowest memory usage of the pySCENIC steps, so it's strange that you'd run into issues at this step after having completed GRNBoost2 and cisTarget successfully.

Actually, I'm not using GRNBoost2 or cisTarget at all. I'm using AUCell as an independent function to get scores for gene sets I have.

Also, 30k genes seems like a lot -- have you applied some basic filters on the gene- and cell-level? This could cut your expression matrix down quite a bit.

Yes, I know I can filter cells/genes but my point is that there is a nice way to exploit the fact that cell scores are independent in order to lower memory usage without filtering or subsampling.

When I saw that SCENIC and AUCell are being reimplemented in Python, I just assumed that you're aiming to build something that can scale up to hundreds of thousands of cells, that's why I'm proposing something that'd help the effort :)

Can you share the specs of your machine (specifically, how much memory)?

There is 100GB memory in my VM and it wasn't sufficient until I split cells into chunks myself. Storing twice much space of a densified matrix is really huge...

gokceneraslan commented 4 years ago

In principle, yes it's possible to calculate the cellular activity in chunks, but only after the generation of the ranking matrix.

Wait, why possible only after the generation of the ranking matrix? Rank matrix can also be built in chunks because it's rank of genes within a cell, no? Maybe I am missing something.

cflerin commented 4 years ago

Well the ranking function performs a shuffle on the expression matrix prior to ranking so that ties are not dependent on the original column order. So we could break up the ranking step too, but it would be nice to then maintain a single shuffled state across chunks so that the ranking is performed on the same input. It would be fine to do this with a fixed seed I think. I'll look more into it after doing some memory usage tests.

Were you able to get an output after splitting up your matrix?

bramvds commented 4 years ago

Hi Gökçen, Hi Chris,

You are entirely correct in stating that the whole genome rankings for all cells in an experiment for the AUCell step is done in a non-parallel way (mainly relying on core pandas functionality). And indeed, only the subsequent scoring of these rankings for the supplied regulons is done in parallel and this to increase speed. I did not parallelize the ranking step although this could have provided a speed increase.

The goal was clearly different from your request: increasing speed versus reducing memory footprint. When the goal would be to reduce the overall memory footprint of AUCell, we could chunk up the expression matrix, rank each chunk and calculate AUCs but this then needs to be done in a sequential and not in a parallel way (otherwise there will not be a reduced memory footprint). Moreover, the best way to reduce memory footprint is to only keep the top N ranked genes for each cell (N needs to be at least >= (total number of genes profiled in the scRNAseq experiment) * the auc_threshold parameter).

This is a more substantial development effort which needs to be planned and afterwards properly tested.

Kindest regards, Bram