aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

Chunking cells in large dataset #185

Closed · Elhl93 closed this issue 4 years ago

Elhl93 commented 4 years ago

Hi,

I have a huge single-cell dataset (>600k cells) from different animals and different conditions/timepoints. After filtering, I have ~15k genes to work with.

Although I work on an HPC, the dataset is big enough that I need to split it into chunks to improve runtime (referring to #99). Since I have no pre-made lists as in #99, I need to run both GRNBoost2 and cisTarget.
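
For concreteness, here is a minimal sketch of what per-chunk GRN inference could look like using arboreto's `grnboost2` (the implementation pySCENIC wraps); the file names, chunk size, and seed below are illustrative assumptions, not something from this thread:

```python
# Sketch only: split a cells x genes expression matrix into row chunks and
# run GRNBoost2 on each chunk. File names and chunk size are placeholders.
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

expr = pd.read_csv("expression.csv", index_col=0)  # hypothetical input, cells x genes
tf_names = load_tf_names("allTFs_hg38.txt")        # TF list from the SCENIC resources

chunk_size = 40_000
adjacencies = []
for start in range(0, expr.shape[0], chunk_size):
    chunk = expr.iloc[start:start + chunk_size]
    # grnboost2 returns a DataFrame of (TF, target, importance) edges
    adjacencies.append(grnboost2(expression_data=chunk, tf_names=tf_names, seed=777))
```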

I did a trial in which I took one cell type and split it into 2 chunks of 40k cells each, with an overlap of ~10k cells. My goal was to understand whether the RSS scores identified for the overlapping 10k cells are identical in both chunks, or whether the score is relative, i.e. depends on the other cells in the dataset.

  1. Do you recommend splitting the dataset by sample (the dataset comprises >100 samples), or e.g. by cell type (we detect 15 cell types) or by condition? This matters if the results depend on the other cells in the dataset.
  2. Correlating the RSS scores of the overlapping 10k cells described above gives r=0.51. Is that expected? You described that there is variability due to the probabilistic nature of GRNBoost2. Do you then recommend running it multiple times (e.g. n=5) and only considering e.g. the recurring top 20 regulons? (See the sketch after this list.)
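
To make point 2 concrete, here are hypothetical helpers for that comparison, assuming each run's RSS matrix is the (cell type x regulon) DataFrame returned by `pyscenic.rss.regulon_specificity_scores`; the function names are my own placeholders:

```python
# Sketch only: compare RSS between runs and keep regulons that recur in the
# top n across repeated runs.
from typing import List, Set

import pandas as pd

def top_regulons(rss: pd.DataFrame, cell_type: str, n: int = 20) -> Set[str]:
    """Names of the n highest-scoring regulons for one cell type."""
    return set(rss.loc[cell_type].nlargest(n).index)

def recurring_regulons(runs: List[pd.DataFrame], cell_type: str, n: int = 20) -> Set[str]:
    """Regulons that make the top-n list in every run."""
    return set.intersection(*(top_regulons(r, cell_type, n) for r in runs))

def rss_correlation(rss_a: pd.DataFrame, rss_b: pd.DataFrame, cell_type: str) -> float:
    """Pearson r between two runs' RSS vectors, over their shared regulons."""
    shared = rss_a.columns.intersection(rss_b.columns)
    return rss_a.loc[cell_type, shared].corr(rss_b.loc[cell_type, shared])
```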

Thanks for your thoughts!

cflerin commented 4 years ago

Hi @Elhl93,

Interesting question, and I think you have the right idea about downsampling. If you have a few conditions/timepoints, normally I would run these separately, and then also run the combined full dataset. But with a dataset this large, if you try to use the full 600k cells, it will take quite some time to run the GRNBoost2 step -- probably on the order of days to weeks, even on an HPC with many cores.

So I think first, I would run the conditions separately, downsampling to maybe 100k cells if necessary. Splitting it by cell type isn't a good approach because it makes it harder to pick out TFs/regulons that are differential across cell types; this could be a reason that your RSS correlation is "low". Running the conditions separately will already give a good idea of the regulons present in your datasets.
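
As a rough illustration of the per-condition runs with downsampling (assuming the data sit in an AnnData object with a `condition` column in `.obs`; the column name, file names, and 100k cap are assumptions based on the suggestion above):

```python
# Sketch only: write one downsampled input file per condition for pySCENIC.
import scanpy as sc

adata = sc.read_h5ad("full_dataset.h5ad")  # hypothetical file name
max_cells = 100_000

for cond in adata.obs["condition"].unique():
    sub = adata[adata.obs["condition"] == cond].copy()
    if sub.n_obs > max_cells:
        sc.pp.subsample(sub, n_obs=max_cells, random_state=0)  # in-place downsample
    sub.write_h5ad(f"{cond}.h5ad")  # one input per pySCENIC run
```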

Then you can decide if running the full dataset is worth the computation time. I would start with a random 100k cells on the combined data first. Then if you find something particularly interesting, like a few regulons, you could run the full dataset, and only specify those TFs as an input (for instance).
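
A sketch of that last idea, restricting a full-dataset GRNBoost2 run to a few TFs of interest via the `tf_names` argument (the TF names and file paths are placeholders):

```python
# Sketch only: rerun GRNBoost2 on the full matrix for a handful of TFs.
# Restricting tf_names shrinks the problem: a model is still fit per target
# gene, but only these TFs are considered as candidate regulators.
import pandas as pd
from arboreto.algo import grnboost2

expr_full = pd.read_csv("expression_full.csv", index_col=0)  # cells x genes, hypothetical
tfs_of_interest = ["Sox9", "Foxp1", "Gata3"]                 # placeholder TF names

adjacencies = grnboost2(expression_data=expr_full, tf_names=tfs_of_interest, seed=777)
adjacencies.to_csv("adjacencies_subset.csv", index=False)
```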

In general, running multiple times is helpful for refining the list of regulons and target genes, but computation time could be an issue with a dataset this size, and I don't think it would affect RSS all that much unless you prune the target genes really aggressively. It's more likely that the differences in RSS come from the definition (cell composition) of the clusters you're comparing.