aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

GRN on Large Dataset - 999 GB and 16 cores #426

Open Sayyam-Shah opened 1 year ago

Sayyam-Shah commented 1 year ago

Hello,

I am trying to run the grn step on a large dataset of 263,159 cells and 14,357 genes on an HPC cluster with 999 GB of memory and 16 cores. I tried increasing the cores, but that caused memory allocation problems. It started running, but it has been on the inferring-regulatory-networks step for 5 days. My HPC cluster only allows a maximum run time of 5 days, so the job was cancelled. Is there any way I can speed this up? 999 GB of memory is already pretty extreme, and a more efficient method would be best.

What are your recommendations when running pyscenic on large datasets like this one?

Below is the CLI command I used to run the GRN step:

pyscenic grn -o '/cluster/projects/adj.csv' --num_workers 16 '/cluster/projects/counts.csv' '/cluster/projects/lambert2018.txt'

Thank you, Sayyam
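One way to scope a job like this before committing to a multi-day run is a pilot on a random subset of cells: GRNBoost2 runtime grows with the number of cells, so timing a smaller job gives a rough estimate. A minimal sketch with pandas (the toy matrix and sizes here are placeholders, not from this thread; a real run would read the counts CSV instead):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a counts matrix (cells x genes); a real run would use
# pd.read_csv('/cluster/projects/counts.csv', index_col=0) instead.
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(1.0, size=(1000, 50)),
    index=[f"cell{i}" for i in range(1000)],
    columns=[f"gene{j}" for j in range(50)],
)

# Randomly subsample cells for a pilot GRN run; timing this smaller job
# gives a rough per-cell estimate before launching the full dataset.
subset = counts.sample(n=200, random_state=0)
print(subset.shape)  # (200, 50)
```

Writing `subset` back out with `subset.to_csv(...)` gives a drop-in replacement for the counts file in the pyscenic grn command above.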

Sayyam-Shah commented 1 year ago

Hello,

I am also experiencing the same issue with 70k cells but this time with 300 GB and 20 cores. It has been stuck on the inferring regulatory network step for five days.

2022-09-29 11:51:19,678 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2022-09-29 12:04:53,560 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.

Could you please recommend how I can speed this up, or how many resources you suggest for these cell counts?

Thank you, Sayyam

hyjforesight commented 1 year ago

Hello @Sayyam-Shah. I'm not a developer, but I believe the GRNBoost2 algorithm needs an overhaul; it is very inefficient right now. In my case, 20,000 cells with 20,000 genes needs an estimated 150,000 hours to run.

Sayyam-Shah commented 1 year ago

I agree, the algorithm is not memory efficient. I solved the issue with an alternative: I created metacells using the tool below and input my own feature genes to account for batch correction. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02667-1

I then ran the grn and cistarget steps on the metacells. However, I ran aucell using the metacell regulons and the full original expression matrix.

I got pretty good results from that workflow.
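The aggregation step in the workflow above can be sketched as follows. This is an illustration only: the metacell assignments would come from the linked metacell tool, and the toy matrix, labels, and gene names here are hypothetical. The aggregated matrix is what would go into the grn and cistarget steps, while aucell would still use the full per-cell matrix.

```python
import numpy as np
import pandas as pd

# Toy counts matrix, cells x genes (the orientation pySCENIC's grn step expects).
counts = pd.DataFrame(
    np.arange(18, dtype=float).reshape(6, 3),
    index=[f"cell{i}" for i in range(6)],
    columns=["geneA", "geneB", "geneC"],
)

# Hypothetical metacell membership; in practice this comes from the
# metacell tool linked above, not from a hand-written list.
membership = pd.Series(["mc0", "mc0", "mc0", "mc1", "mc1", "mc1"],
                       index=counts.index)

# Sum raw counts within each metacell to get a much smaller matrix
# for the grn/cistarget steps; aucell still runs on the full matrix.
metacell_counts = counts.groupby(membership).sum()
print(metacell_counts.shape)  # (2, 3)
```

The key point is the size reduction: GRN inference now sees one row per metacell instead of one row per cell.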

Yakun-Pang commented 1 year ago

Hi, just for anyone else experiencing the same issue with a large dataset: I am running 80k+ cells and about 30k genes with 512 GB of memory and 34 CPUs on an HPC cluster. I followed the advice from https://pyscenic.readthedocs.io/en/latest/faq.html and tried arboreto_with_multiprocessing.py. Unlike pyscenic grn, which stays stuck on the inferring-regulatory-networks step without any feedback until it finishes and is ready to write the output file, arboreto_with_multiprocessing.py reports progress and an estimated total time. Mine is still running but has finished 32% in 16 hours, so I am hopeful.
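For reference, the arboreto_with_multiprocessing.py script from the FAQ takes the same positional arguments as pyscenic grn. A sketch along the lines of the documented example, reusing the paths from the command earlier in this thread (check the exact flags against your installed version with --help):

```shell
# Same inputs as `pyscenic grn`: expression matrix first, then the TF list.
# Unlike the dask-based runner, this prints a progress bar with an ETA.
arboreto_with_multiprocessing.py \
    /cluster/projects/counts.csv \
    /cluster/projects/lambert2018.txt \
    --method grnboost2 \
    --output /cluster/projects/adj.tsv \
    --num_workers 16 \
    --seed 777
```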

ghuls commented 1 year ago

Which version of pySCENIC was this with? The arboreto step in 0.12.0 and higher now uses the multiprocessing implementation by default, as dask seems to be too unreliable.
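A quick way to answer the version question, assuming a standard pip install (a sketch using the stdlib importlib.metadata, available on Python 3.8+):

```python
# Report the installed pySCENIC version, if any; 0.12.0 and newer
# default to the multiprocessing runner for the GRN step.
from importlib.metadata import PackageNotFoundError, version


def pyscenic_version():
    """Return the installed pySCENIC version string, or None if absent."""
    try:
        return version("pyscenic")
    except PackageNotFoundError:
        return None


print(pyscenic_version())
```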

Yakun-Pang commented 1 year ago

> Which version of pySCENIC was this with? The arboreto step in 0.12.0 and higher now uses the multiprocessing implementation by default, as dask seems to be too unreliable.

I am using 0.11.0. I had some issues with the updated versions of pySCENIC: I got a bus error with dask, and arboreto was also extremely slow.