aertslab / pySCENIC

pySCENIC is a lightning-fast Python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering), which enables biologists to infer transcription factors, gene regulatory networks, and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

GRNinference failure #139

Closed. TatyanaLev closed this issue 4 years ago.

TatyanaLev commented 4 years ago

I am trying to run pySCENIC through Nextflow and Singularity on a fairly large dataset (a ~4 GB loom file: ~64,000 cells by ~35,000 genes). I get errors here:

```
[1f/25a8ab] process > filter        [100%] 1 of 1 ✔
[f5/a29a08] process > GRNinference  [100%] 1 of 1, failed: 1 ✘
[17/27ac18] process > preprocess    [100%] 1 of 1, failed: 1
WARN: Killing pending tasks (1)
ERROR ~ Error executing process > 'GRNinference'
```

```
N E X T F L O W  ~  version 20.01.0 build 5264
Singularity image docker://aertslab/pyscenic:0.9.19
```

Below is a snippet from another user's comment; it seems that large data creates problems. Do you have suggestions on parallelizing/multithreading so that data of this size can run?

> .... using the image: aertslab-pySCENIC-0.9.18.img on an HPC. ... Curiously, this only happens when my data frame is large (~64k cells). A subset of the same data with only 3k cells exported works fine without any issue using the same setup. ...
>
> Originally posted by @FloWuenne in https://github.com/aertslab/pySCENIC/issues/108#issuecomment-581426235

cflerin commented 4 years ago

Hi @TatyanaLev

Could you add some additional information?

How did you create the loom file (Seurat/Scanpy/Loompy)?

What is the exact error you're getting? It would help if you paste the whole thing in a code block.

Can I also suggest that you update to the latest version of the nextflow pipeline, as there is an update that might fix your GRNinference problems (run `nextflow pull aertslab/SCENICprotocol`).

TatyanaLev commented 4 years ago

Hi @cflerin, thanks for the prompt response!

1. The loom file was made in Seurat.
2. I updated the version and reran:
   - 3,881 cells (subset with loompy): this ran successfully, but took ~24 hours on the HPC.
   - the full 61,571-cell original loom: this crashed within a few hours:

```
Nextflow installation completed.

N E X T F L O W  ~  version 19.04.1
Launching aertslab/SCENICprotocol [astonishing_shaw] - revision: 53ee9b4050 [master]
Checking aertslab/SCENICprotocol ... Already-up-to-date - revision: 53ee9b4050 [master]

Project name: SCENIC Protocol
Project dir : /data/users/tzhuravl/.nextflow/assets/aertslab/SCENICprotocol
Git info    : https://github.com/aertslab/SCENICprotocol.git - master [53ee9b4050dce97e50de55a323e2a9ee543f1675]
Cmd line    : nextflow run aertslab/SCENICprotocol -profile singularity --loom_input NucSeq_batch_correct.loom --loom_output fullDataOutput.loom --TFs allTFs_hg38.txt --motifs motifs-v9-nr.hgnc-m0.001-o0.0.tbl --db hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather
Pipeline version: 0.2.0

Parameters in use:
  loom_input         = NucSeq_batch_correct.loom
  loom_filtered      = filtered.loom
  loom_output        = fullDataOutput.loom
  thr_min_genes      = 200
  thr_min_cells      = 3
  thr_n_genes        = 5000
  thr_pct_mito       = 0.25
  outdir             = output
  threads            = 6
  TFs                = allTFs_hg38.txt
  motifs             = motifs-v9-nr.hgnc-m0.001-o0.0.tbl
  db                 = hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather
  pyscenic_output    = pyscenic_output.loom
  grn                = grnboost2
  cell_id_attribute  = CellID
  gene_attribute     = Gene
  pyscenic_container = aertslab/pyscenic:0.9.19

WARNING: only using a single feather database: /dfs3/pub/tzhuravl/scenic_with_hg38db/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather. To include all database files using pattern matching, make sure the value for the '--db' parameter is enclosed in quotes!

[warm up] executor > local
executor > local (3)
[9b/b2099f] process > filter        [100%] 1 of 1 ✔
[f2/f2fc71] process > GRNinference  [  0%] 0 of 1
[d7/9c1482] process > preprocess    [  0%] 0 of 1
WARN: Access to undefined parameter `seed` -- Initialise it to a default value eg. `params.seed = some_value`
```

```
executor > local (3)
[9b/b2099f] process > filter        [100%] 1 of 1 ✔
[f2/f2fc71] process > GRNinference  [  0%] 0 of 1
[d7/9c1482] process > preprocess    [100%] 1 of 1, failed: 1 ✘
ERROR ~ Error executing process > 'preprocess'

Caused by:
  Process `preprocess` terminated with an error exit status (1)

Command executed:
  preprocess_visualize_project_scanpy.py preprocess --loom_filtered filtered.loom --anndata anndata.h5ad --threads 6

Command exit status:
  1

Command output:
  (empty)

Command error:
  /opt/venv/lib/python3.7/site-packages/dask/config.py:161: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    data = yaml.load(f.read()) or {}
  /opt/venv/lib/python3.7/site-packages/louvain/Optimiser.py:349: SyntaxWarning: assertion is always true, perhaps remove parentheses?
    assert(issubclass(partition_type, LinearResolutionParameterVertexPartition),
  scanpy==1.4.4.post1 anndata==0.6.22.post1 umap==0.3.10 numpy==1.17.2 scipy==1.3.1 pandas==0.25.1 scikit-learn==0.21.3 statsmodels==0.10.1 python-igraph==0.7.1 louvain==0.6.1
  Traceback (most recent call last):
    File "/data/users/tzhuravl/.nextflow/assets/aertslab/SCENICprotocol/bin/preprocess_visualize_project_scanpy.py", line 134, in <module>
      func( args )
    File "/data/users/tzhuravl/.nextflow/assets/aertslab/SCENICprotocol/bin/preprocess_visualize_project_scanpy.py", line 37, in preprocess
      sc.pp.regress_out(adata, ['n_counts', 'percent_mito'], n_jobs=args.threads)
    File "/opt/venv/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py", line 799, in regress_out
      pool = multiprocessing.Pool(n_jobs)
    File "/usr/local/lib/python3.7/multiprocessing/context.py", line 119, in Pool
      context=self.get_context())
    File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
      self._repopulate_pool()
    File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
      w.start()
    File "/usr/local/lib/python3.7/multiprocessing/process.py", line 112, in start
      self._popen = self._Popen(self)
    File "/usr/local/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
      return Popen(process_obj)
    File "/usr/local/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
      self._launch(process_obj)
    File "/usr/local/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
      self.pid = os.fork()
  OSError: [Errno 12] Cannot allocate memory

Work dir:
  /dfs3/pub/tzhuravl/scenic_with_hg38db/work/d7/9c14820e3cd05e0a46a77dbcb6bbb1

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (1)
```

cflerin commented 4 years ago

Hi @TatyanaLev , thanks for pasting the log... From this:

```
OSError: [Errno 12] Cannot allocate memory
```

it seems you're running out of memory. How much does your machine have available? Note that this failure happens before you even get to the GRN inference step: it's at the regress-out step in `preprocess`. I'm not sure whether the Seurat-created loom could be causing a separate issue.

TatyanaLev commented 4 years ago

Hi @cflerin,

Our HPC admins say that I can specify the amount of memory in the job submission script; different hosts/nodes have different amounts available. What amount do you suggest trying? I can submit another job.

Meanwhile, over the weekend I had submitted another job requesting 6 cores (but with no changes to the nextflow/singularity parameters; should I have updated the number of workers or threads?). That run has gotten farther, but has been at this step for ~40 hours (~50 hours since job submission):

```
executor > local (6)
[30/f5dbfc] process > filter        [100%] 1 of 1 ✔
[ed/76c738] process > GRNinference  [  0%] 0 of 1
[6f/5d1d58] process > preprocess    [100%] 1 of 1 ✔
[79/f3f630] process > pca           [100%] 1 of 1 ✔
[fc/7b853f] process > visualize     [100%] 1 of 1 ✔
[b5/4d6aab] process > cluster       [100%] 1 of 1 ✔
```

When I look at the `.nextflow.log`, it says the following:

```
Version: 19.04.1 build 5072
Modified: 03-05-2019 12:29 UTC (05:29 PDT)
System: Linux 3.10.107-1.el6.elrepo.x86_64
Runtime: Groovy 2.5.6 on OpenJDK 64-Bit Server VM 11.0.2+9
Encoding: UTF-8 (UTF-8)
Process: 15470@compute-2-4.local [10.1.255.228]
CPUs: 6 - Mem: 504.9 GB (484.3 GB) - Swap: 4.9 GB (4.9 GB)
```

... and the last update is:

```
Feb-17 14:04:29.920 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- pending tasks are shown below
~> TaskHandler[id: 3; name: GRNinference; status: RUNNING; exit: -; error: -; workDir: /dfs3/pub/tzhuravl/scenic_with_hg38db/work/ed/76c73851df5427916d76df25f1237f]
```

cflerin commented 4 years ago

Ok great, so you made it past the initial memory error and it looks like GRNBoost2 is running, which suggests the loom file is not the issue. But using only 6 cores with 60k cells / 35k genes will take quite a while. I would increase the number of cores to as many as your cluster will allow (for instance, by passing `--threads 36` to the nextflow command). If you're able to use the full 500 GB on that node, use as many cores as are available. For reference, I recently ran a dataset with 100k cells and 15k genes using 35 processes, and it took around 40 hours for the GRNBoost2 step alone.
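As a rough illustration, a resubmission might look like the sketch below (SLURM directives shown as an example; the directive names, the 500G memory request, and the 36-core figure are placeholders you should match to your own scheduler and allocation):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=36   # placeholder: cores for GRNBoost2 workers
#SBATCH --mem=500G           # placeholder: request the node's full memory

# Pass the same core count through to the pipeline:
nextflow run aertslab/SCENICprotocol \
    -profile singularity \
    --loom_input NucSeq_batch_correct.loom \
    --loom_output fullDataOutput.loom \
    --TFs allTFs_hg38.txt \
    --motifs motifs-v9-nr.hgnc-m0.001-o0.0.tbl \
    --db "/dfs3/pub/tzhuravl/scenic_with_hg38db/*.feather" \
    --threads 36
```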

Another point is that you're passing two feather databases to the `--db` parameter, but nextflow is only keeping the first one (these are used in the cisTarget step after GRNBoost2). Instead, you should point to a path and use globbing to select both databases (make sure the pattern is in double quotes). For example:

```
--db "/path/to/featherdbs/*feather"
```
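To see why the quotes matter (a small sketch with made-up paths): unquoted, the shell expands the glob before nextflow ever runs, so the pipeline receives multiple arguments and keeps only the first; quoted, the literal pattern reaches nextflow, which then matches all the database files itself:

```shell
mkdir -p /tmp/featherdb_demo
touch /tmp/featherdb_demo/a.feather /tmp/featherdb_demo/b.feather

# Unquoted: the shell expands the pattern into two separate arguments.
echo /tmp/featherdb_demo/*feather
# prints: /tmp/featherdb_demo/a.feather /tmp/featherdb_demo/b.feather

# Quoted: the pattern is passed through literally for nextflow to expand.
echo "/tmp/featherdb_demo/*feather"
# prints: /tmp/featherdb_demo/*feather
```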
TatyanaLev commented 4 years ago

Thanks! I will work with our HPC admins to see how to get the maximum amount of memory and submit again. To clarify, are both feather databases required -- both 10kb up/down and 500bp up/down?

cflerin commented 4 years ago

Hi @TatyanaLev ,

No, it's not required to use both databases; it's entirely up to you. I typically use the hg38 databases `hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather` and `hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather`.