aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

pyscenic took a long time and crashed when running on a laptop machine #15

Closed. dnguyen179 closed this issue 6 years ago.

dnguyen179 commented 6 years ago

Hello, I am running pySCENIC on my laptop (MacPro, 2.6 GHz Intel Core i7) with the following command:

pyscenic grnboost -o scenic_out/grn_output.tsv @grn_args.txt

where grn_args.txt contains the file names of the expression data file and the known TFs file.
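Each argument in grn_args.txt is on its own line; the actual paths are omitted here, so the two entries below are just placeholders:

resources/expression_matrix.tsv
resources/mouse_tfs.txt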

The message that I got from stdout is:

computing dask graph
/Users/ngurb2/anaconda3/lib/python3.6/site-packages/distributed/worker.py:742: UserWarning: Large object of size 1.17 MB detected in task graph:
  (["('from-delayed-88e08b64b679c247402263c72630abb5 ... b5', 19971)"],)
Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers

future = client.submit(func, big_data) # bad

big_future = client.scatter(big_data)     # good
future = client.submit(func, big_future)  # good

% (format_bytes(len(b)), s))

The process stalled and then crashed, and I do not know what the issue is. Could it be related to the parallel computing? Thank you!

bramvds commented 6 years ago

Dear,

You should preferably run the first and second steps of (py)SCENIC on a Linux box rather than on your own laptop. However, I did run these steps myself on my laptop (i7, "8 cores") and did not run into these problems. I used the following statements in a Jupyter notebook:

import os
import pandas as pd
from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2

RESOURCES_FOLDER = "."
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt')
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")

# Load the expression matrix (genes x cells on disk) and transpose it so that
# cells are rows and genes are columns, as expected by grnboost2.
ex_matrix = pd.read_csv(SC_EXP_FNAME, sep='\t', header=0, index_col=0).T
tf_names = load_tf_names(MM_TFS_FNAME)

# Infer the co-expression network (TF-target adjacencies).
adjacencies = grnboost2(expression_data=ex_matrix, tf_names=tf_names, verbose=True)
adjacencies.head()
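If this runs through, you can persist the adjacencies for the next steps with plain pandas (the output file name below is just an example):

# Write the TF-target adjacencies to a TSV file for later use.
adjacencies.to_csv('grn_output.tsv', sep='\t', index=False)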

Could you try these statements in a notebook?

Kindest regards, Bram

dnguyen179 commented 6 years ago

I ran the code above and there were issues with the distributed package and its workers, leading to multiple worker processes being killed by an unknown signal. It seems to be a multi-processing issue. This is an example of the error output:

RuntimeError:
An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

distributed.nanny - WARNING - Worker process 98717 was killed by unknown signal

ghuls commented 6 years ago

@dnguyen179 Maybe you are running out of memory.

dnguyen179 commented 6 years ago

I also ran pyscenic grnboost on a cluster (Linux system) and this is the error that I got:

/usr/local/python/3.5.3/lib/python3.5/site-packages/distributed/worker.py:742: UserWarning: Large object of size 1.17 MB detected in task graph:
  (["('from-delayed-88e08b64b679c247402263c72630abb5 ... b5', 19971)"],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

  % (format_bytes(len(b)), s))

Future exception was never retrieved
future: <Future finished exception=CommClosedError('in : Stream is closed',)>

bramvds commented 6 years ago

Hi,

First of all regarding your message:

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

Please have a look at issue 13. If you run the code snippet from a script, you should wrap it in a function and call it from your script using:

if __name__ == "__main__":
    function()
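For example, a minimal sketch of such a script, reusing the file names from my earlier snippet (adjust the paths to your own data; writing the result to a TSV at the end is my own addition):

import pandas as pd
from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2

def run_grn():
    # Same inputs as in the earlier notebook example.
    ex_matrix = pd.read_csv("GSE60361_C1-3005-Expression.txt", sep='\t', header=0, index_col=0).T
    tf_names = load_tf_names("mm_tfs.txt")
    adjacencies = grnboost2(expression_data=ex_matrix, tf_names=tf_names, verbose=True)
    adjacencies.to_csv("grn_output.tsv", sep='\t', index=False)

if __name__ == "__main__":
    # The guard is required because dask/distributed spawns new worker processes
    # that re-import this module.
    run_grn()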

The other option would be running in a Jupyter notebook.

Regarding your issue when running grnboost on a Linux node: I rely on the package arboreto for the grnboost phase of pySCENIC, so could you post your issue there? Could you also quickly check that the expression matrix you supply to pyscenic is properly structured (i.e. cell IDs as row index and gene symbols as column index)?
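A quick way to check this, assuming you load the matrix with pandas as in the earlier snippet (the shape below is only what I would expect for GSE60361):

# Cells should be rows and genes columns after loading/transposing.
print(ex_matrix.shape)        # roughly (3005, ~20000) for GSE60361
print(ex_matrix.index[:5])    # should show cell identifiers
print(ex_matrix.columns[:5])  # should show gene symbols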

For phase II of the SCENIC pipeline you can use the keyword argument client_or_address='dask_multiprocessing' when invoking the prune method of your choice. This avoids using the distributed dask scheduler altogether.
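A minimal sketch of that call, along the lines of the pySCENIC example notebooks (the database and annotation file names are placeholders, and modules is the list produced by modules_from_adjacencies):

from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.prune import prune2df

# Placeholder file names: use your own ranking database(s) and motif annotation table.
dbs = [RankingDatabase(fname='mm9-tss-centered-10kb-7species.mc9nr.feather',
                       name='mm9-tss-centered-10kb-7species')]
df = prune2df(dbs, modules, 'motifs-v9-nr.mgi-m0.001-o0.0.tbl.txt',
              client_or_address='dask_multiprocessing')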

Kindest regards, Bram

dnguyen179 commented 6 years ago

Hello,

Thank you, I was able to run grnboost on an HPC cluster. However, I am running into memory issues in the ctx step when determining the regulons for the GSE60361 dataset. I am allocating 12 processors and 64 GB for the process, but it exceeds the memory limit every time. For the database files I am providing all mm9-*.feather files, and I was wondering whether one file at a time would be better in this case, since there is no instruction on how many files should be used as the database.

Thank you, Diep

bramvds commented 6 years ago

Dear Diep,

I would restrict the databases to the ones used in the pySCENIC paper, i.e. (1) 500 bp upstream of the TSS for 7 species and (2) 10 kb centered around the TSS, again for 7 species.

Make sure to set the keyword argument client_or_address for the prune method to 'dask_multiprocessing'.

If you still run into memory issues you can decrease rank_threshold to reduce the memory footprint (but not below [AUC threshold * number of genes in the database/genome]; you will get an error if you go below that).
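If you are calling the prune step from Python, that would look roughly like this (continuing the sketch from my previous message, with dbs and modules as defined there; the chosen value is illustrative only):

from pyscenic.prune import prune2df

# Lower bound mentioned above (auc_threshold defaults to 0.05 in pySCENIC).
lower_bound = int(0.05 * dbs[0].total_genes)

# Pick a rank_threshold below the default but above that bound; adjust for your database.
df = prune2df(dbs, modules, 'motifs-v9-nr.mgi-m0.001-o0.0.tbl.txt',
              rank_threshold=lower_bound + 100,
              client_or_address='dask_multiprocessing')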

Hope this helps, Bram

dnguyen179 commented 6 years ago

Dear Bram,

I am still running into memory issues even though I did set 'dask_multiprocessing' as an argument. Some warnings came up when I ran it, as follows:

/usr/local/python/3.5.3/lib/python3.5/site-packages/pyscenic/utils.py:138: RuntimeWarning: invalid value encountered in greater
  regulations = (rhos > rho_threshold).astype(int) - (rhos < -rho_threshold).astype(int)
/usr/local/python/3.5.3/lib/python3.5/site-packages/pyscenic/utils.py:138: RuntimeWarning: invalid value encountered in less
  regulations = (rhos > rho_threshold).astype(int) - (rhos < -rho_threshold).astype(int)

The version I'm running is the newest pySCENIC, pulled from GitHub today, on 16 processors and with 128 GB of memory allocated. Do you have any suggestions in this case?

Thank you, Diep

bramvds commented 6 years ago

Dear Diep,

You can safely ignore these warnings.

Regarding the problem with excessive memory usage: could you confirm that you run into this issue during the cisTarget step (i.e. the pruning of indirect targets from the modules derived from GENIE3)? This is easy to check because you should get a progress bar while pruning. Do you get the out-of-memory error immediately, or rather at the end of this phase?

Could you also give me an idea of the number of modules derived from GENIE3? You can assess this by running modules = list(modules_from_adjacencies(adjacencies, ex_matrix)) directly in a script or in a Jupyter notebook (see the notebook, or the short snippet below).
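Concretely (assuming you still have the adjacencies and the expression matrix from the GRN step in memory):

from pyscenic.utils import modules_from_adjacencies

# Derive the co-expression modules from the GRNBoost2 adjacencies and count them.
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))
print(len(modules))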

Kindest regards, Bram

dnguyen179 commented 6 years ago

Dear Bram,

Yes, the memory issue occurs during the cisTarget step. The progress bar is usually at 0-1% when the program exceeds the allocated memory, but it is definitely in the pruning step since the output includes: "Less than 80% of genes ... Skipping this module"

The number of modules obtained by running modules = list(modules_from_adjacencies(adjacencies, ex_matrix)) is 8662 (derived from grnboost2). This is again the sample dataset GSE60361.

Thank you so much, Diep

bramvds commented 6 years ago

Dear Diep,

This is strange. To better pinpoint the problem, could you run the prune/cisTarget step with client_or_address='custom_multiprocessing'? This will not use the dask framework. You will not get a progress bar, so be patient; in this case the memory needs are mainly determined by the number of processors (i.e. size of the database [1.1 GB] * n_cpus).
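In code that would be along these lines (I'm assuming your version exposes the num_workers keyword; the value 4 is just an example):

from pyscenic.prune import prune2df

# Fewer workers means a smaller memory footprint (roughly database size x num_workers).
df = prune2df(dbs, modules, 'motifs-v9-nr.mgi-m0.001-o0.0.tbl.txt',
              client_or_address='custom_multiprocessing',
              num_workers=4)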

Kr, Bram

dnguyen179 commented 6 years ago

Dear Bram,

I ran it again, and it still exceeded the memory limit. To confirm, I am using this command in my job script to run pySCENIC on the HPC cluster:

pyscenic ctx -o scenic_out/ctx_output_GSE60363_June14.csv --annotations_fname resources/motifs-v9-nr.mgi-m0.001-o0.0.tbl.txt --client_or_address custom_multiprocessing --expression_mtx_fname resources/transGSE60361_C1-3005-Expression.tsv @ctx_args.txt

One thing to note is that even though I specified --client_or_address custom_multiprocessing, I still got a progress bar, and the program stopped at 8% this time.

Thank you, Diep

bramvds commented 6 years ago

If running from the command line you should use the option --mode custom_multiprocessing instead of --client_or_address custom_multiprocessing. The latter only works as a keyword argument (client_or_address) to the functions prune2df and prune in the pyscenic.prune module.
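So in your job script the invocation would become something like:

pyscenic ctx -o scenic_out/ctx_output_GSE60363_June14.csv --annotations_fname resources/motifs-v9-nr.mgi-m0.001-o0.0.tbl.txt --mode custom_multiprocessing --expression_mtx_fname resources/transGSE60361_C1-3005-Expression.tsv @ctx_args.txt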

dnguyen179 commented 6 years ago

Dear Bram,

When I allocated 250 GB, it seemed to be working. However, one thing I noticed is this error in my log file:

Unable to process "Regulon for Foxj2" on database "mm9-tss-centered-10kb-7species" because ran out of memory. Stacktrace:
Traceback (most recent call last):
  File "/usr/local/python/3.5.3/lib/python3.5/site-packages/pyscenic/transform.py", line 185, in module2df
    weighted_recovery=weighted_recovery)
  File "/usr/local/python/3.5.3/lib/python3.5/site-packages/pyscenic/transform.py", line 157, in module2features_auc1stimpl
    rccs, = recovery(df, db.total_genes, weights, rank_threshold, auc_threshold, no_auc=True)
  File "/usr/local/python/3.5.3/lib/python3.5/site-packages/pyscenic/recovery.py", line 80, in recovery
    rccs = rcc2d(rankings, weights, rank_threshold)
  File "/usr/local/python/3.5.3/lib/python3.5/site-packages/pyscenic/recovery.py", line 52, in rcc2d
    rccs = np.empty(shape=(n_features, rank_threshold))  # Pre-allocation.
MemoryError

Another thing I want to ask about is the cisTarget database. I would like to construct one from ATAC-seq data, and I was wondering what the dimensions, row/column names and values in the data matrix are (I figured the columns are the genes, from reading the feather files in R).

Thank you, Diep

dnguyen179 commented 6 years ago

Dear Bram,

The memory issue does not occur when I run the cisTarget part with 250 GB and 'custom_multiprocessing'. However, the process does not terminate. The log output looks like this:

...
2018-06-21 15:44:20,838 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(13): All regulons derived.
2018-06-21 15:44:20,861 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(13): Done.
2018-06-21 15:45:19,533 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(4): All regulons derived.
2018-06-21 15:45:19,603 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(4): Done.
2018-06-21 15:51:33,154 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(1): All regulons derived.
2018-06-21 15:51:34,133 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(1): Done.
2018-06-21 15:54:28,245 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(3): All regulons derived.
2018-06-21 15:54:28,746 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(3): Done.
2018-06-21 15:56:09,250 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(10): All regulons derived.
2018-06-21 15:56:09,296 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(10): Done.
2018-06-21 15:58:06,579 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(7): All regulons derived.
2018-06-21 15:58:07,074 - pyscenic.prune - INFO - Worker mm10refseq-r8010kb_up_and_down_tss(7): Done.

Even though all workers have finished pruning, the program never writes the output file. The memory error that I mentioned previously still happens, but the process is still able to proceed to the next regulons/workers. Have you run into this issue before?

Thank you, Diep

bramvds commented 6 years ago

Dear Diep,

The default implementation is the dask framework, so I would keep using 'dask_multiprocessing' and ignore the warnings. Keep in mind for your subsequent analysis that those modules are excluded/not taken into account; however, this does not mean that the TF for such a module will not appear in the results, as a TF is represented by several modules.

I find it strange that you run into memory issues on such a small dataset (i.e. the mouse brain dataset from Zeisel et al.). I was able to run the cisTarget/prune step on my laptop (an MBP i7 with 16 GB) using both implementations with no issues at all.

Kindest regards, Bram

dnguyen179 commented 6 years ago

Dear Bram,

I am actually testing my own dataset, with ~3000 genes and ~5000 cells. Memory is not an issue in this case. The program stopped during the cisTarget/prune step and was unable to output any results, even though all workers were done. The database I am using is mm10refseq-r8010kb_up_and_down_tss.mc9nr.feather.

Have you run into this problem before?

Diep

bramvds commented 6 years ago

Dear Diep,

I presume that you are now using the 'custom_multiprocessing' option for the cisTarget step? I never ran into this problem myself, but somebody in the lab experienced it once in the past. I suggest the following approach: I'll create a new version of pySCENIC with considerably more logging, i.e. the unix process IDs involved and the temporary files being created. This will let you inspect the intermediate results created by the already finished child processes, and hopefully I'll be able to pinpoint the problem.

To be honest I mostly rely on the dask-based implementation. In this implementation I never ran into a problem like you mentioned. Could you give this implementation a try?

Kindest regards, Bram

dnguyen179 commented 6 years ago

Dear Bram,

I am using dask_multiprocessing now and it is working fine on my local machine. I suspect there are issues between dask and the cluster that I had been running the program on.

Thank you for your assistance, Diep

bramvds commented 6 years ago

I'm glad to hear that it works fine on your laptop.

Kr, Bram