aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

prune2df running for more than 140h #142

Open JPcerapio opened 4 years ago

JPcerapio commented 4 years ago

Hello, I managed to get to Phase II of your tutorial using your data.

But after it had been running for 145h I stopped the process. I don't know if it is normal for it to run that long.

Thanks for your help.

Jp

Here is some info:

dbs [FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr"), FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr")]

PHASE I

network = grnboost2(expression_data=ex_matrix2, gene_names=gene_names, tf_names=tf_names)  # ~6h of running

modules = list(modules_from_adjacencies(network, ex_matrix))

PHASE II

with ProgressBar():
    df = prune2df(dbs, modules, "/home/user/pySCENIC/data_bases/Mm/motifs-v9-nr.mgi-m0.001-o0.0.tbl")

[####################################### ] | 98% Completed | 25min 37.2s
2020-02-12 15:05:46,854 - pyscenic.transform - WARNING - Less than 80% of the genes in Tcf21 could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 98% Completed | 25min 45.6s
2020-02-12 15:05:55,227 - pyscenic.transform - WARNING - Less than 80% of the genes in Mef2d could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 98% Completed | 25min 46.4s
2020-02-12 15:05:56,007 - pyscenic.transform - WARNING - Less than 80% of the genes in Meox2 could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 99% Completed | 18hr 5min 13.5s
[####################################### ] | 99% Completed | 145hr 0min 28.9s
^CProcess ForkPoolWorker-446:

jk86754 commented 4 years ago

I'm having a similar issue. The progress bar creeps up relatively fast to a point and then stalls. No error message, but no output either.

This happened both on Linux and with Anaconda on Windows.

JPcerapio commented 4 years ago

Hello @jk86754, did you let it finish? I had to stop it; I think 145h is quite a lot for a small set of samples.

Jp

cflerin commented 4 years ago

Hi @JPcerapio , @jk86754 ,

This step should definitely not take 145 hours. This looks like a bug in the pruning step, similar to #104. Running this step via the CLI seems to have worked for others; could you try that?
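Roughly, something along these lines (a sketch only; substitute the file where you saved your grnboost2 adjacencies, here called adjacencies.tsv as an example, and your own databases, annotation table, and output path):

    pyscenic ctx adjacencies.tsv \
        mm9-500bp-upstream-7species.mc9nr.feather \
        mm9-tss-centered-10kb-7species.mc9nr.feather \
        --annotations_fname motifs-v9-nr.mgi-m0.001-o0.0.tbl \
        --expression_mtx_fname ex_matrix.csv \
        --mode dask_multiprocessing \
        --output regulons.csv \
        --num_workers 20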

JPcerapio commented 4 years ago

Hey @cflerin, thanks for your answer. I will try it, but the problem with this option is that we do not have access to intermediate files or results that we would like to have.

I don't know if anyone has figured out whether the error comes from some missing dependency or library.

Jp

cflerin commented 4 years ago

Hi @JPcerapio, which intermediate files are you referring to? When you run this step in the CLI, you can still get the motif and regulon information. Although the CLI outputs only one of these, you can convert to the other without re-running; see #100 for an example.

morganee261 commented 4 years ago

Hello, I am using the pySCENIC CLI, and "Calculating regulons" has been running for over a week:

2020-04-06 09:15:03,025 - pyscenic.cli.pyscenic - INFO - Calculating regulons.

My data set is quite big (69,000 cells and 27,000 genes), but I am running on a cluster with 64 cores and 1 TB of RAM.

Thanks for your help,
Morgane

liboxun commented 4 years ago

Hi @morganee261 ,

Have you solved this problem? I'm also running the CLI (pyscenic ctx) and it's taking a long time.

Thanks, Boxun

morganee261 commented 4 years ago

Hi @liboxun,

Unfortunately no, I haven't had any luck. It has been (and still is) running for a month now, and I have not gotten an answer from the developers of this package. Thanks,

Morgane

cflerin commented 4 years ago

Hi @morganee261 , @liboxun ,

This step should definitely not take this long. If it's been running for a month there's clearly something wrong and I would stop it.

I've seen this issue a few times before, but I haven't been able to reproduce the problem to see where and why this step hangs, so I can't offer you a good solution. A few suggestions:

1. try running this step through the Docker (or Singularity) image instead of your local installation;
2. try running with a single ranking database at a time.

liboxun commented 4 years ago

Thanks a lot @cflerin ! Since I'm already running the CLI version, I'll try switching to the Docker image or using just a single feather database.

I'll post an update here once I have results.

morganee261 commented 4 years ago

Hello @cflerin,

I have been running the pyscenic ctx CLI, and that is what got stuck for over a month. I stopped it and started running it with a single feather database.

I am also trying to run the Docker image, but I am not very familiar with it and I ran into an error:

docker run -it --rm \
    -v /home/Morgane/mapping/int:/scenicdata \
    aertslab/pyscenic:[version] pyscenic grn \
    --num_workers 20 \
    -o /scenicdata/expr_mat.adjacencies.tsv \
    /scenicdata/ex_matrix.csv \
    /scenicdata/hgnc_tfs.txt

docker: invalid reference format. See 'docker run --help'.

Could you please advise?

Thanks for your reply and your help,

Morgane

liboxun commented 4 years ago

Hi @cflerin ,

I went back and ran the CLI with a single feather database, and it didn't help. It still got stuck forever at:

2020-05-08 15:14:50,014 - pyscenic.utils - INFO - Creating modules.

2020-05-08 15:16:46,513 - pyscenic.cli.pyscenic - INFO - Loading databases.

2020-05-08 15:16:46,515 - pyscenic.cli.pyscenic - INFO - Calculating regulons.
slurmstepd: error: JOB 1697596 ON NucleusA007 CANCELLED AT 2020-05-10T15:14:05 DUE TO TIME LIMIT

But when I tried the Singularity image of pySCENIC 0.10.0 (since Docker isn't available on our HPC system), it certainly helped. Now I actually got a progress bar, although it failed at 57%:

[###################### ] | 57% Completed | 3hr 9min 9.2s

It failed because it ran out of memory:

2020-05-08 21:23:27,584 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF165 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2020-05-08 21:24:06,771 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF2 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2020-05-08 21:47:28,929 - pyscenic.transform - ERROR - Unable to process "Regulon for NFKB1" on database "hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr" because ran out of memory. Stacktrace:

2020-05-08 21:47:31,092 - pyscenic.transform - ERROR - Unable to process "Regulon for ZNF81" on database "hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr" because ran out of memory. Stacktrace:

2020-05-08 21:47:51,126 - pyscenic.transform - ERROR - Traceback (most recent call last):
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 185, in module2df
    weighted_recovery=weighted_recovery)
  File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 159, in module2features_auc1st_impl
    avg2stdrcc = avgrcc + 2.0 * rccs.std(axis=0)
  File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 217, in _std
    keepdims=keepdims)
  File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 193, in _var
    x = asanyarray(arr - arrmean)
MemoryError: Unable to allocate array with shape (24453, 5000) and data type float64


Bus error

I used a node with 32GB memory, with 32 workers. Is that too little? What would you recommend?
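(For scale: the failed allocation in the traceback above is a 24453 × 5000 float64 array, i.e. 24453 × 5000 × 8 bytes, which is just under 1 GB; with 32 workers each building arrays of that size in parallel, 32 GB gets used up quickly.)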

Thanks! Boxun

morganee261 commented 4 years ago

Hi @liboxun,

I got it to run in less than 14 min by using the Docker image. I used 20 cores, so the more the better, I think. Here is my code (note that each command is on one line, without "\"; the code from the tutorial did not work for me):

sudo docker pull aertslab/pyscenic:0.10.0

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic grn --num_workers 20 --transpose -o /scenicdata/expr_mat.adjacencies.tsv /scenicdata/ex_matrix.csv /scenicdata/hgnc_tfs.txt

I had to transpose my expression matrix (hence --transpose) to get it in the right format, but you might not have to.

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic ctx scenicdata/expr_mat.adjacencies.tsv /scenicdata/hg19-tss-centered-10kb-7species.mc9nr.feather /scenicdata/hg19-500bp-upstream-7species.mc9nr.feather --annotations_fname /scenicdata/motifs-v9-nr.hgnc-m0.001-o0.0.tbl --expression_mtx_fname /scenicdata/ex_matrix.csv --transpose --mode "dask_multiprocessing" --output /scenicdata/regulons.csv --num_workers 20

this ran in 14 min on a server with 1 TB of RAM, using 20 out of 64 cores

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic aucell /scenicdata/ex_matrix.csv --transpose /scenicdata/regulons.csv -o /scenicdata/auc_mtx.csv --num_workers 20

this took less than 10 min

hope this helps!

morgane

liboxun commented 4 years ago

Hi @morganee261 ,

Thanks for that tip! Glad to hear it eventually worked for you.

I also got it to run (~23min) when I bumped the task over to a node with 128GB of memory (using 32 out of 32 cores).

Best, Boxun

morganee261 commented 4 years ago

Hi @cflerin

I am trying to import the results of the pySCENIC CLI (3 csv files) into R for further analysis, but I am having a lot of problems.

It seems like having a loom file helps with the import; however, your CLI tutorial exports csv files.

Could you please provide a brief tutorial on how to import them into R, so that I can run the rest of the SCENIC script and look at the data?

thanks for your help,

Morgane

morganee261 commented 4 years ago

Hi @liboxun

I am having issues with the downstream analysis. I was wondering what platform you were using and if you had any luck with it. I have imported a loom file into R but the format is very different from the tutorial.

Thanks, Morgane

liboxun commented 4 years ago

Hi @morganee261 ,

I use Python. I haven't done any downstream analysis yet. I'll let you know how it goes in the next couple of weeks.

Best of luck, Boxun

liboxun commented 4 years ago

Hi @morganee261 ,

I was able to run the example Jupyter notebook successfully for the 10x PBMC dataset:

https://github.com/aertslab/SCENICprotocol/blob/master/notebooks/PBMC10k_downstream-analysis.ipynb

This notebook was written in Python, and was meant for analysis downstream of pyscenic grn and pyscenic ctx (i.e. after you generate adj.tsv and regulons.csv).

While there were several issues (some were due to wrong versions of dependencies, which thankfully were easy enough for me to fix by myself), I could largely run through the notebook smoothly.

Hopefully this helps! I'm not sure if there's an equivalent example in R, but I'd assume there is, since the original SCENIC was written in R.

Best, Boxun

ureyandy2009 commented 4 years ago


Hi @liboxun, I met the same problem as you. The progress bar creeps up relatively fast to 97% and then stalls there. No error message, but no output either. I noticed my 64 GB of RAM was used up and no RAM was released; it seems something is eating all the memory. Could you kindly tell me how you finally worked it out? Did you use the Docker image, use only one feather database, or just move to a more powerful computer? Also, could you tell me the versions you used (Python, the CLI, Jupyter, and so on)?

Many thanks.

Weijian

liboxun commented 4 years ago

Hi @ureyandy2009 ,

For me, a combination of two changes worked:

  1. I switched from the CLI to the Singularity image (the Docker image should work the same way); there is a rough sketch after this list.
  2. I used a computer with 128 GB of RAM instead of 32 GB.
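Roughly, the Singularity part looks like this (a sketch only, not the exact commands; the image tag, database, and file names here just mirror the Docker example above):

    # build a local image from the Docker Hub tag
    singularity build aertslab-pyscenic-0.10.0.sif docker://aertslab/pyscenic:0.10.0

    # run the pruning step inside the container, from the directory holding the data
    # (Singularity binds the current working directory by default)
    singularity exec aertslab-pyscenic-0.10.0.sif pyscenic ctx \
        expr_mat.adjacencies.tsv \
        hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather \
        --annotations_fname motifs-v9-nr.hgnc-m0.001-o0.0.tbl \
        --expression_mtx_fname ex_matrix.csv \
        --output regulons.csv \
        --num_workers 32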

Hopefully this helps!

Best, Boxun

ureyandy2009 commented 4 years ago


Thank you very much.

I think RAM may be the main problem. In my case (24 processors at 4.2 GHz and 64 GB of RAM), one feather database costs about 40 GB of RAM, so the computer shut down when I used 2 feathers at the same time. The problem was solved when I used only one feather, which used 40 GB of the 64 GB, and then prune2df ran in less than 10 min.
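In case it helps, here is a rough sketch of the one-database-at-a-time idea with the Python API (untested; dbs, modules, ProgressBar, prune2df, and MOTIF_ANNOTATIONS_FNAME are defined as in the tutorial script, and the output filename is just an example):

    import pandas as pd

    # Prune against one ranking database at a time so that only one feather file
    # is held in memory, then combine the per-database motif tables.
    partial_dfs = []
    for db in dbs:
        with ProgressBar():
            partial_dfs.append(prune2df([db], modules, MOTIF_ANNOTATIONS_FNAME))
    df = pd.concat(partial_dfs)
    df.to_csv("motifs_all_dbs.csv")   # example output path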

Many thanks.

naila53 commented 3 years ago

I faced the same issue recently and spent 3 days trying to figure it out. A Singularity build wouldn't run for me on my institute's HPC; I kept getting this error:

    ERROR: You must install squashfs-tools to build images
    ABORT: Aborting with RETVAL=255

A conda installation of squashfs-tools didn't work, and it needed a system-wide installation, which was a hassle, so I didn't do it. What worked for me is the following:

My data set: 14766 cells × 23011 genes

1. I requested an interactive session: srun --time=20:00:00 --partition=upgrade --nodes=1 --ntasks=1 --mem=128G --cpus-per-task=40 --pty /bin/bash -l

2. I activated the conda environment where pySCENIC is installed.

3. I ran this script; everything is the same as in the tutorial page script: https://pyscenic.readthedocs.io/en/latest/tutorial.html

I just added the following on top of the tutorial's imports:

from dask.distributed import Client, LocalCluster
import anndata as ad   # to read the processed loom file

# all other imports and the *_FNAME constants are exactly as in the tutorial script linked above

if __name__ == '__main__':
    adata = ad.read_loom('adata.all.pocessed.loom')
    ex_matrix = adata.to_df()

    tf_names = load_tf_names(MM_TFS_FNAME)

    db_fnames = glob.glob(DATABASES_GLOB)
    def name(fname):
        return os.path.splitext(os.path.basename(fname))[0]
    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]

    adjacencies = pd.read_csv("net2.tsv", index_col=False, sep='\t')
    modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

    # Calculate a list of enriched motifs and the corresponding target genes for all modules.
    with ProgressBar():
        df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address=Client(LocalCluster()))

    # Create regulons from this table of enriched motifs.
    regulons = df2regulons(df)

    # Save the enriched motifs and the discovered regulons to disk.
    df.to_csv(MOTIFS_FNAME)
    with open(REGULONS_FNAME, "wb") as f:
        pickle.dump(regulons, f)

Total time consumed: 50 minutes.