aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
420 stars 179 forks source link

List modules command not finishing #76

Closed gouinK closed 5 years ago

gouinK commented 5 years ago

Greetings, I have run this package successfully on some of my datasets in the past, but I am rerunning it with pyscenic v0.9.9 (python 3.7) now and I am having an issue where the modules = list(modules_from_adjacencies(adjacencies, ex_matrix)) function never finishes (i.e. it has been running for about 3 days and seems to be stalled at 4.30G memory usage). The grnboost2 function runs correctly and seems to complete, it outputs this:

<class 'pandas.core.frame.DataFrame'> Index: 6000 entries, h1C_TGTGGTAAGGGCTTCC to h6A_ATCCGAATCCAACCAA Columns: 16607 entries, WASH9P to MYH1 dtypes: float64(16607) memory usage: 760.3+ MB preparing dask client parsing input creating dask graph 4 partitions computing dask graph shutting down client and local cluster finished

Are there any parameters that I could check that might be causing this? Thanks for all of your help!

cflerin commented 5 years ago

Hello,

Something looks a bit strange with your adjacencies matrix. Can you post the first few lines? It should look something like:

    TF  target  importance
0   TBX21   NKG7    219.412288
1   RPS4X   RPL10A  216.485291
2   RPS4X   RPL39   214.191076
3   RPS4X   RPL32   212.309751
4   RPS4X   RPS3A   212.084918

It could be something to do with your inputs to the grnboost2.

Chris

gouinK commented 5 years ago

Sorry for the delay in my response, I recently reran it on a newly created miniconda environment and it magically worked this time with literally the same input matrix. Maybe there was something wrong with my instance the first time around? Thanks for your help anyways!

gouinK commented 5 years ago

I guess I should mention that the first time (where the program seemed to stall) I was running the script through command line (i.e. python3.7 test.py > testStdOut.txt), whereas this time I actually opened python and ran each line individually so that I could check what the adjacencies matrix looked like.

cflerin commented 5 years ago

Glad you got it working! I'll close this since you got it working, but it's hard to say what happened the first time without seeing what you're doing in test.py, or seeing the first few lines of your grnboost2 output, since it apparently finished. Feel free to post these if you want to look into it further.

gouinK commented 5 years ago

So it looks like I spoke too soon in that it is actually stalling at the prune2df step and not the list modules step.

Here is my basic script: import os import glob import pickle import pandas as pd import numpy as np from dask.diagnostics import ProgressBar from arboreto.utils import load_tf_names from arboreto.algo import grnboost2 from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase from pyscenic.utils import modules_from_adjacencies, load_motifs from pyscenic.prune import prune2df, df2regulons from pyscenic.aucell import aucell import seaborn as sns

DATA_FOLDER="resources/" RESOURCES_FOLDER="resources/" DATABASE_FOLDER = "databases/"

DATABASES_GLOB = os.path.join(DATABASE_FOLDER, "hg19-*.feather") MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.hgnc-m0.001-o0.0.tbl") MM_TFS_FNAME = "hgnc_tfs.txt" SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "Human_Macs_Subsetted_ScenicInput.csv") REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p") MOTIFS_FNAME = os.path.join(DATA_FOLDER, "motifs.csv")

ex_matrix = pd.read_csv(SC_EXP_FNAME, header=0, index_col=0).T ex_matrix.info(verbose=False)

tf_names = load_tf_names(MM_TFS_FNAME)

db_fnames = glob.glob('databases/hg19-*.feather') def name(fname): return os.path.basename(fname).split(".")[0] dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]

adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True) modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

the line below is where it is stalling

df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME) regulons = df2regulons(df)

df.to_csv(MOTIFS_FNAME) with open(REGULONS_FNAME, "wb") as f: pickle.dump(regulons, f)

df = load_motifs(MOTIFS_FNAME) with open(REGULONS_FNAME, "rb") as f: regulons = pickle.load(f)

auc_mtx = aucell(ex_matrix, regulons, num_workers=8) auc_mtx.to_pickle('auc_mtx.pkl') auc_mtx.to_csv('auc_mtx.csv')

I took a look at the output of the adjacencies matrix and it looks exactly how you posted above except I noticed that the first line does not start at 0 in the first column. I cannot post it right now as I am running the script in a screen so my output is off of the screen, but I can post it if I decide to force quit the current run.

cflerin commented 5 years ago

Can you be a more clear about what you mean when you say that the pruning step is stalling? How long has it been running, can you see any cpu usage in top, etc.? It's hard to tell what's going on without knowing what you're seeing.

You could try running the command line version of the pruning step (which also includes the module creation step). It would be something like:

pyscenic ctx \
  adj.tsv \
  hg19-db1.feather \
  hg19-db2.feather \
  --annotations_fname motifs-v9-nr.hgnc-m0.001-o0.0.tbl \
  --expression_mtx_fname expression_matrix.tsv \
  --mode "dask_multiprocessing" \
  --output reg.csv \
  --num_workers 20
gouinK commented 5 years ago

Sorry, so the memory usage has been at 4.51G for about 36 hours now - the load is 1 so it is not running any parallel jobs. I am going to force quit this run and try the command line version right now, thanks!

gouinK commented 5 years ago

So this is the head of the adjacencies matrix (I used adjacencies.to_csv() to save it to file):

,TF,target,importance 824,MAF,STAB1,129.36672685280823 824,MAF,DAB2,127.05572206534406 824,MAF,SELENOP,109.76342511176964 824,MAF,F13A1,105.82089978931855 678,THAP2,RBKS,98.23605680150084 578,NR1H3,APOC1,97.68215870794278 824,MAF,LYVE1,94.52372391455437 578,NR1H3,GM2A,93.35152484963164 170,PPARG,FABP5,89.729482378842

And now I am getting this error with the CLI: 2019-07-19 19:36:59,536 - pyscenic.cli.pyscenic - INFO - Creating modules. 2019-07-19 19:36:59,602 - pyscenic.cli.pyscenic - ERROR - could not convert string to float: 'STAB1'

cflerin commented 5 years ago

Ah, your problem here is that there's an extra column in your adjacencies file (you've written out the index along with the data). You can avoid this by doing something like:

adjacencies.to_csv("adj.csv", index=False, sep=',')

But since you already have the file, it would be simple to remove the first column:

cut -d',' -f2- adj.csv > adj_without_index.csv

and then use this file instead for the command line step.

Strange about your stalled run though, I don't see why that would happen, sorry I can't offer more help.

gouinK commented 5 years ago

Thanks! So it is now running - it is on the "Calculating regulons" step. I will let it run over night, but I am concerned that it is still going to stall because it has been running on only 1 thread for about 1.5 hours now with a constant 1.53G memory usage, and I don't see any progress bar being printed.

gouinK commented 5 years ago

So I think I figured out my issue. I tried running this cli command on my local desktop and it is showing the progress bar and actually running things in parallel as it should. After about 4 minutes it is showing 2% completed so I believe it is not going to stall.

For some reason it does not work if I run it on the computing cluster that I use at my institution and it does not work on an aws instance. Do you have any suggestions for running on aws? My local computer is going to restrict me in terms of the size of the dataset I can work with.

Thanks again for all of your help through this!

cflerin commented 5 years ago

What kind of issues are you having with your cluster and AWS? And how are you running pySCENIC on those machines? Conda environment?