Closed: gouinK closed this issue 5 years ago
Hello,
Something looks a bit strange with your adjacency matrix. Can you post the first few lines? It should look something like:

```
   TF     target  importance
0  TBX21  NKG7    219.412288
1  RPS4X  RPL10A  216.485291
2  RPS4X  RPL39   214.191076
3  RPS4X  RPL32   212.309751
4  RPS4X  RPS3A   212.084918
```
It could be something to do with your inputs to grnboost2.
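If it helps to check things on your end, here is a small sketch (the helper function and the two-row frame are made up for illustration, not part of pyscenic) that verifies an adjacencies frame has the expected `TF`/`target`/`importance` columns with a numeric importance:

```python
# A hedged sanity check (not part of pyscenic) for the frame grnboost2 returns.
import pandas as pd

# Hypothetical miniature adjacencies frame mimicking grnboost2 output.
adjacencies = pd.DataFrame({
    "TF": ["TBX21", "RPS4X"],
    "target": ["NKG7", "RPL10A"],
    "importance": [219.412288, 216.485291],
})

def check_adjacencies(adj: pd.DataFrame) -> None:
    """Raise if the frame does not look like TF/target/importance output."""
    expected = ["TF", "target", "importance"]
    if list(adj.columns) != expected:
        raise ValueError(f"unexpected columns: {list(adj.columns)}")
    if not pd.api.types.is_numeric_dtype(adj["importance"]):
        raise TypeError("importance column must be numeric")

check_adjacencies(adjacencies)
print("adjacencies look OK")
```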
Chris
Sorry for the delay in my response, I recently reran it on a newly created miniconda environment and it magically worked this time with literally the same input matrix. Maybe there was something wrong with my instance the first time around? Thanks for your help anyways!
I guess I should mention that the first time (where the program seemed to stall) I was running the script through command line (i.e. python3.7 test.py > testStdOut.txt), whereas this time I actually opened python and ran each line individually so that I could check what the adjacencies matrix looked like.
Glad you got it working! I'll close this, but it's hard to say what happened the first time without seeing what you're doing in test.py, or the first few lines of your grnboost2 output, since it apparently finished. Feel free to post these if you want to look into it further.
So it looks like I spoke too soon: it is actually stalling at the prune2df step, not the modules step.
Here is my basic script:

```python
import os
import glob
import pickle
import pandas as pd
import numpy as np
from dask.diagnostics import ProgressBar
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
import seaborn as sns

DATA_FOLDER = "resources/"
RESOURCES_FOLDER = "resources/"
DATABASE_FOLDER = "databases/"

DATABASES_GLOB = os.path.join(DATABASE_FOLDER, "hg19-*.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.hgnc-m0.001-o0.0.tbl")
MM_TFS_FNAME = "hgnc_tfs.txt"
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "Human_Macs_Subsetted_ScenicInput.csv")
REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(DATA_FOLDER, "motifs.csv")

ex_matrix = pd.read_csv(SC_EXP_FNAME, header=0, index_col=0).T
ex_matrix.info(verbose=False)

tf_names = load_tf_names(MM_TFS_FNAME)

db_fnames = glob.glob('databases/hg19-*.feather')

def name(fname):
    return os.path.basename(fname).split(".")[0]

dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]

adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True)
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
regulons = df2regulons(df)

df.to_csv(MOTIFS_FNAME)
with open(REGULONS_FNAME, "wb") as f:
    pickle.dump(regulons, f)

df = load_motifs(MOTIFS_FNAME)
with open(REGULONS_FNAME, "rb") as f:
    regulons = pickle.load(f)

auc_mtx = aucell(ex_matrix, regulons, num_workers=8)
auc_mtx.to_pickle('auc_mtx.pkl')
auc_mtx.to_csv('auc_mtx.csv')
```
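One side note on the script: it imports `ProgressBar` from dask but never activates it, which may be why no progress is visible. Wrapping a dask-backed computation in dask's `ProgressBar` context manager makes it easier to tell a stall from slow progress. A minimal self-contained sketch, with a toy delayed computation standing in for the real pyscenic call (whether a given pyscenic step reports through this mechanism depends on which scheduler mode it uses):

```python
# Sketch: dask's ProgressBar prints progress for computations run on the
# local schedulers. The delayed sum here is a stand-in for a real workload.
import dask
from dask.diagnostics import ProgressBar

lazy_total = dask.delayed(sum)(range(100))

with ProgressBar():
    total = lazy_total.compute(scheduler="threads")

print(total)  # 4950
```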
I took a look at the output of the adjacencies matrix and it looks exactly like what you posted above, except I noticed that the first line does not start at 0 in the first column. I cannot post it right now because I am running the script in a screen session and the output has scrolled off-screen, but I can post it if I decide to force-quit the current run.
Can you be a bit more clear about what you mean when you say the pruning step is stalling? How long has it been running? Can you see any CPU usage in top? It's hard to tell what's going on without knowing what you're seeing.
You could try running the command line version of the pruning step (which also includes the module creation step). It would be something like:
```
pyscenic ctx \
    adj.tsv \
    hg19-db1.feather \
    hg19-db2.feather \
    --annotations_fname motifs-v9-nr.hgnc-m0.001-o0.0.tbl \
    --expression_mtx_fname expression_matrix.tsv \
    --mode "dask_multiprocessing" \
    --output reg.csv \
    --num_workers 20
```
Sorry, so the memory usage has been at 4.51G for about 36 hours now - the load is 1, so it is not running any parallel jobs. I am going to force-quit this run and try the command line version right now, thanks!
So this is the head of the adjacencies matrix (I used adjacencies.to_csv() to save it to file):

```
,TF,target,importance
824,MAF,STAB1,129.36672685280823
824,MAF,DAB2,127.05572206534406
824,MAF,SELENOP,109.76342511176964
824,MAF,F13A1,105.82089978931855
678,THAP2,RBKS,98.23605680150084
578,NR1H3,APOC1,97.68215870794278
824,MAF,LYVE1,94.52372391455437
578,NR1H3,GM2A,93.35152484963164
170,PPARG,FABP5,89.729482378842
```
And now I am getting this error with the CLI:

```
2019-07-19 19:36:59,536 - pyscenic.cli.pyscenic - INFO - Creating modules.
2019-07-19 19:36:59,602 - pyscenic.cli.pyscenic - ERROR - could not convert string to float: 'STAB1'
```
Ah, your problem here is that there's an extra column in your adjacencies file (you've written out the index along with the data). You can avoid this by doing something like:
```python
adjacencies.to_csv("adj.csv", index=False, sep=',')
```
But since you already have the file, it would be simple to remove the first column:
```
cut -d',' -f2- adj.csv > adj_without_index.csv
```
and then use this file instead for the command line step.
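To make the index issue concrete, here is a small self-contained sketch (the one-row frame is made up for illustration) showing how pandas' default `to_csv` writes the index as an unnamed first column, which a downstream parser then reads as data, while `index=False` drops it:

```python
import pandas as pd

# Hypothetical one-row adjacencies frame with a non-default index value.
adj = pd.DataFrame(
    {"TF": ["MAF"], "target": ["STAB1"], "importance": [129.37]},
    index=[824],
)

# Default: the index becomes an extra, unnamed leading column.
with_index = adj.to_csv()
# index=False: only the three real columns are written.
without_index = adj.to_csv(index=False)

print(with_index.splitlines()[0])     # ,TF,target,importance
print(without_index.splitlines()[0])  # TF,target,importance
```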
Strange about your stalled run though, I don't see why that would happen, sorry I can't offer more help.
Thanks! So it is now running - it is on the "Calculating regulons" step. I will let it run overnight, but I am concerned that it is still going to stall: it has been running on only one thread for about 1.5 hours now with constant 1.53G memory usage, and I don't see any progress bar being printed.
So I think I figured out my issue. I tried running this CLI command on my local desktop and it is showing the progress bar and actually running things in parallel as it should. After about 4 minutes it is showing 2% completed, so I believe it is not going to stall.
For some reason it does not work when I run it on the computing cluster at my institution, and it does not work on an AWS instance either. Do you have any suggestions for running on AWS? My local computer will restrict me in terms of the size of the dataset I can work with.
Thanks again for all of your help through this!
What kind of issues are you having with your cluster and AWS? And how are you running pySCENIC on those machines? Conda environment?
Greetings, I have run this package successfully on some of my datasets in the past, but I am rerunning it with pyscenic v0.9.9 (Python 3.7) now and I am having an issue where the `modules = list(modules_from_adjacencies(adjacencies, ex_matrix))` call never finishes (i.e. it has been running for about 3 days and seems to be stalled at 4.30G memory usage). The grnboost2 function runs and seems to complete; it outputs this:
```
<class 'pandas.core.frame.DataFrame'>
Index: 6000 entries, h1C_TGTGGTAAGGGCTTCC to h6A_ATCCGAATCCAACCAA
Columns: 16607 entries, WASH9P to MYH1
dtypes: float64(16607)
memory usage: 760.3+ MB
preparing dask client
parsing input
creating dask graph
4 partitions
computing dask graph
shutting down client and local cluster
finished
```
Are there any parameters that I could check that might be causing this? Thanks for all of your help!