Open · MaybeJustJames opened this issue 4 years ago
This is filed on behalf of @saeedfc. Could you please respond to this @saeedfc?
Hi @MaybeJustJames ,
Thanks for the suggestion. pySCENIC does actually have progress tracking built in for a number of steps, although it may not always be obvious:
- GRN step: Running via Dask, you can connect to the Dask dashboard through a browser and look at the status there. If using the multiprocessing script, there's already a progress bar from tqdm.
- The ctx and AUCell steps have a tqdm progress bar (in both CLI and interactive use, I believe)
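To illustrate the kind of tqdm-style bar those steps print, here is a minimal stdlib-only sketch (not pySCENIC's actual implementation, just the general pattern of wrapping a work loop in a progress reporter):

```python
import sys

def progress(iterable, total):
    """Minimal tqdm-style progress bar, for illustration only (stdlib)."""
    for i, item in enumerate(iterable, 1):
        pct = 100 * i // total
        # overwrite the same console line with a bar and a percentage
        sys.stderr.write(f"\r[{'#' * (pct // 10):<10}] {pct:3d}% ({i}/{total})")
        sys.stderr.flush()
        yield item
    sys.stderr.write("\n")

# A long-running step would wrap its work loop like this:
results = [x * x for x in progress(range(5), total=5)]
print(results)  # → [0, 1, 4, 9, 16]
```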
But maybe you could tell me your specific use case and what kind of progress tracking would help you?
Hi @cflerin
I ran into a situation using your standard pipeline with Dask. I had a dataset of 13k cells from 10x human samples. Usually that kind of data takes no more than 20-30 hours for me, but this one ran for 5-6 days and still hadn't finished computing the adjacencies. I wasn't sure whether it was simply taking longer or whether it was a Dask issue, especially because I was also troubleshooting Dask problems and trying the solutions discussed in other threads here. I finally had to kill the process.
Here is what I tried.
```python
import os
import glob
import pickle
import pandas as pd
import numpy as np
from dask.diagnostics import ProgressBar
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
from pyscenic.aucell import derive_auc_threshold
from pyscenic.binarization import binarize
import seaborn as sns

if __name__ == '__main__':
    DATA_FOLDER = '/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells'
    RESOURCES_FOLDER = '/mnt/DATA1/Fibrosis/Human Integration and Clustering/COVID/Epithelial Cells/SCENIC/RESOURCES_FOLDER'
    DATABASES_GLOB = os.path.join(RESOURCES_FOLDER, "hg38*.mc9nr.feather")
    MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.hgnc-m0.001-o0.0.tbl")
    MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'TFs.txt')
    REGULONS_FNAME = os.path.join(DATA_FOLDER, "Regulons_Myeloid.p")
    MOTIFS_FNAME = os.path.join(DATA_FOLDER, "Regulons_motifs_Myeloid.csv")

    ex_matrix = pd.read_csv("/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells/myeloid_expression.csv",
                            sep=",", header=0, index_col=0)
    ex_matrix.shape

    tf_names = load_tf_names(MM_TFS_FNAME)

    db_fnames = glob.glob(DATABASES_GLOB)

    def name(fname):
        return os.path.basename(fname).split(".")[0]

    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
    dbs

    adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True, seed=777)
    adjacencies.to_csv("/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells/Adjacencies_Myeloid.csv",
                       index=False, sep='\t')
```
Below is what I had on screen; it sat there for 5-6 days.
```
preparing dask client
parsing input
/home/luna.kuleuven.be/u0119129/anaconda3/lib/python3.7/site-packages/arboreto/algo.py:214: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  expression_matrix = expression_data.as_matrix()
creating dask graph
6 partitions
computing dask graph
```
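As an aside, the `FutureWarning` in that log is harmless on its own: it comes from Arboreto calling the deprecated `DataFrame.as_matrix()`, which pandas removed in version 1.0. The warning suggests `.values`; `.to_numpy()` is the preferred spelling in current pandas. A quick demonstration of the equivalent calls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# .as_matrix() was removed in pandas 1.0; these are the modern equivalents:
mat = df.values        # what the warning suggests
mat2 = df.to_numpy()   # preferred in current pandas

print(type(mat).__name__, mat.shape)  # → ndarray (2, 3)
```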
I see some documentation here https://docs.dask.org/en/latest/diagnostics-distributed.html about tracking, but maybe you can give a guideline on how to track this on a local machine. Is simply adding the following at the beginning fine?
```python
from dask.distributed import Client

client = Client()  # start distributed scheduler locally; launches the dashboard
```
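For reference, a slightly fuller sketch of wiring up the local `Client` and finding the dashboard address (the `client_or_address` parameter name is an assumption based on Arboreto's API and may vary by version; the `grnboost2` call is commented out because it needs the expression data):

```python
from dask.distributed import Client

# Start a local cluster; the dashboard address is printed so it can be
# opened in a browser to watch task progress.
client = Client()
link = client.dashboard_link
print(link)  # e.g. http://127.0.0.1:8787/status

# Assumption: pass the client to grnboost2 via client_or_address so the
# dashboard reflects the GRN computation's tasks:
# adjacencies = grnboost2(ex_matrix, tf_names=tf_names,
#                         client_or_address=client, seed=777)

client.close()
```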
Thanks a lot for your time, @MaybeJustJames and @cflerin.
Ok, thanks for describing your workflow, @saeedfc. If you want to monitor the Dask progress, the first thing I would suggest is to check out the tutorial in Arboreto describing how to connect to the Dask scheduler.

Otherwise, I see what you mean about progress reporting at the command prompt, but I'm not sure we can really change that. One thing I would suggest if you're having problems with the GRN step is to try the multiprocessing script, which is more stable and gives a more informative progress report.
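For reference, the multiprocessing script ships with pySCENIC as `arboreto_with_multiprocessing.py`. A typical invocation looks roughly like this (the file names are placeholders and the exact flags may differ between versions, so check `--help` first):

```shell
arboreto_with_multiprocessing.py \
    expression_matrix.loom \
    TFs.txt \
    --method grnboost2 \
    --output adjacencies.tsv \
    --num_workers 20 \
    --seed 777
```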
Thank you @cflerin. I shall try the multiprocessing script and maybe the Dask scheduler as well.
Thanks and Kind Regards, Saeed
Many SCENIC jobs are very long-running, and a user may wonder whether progress is being made. Having a mechanism to track progress would be very useful.