aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
442 stars 182 forks

[FEATURE REQUEST] Progress tracking on long-running jobs #217

Open MaybeJustJames opened 4 years ago

MaybeJustJames commented 4 years ago

Many SCENIC jobs are very long running and a user can wonder if progress is being made. Having a mechanism to track progress would be very useful.

cflerin commented 4 years ago

Hi @MaybeJustJames ,

Thanks for the suggestion. pySCENIC does actually have progress tracking built in for a number of steps, although it's maybe not always obvious.

But maybe you could tell me your specific use case and what kind of progress tracking would help you?

MaybeJustJames commented 4 years ago

I filed this on behalf of @saeedfc. Could you please respond to this, @saeedfc?

saeedfc commented 4 years ago

> Hi @MaybeJustJames ,
>
> Thanks for the suggestion. pySCENIC does actually have progress tracking built in for a number of steps, although it's maybe not always obvious.
>
>   • GRN step: Running via Dask, you can connect to the Dask dashboard through a browser and look at the status there. If using the multiprocessing script, there's already a progress bar from tqdm.
>   • The ctx and AUCell steps have a tqdm progress bar (in both CLI and interactive use, I believe)
>
> But maybe you could tell me your specific use case and what kind of progress tracking would help you?

Hi @cflerin

I ran into this when I used your standard pipeline with Dask on a dataset of 13k cells from 10x human samples. Usually that kind of dataset never takes more than 20-30 hours for me, but this run went on for 5-6 days and still never finished computing the adjacencies. I was not sure whether it was just taking longer or whether it was a Dask issue, especially because I was also troubleshooting Dask problems and trying the solutions discussed in other threads here. I finally had to kill the process.

Here is what I tried.

import os
import glob
import pickle
import pandas as pd
import numpy as np

from dask.diagnostics import ProgressBar

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
from pyscenic.aucell import derive_auc_threshold
from pyscenic.binarization import binarize
import seaborn as sns

if __name__ == '__main__':
    DATA_FOLDER = '/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells'
    RESOURCES_FOLDER ='/mnt/DATA1/Fibrosis/Human Integration and Clustering/COVID/Epithelial Cells/SCENIC/RESOURCES_FOLDER'
    DATABASES_GLOB = os.path.join(RESOURCES_FOLDER, "hg38*.mc9nr.feather")
    MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.hgnc-m0.001-o0.0.tbl")
    MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'TFs.txt')
    REGULONS_FNAME = os.path.join(DATA_FOLDER, "Regulons_Myeloid.p")
    MOTIFS_FNAME = os.path.join(DATA_FOLDER, "Regulons_motifs_Myeloid.csv")
    ex_matrix = pd.read_csv("/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells/myeloid_expression.csv", sep = ",", header=0, index_col=0)
    ex_matrix.shape
    tf_names = load_tf_names(MM_TFS_FNAME)
    db_fnames = glob.glob(DATABASES_GLOB)

    def name(fname):
        return os.path.basename(fname).split(".")[0]
    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
    dbs
    adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True, seed = 777)
    adjacencies.to_csv("/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells/Adjacencies_Myeloid.csv", index = False, sep = '\t')

Below is what I had on screen, and it stayed like that for 5-6 days.

preparing dask client
parsing input
/home/luna.kuleuven.be/u0119129/anaconda3/lib/python3.7/site-packages/arboreto/algo.py:214: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  expression_matrix = expression_data.as_matrix()
creating dask graph
6 partitions
computing dask graph

I see some documentation here https://docs.dask.org/en/latest/diagnostics-distributed.html about tracking, but maybe you can give a guideline on how to track this on a local machine. Is simply adding the following at the beginning fine?

from dask.distributed import Client
client = Client()  # start a distributed scheduler locally and launch the dashboard
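That snippet can be sketched out a bit further. The following is a minimal sketch, assuming `dask.distributed` is installed; `client_or_address` is arboreto's `grnboost2` parameter for reusing an existing client instead of letting it create its own, and the worker counts here are illustrative values only:

```python
from dask.distributed import Client, LocalCluster

# Start a local cluster explicitly so the dashboard address is known up front.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=False)
client = Client(cluster)
print(client.dashboard_link)  # open this URL in a browser to watch task progress

# Pass the client to grnboost2 so it reuses this cluster instead of
# silently creating its own:
# adjacencies = grnboost2(ex_matrix, tf_names=tf_names, seed=777,
#                         client_or_address=client, verbose=True)

client.close()
cluster.close()
```

With the client created explicitly, `client.dashboard_link` gives the URL of the diagnostic dashboard (port 8787 by default, when bokeh is available), where task progress for the GRN step can be watched live.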

Thanks a lot for your time. @MaybeJustJames and @cflerin

cflerin commented 4 years ago

Ok, thanks for describing your workflow, @saeedfc. If you want to monitor the Dask progress, the first thing I would suggest is to check out the tutorial in Arboreto describing how to connect to the Dask scheduler.

Otherwise, I see what you mean about progress reporting at the command prompt, but I don't know if we can really change that. One thing I would suggest if you're having problems with the GRN step is to try the multiprocessing script, which is more stable and gives a more informative progress report.

saeedfc commented 4 years ago

Thank you @cflerin . I shall try the multiprocessing script and maybe the dask scheduler as well.

Thanks and Kind Regards, Saeed