aertslab / pySCENIC

pySCENIC is a lightning-fast Python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering), which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

AssertionError in prune2df #132

Closed: Matthias3033 closed this issue 4 years ago

Matthias3033 commented 4 years ago

Hi,

I get the following error message when I use the function prune2df:

AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>
      3 # Calculate a list of enriched motifs and the corresponding target genes for all modules.
      4 with ProgressBar():
----> 5     df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME_HS)
      6
      7 # Create regulons from this table of enriched motifs.

~/miniconda3/lib/python3.7/site-packages/pyscenic/prune.py in prune2df(rnkdbs, modules, motif_annotations_fname, ...)
    349     return _distributed_calc(rnkdbs, modules, motif_annotations_fname, transformation_func, aggregation_func,
    350                              motif_similarity_fdr, orthologuous_identity_threshold, client_or_address,
--> 351                              num_workers, module_chunksize)

~/miniconda3/lib/python3.7/site-packages/pyscenic/prune.py in _distributed_calc(...)
    298     if client_or_address == "dask_multiprocessing":
    299         # ... via multiprocessing.
--> 300         return create_graph().compute(scheduler='processes', num_workers=num_workers if num_workers else cpu_count())

[... intermediate dask scheduler frames (dask/base.py, dask/multiprocessing.py, dask/local.py, dask/compatibility.py) omitted ...]

~/miniconda3/lib/python3.7/site-packages/pyscenic/transform.py in modules2df()
    229     #TODO: Remove this restriction.
    230     return pd.concat([module2df(db, module, motif_annotations, weighted_recovery, False, module2features_func)
--> 231                       for module in modules])

~/miniconda3/lib/python3.7/site-packages/pyscenic/transform.py in module2df()
    183     try:
    184         df_annotated_features, rccs, rankings, genes, avg2stdrcc = module2features_func(db, module, motif_annotations,
--> 185                                                                                         weighted_recovery=weighted_recovery)
    186     except MemoryError:
    187         LOGGER.error("Unable to process \"{}\" on database \"{}\" because ran out of memory. Stacktrace:".format(module.name, db.name))

~/miniconda3/lib/python3.7/site-packages/pyscenic/transform.py in module2features_auc1st_impl()
    127     # Calculate recovery curves, AUC and NES values.
    128     # For fast unweighted implementation so weights to None.
--> 129     aucs = calc_aucs(df, db.total_genes, weights, auc_threshold)
    130     ness = (aucs - aucs.mean()) / aucs.std()

~/miniconda3/lib/python3.7/site-packages/pyscenic/recovery.py in aucs()
    282     # for calculating the maximum AUC.
    283     maxauc = float((rank_cutoff+1) * y_max)
--> 284     assert maxauc > 0

AssertionError:

As ranking database I use Homo sapiens; I do not get this error when using Mus musculus with another data set. The error mentioned in issue #85 is not present here. Does anyone have an idea how to fix this?
cflerin commented 4 years ago

Hi @Matthias3033 ,

Can you list the databases you are using here? From the error, it sounds like no genes were found in the database that overlap with your data.
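
One quick way to check this (a sketch, assuming dbs is your list of FeatherRankingDatabase objects and ex_mtx is your cells-by-genes expression DataFrame):

# Count, per ranking database, how many genes of the expression matrix it contains.
for db in dbs:
    overlap = set(ex_mtx.columns) & set(db.genes)
    print(db.name, ":", len(overlap), "of", len(ex_mtx.columns), "genes found")

If the overlap is zero for a database, the maximum attainable AUC works out to zero and you hit exactly this assertion.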

Matthias3033 commented 4 years ago

Hi @cflerin,

These are the databases that I use:

FeatherRankingDatabase(name="hg19-tss-centered-10kb-10species.mc9nr")
FeatherRankingDatabase(name="hg19-tss-centered-10kb-7species.mc9nr")
FeatherRankingDatabase(name="hg19-tss-centered-5kb-10species.mc9nr")
FeatherRankingDatabase(name="hg19-500bp-upstream-7species.mc9nr")
FeatherRankingDatabase(name="hg19-tss-centered-5kb-7species.mc9nr")
FeatherRankingDatabase(name="hg19-500bp-upstream-10species.mc9nr")

cflerin commented 4 years ago

The databases look fine (although there's no need to use the 7-species databases when you're also using the 10-species ones, it won't cause issues). Are you also using the correct motif annotations file (for human)? How many genes are in your expression matrix? And how many modules do you have?

Matthias3033 commented 4 years ago

I am using the correct motif file. The number of genes is 17,098. How do I get the number of modules? (With len(modules) I get 4996.)

cflerin commented 4 years ago

Just noticed:

    186     except MemoryError:
    187         LOGGER.error("Unable to process \"{}\" on database \"{}\" because ran out of memory.

which seems self-explanatory. You could try taking out the three 7-species databases and see if it works with the remaining ones.
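
For example, something like this (assuming dbs holds the databases listed above):

# Keep only the 10-species databases for the pruning step.
dbs_10 = [db for db in dbs if '10species' in db.name]
df = prune2df(dbs_10, modules, MOTIF_ANNOTATIONS_FNAME_HS)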

Matthias3033 commented 4 years ago

Same error. I've also tried it with only one 7-species database and still get the same error.

cflerin commented 4 years ago

How much memory do you have available on your machine? You could try reducing the number of processes that pyscenic is using...

Matthias3033 commented 4 years ago

How can I reduce the number of processes?

bramvds commented 4 years ago

Via the CLI you have the parameter --num_workers N, where N specifies the number of cores to use. Using the API from a Jupyter notebook, a similar parameter is available.
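
For example, the cistarget step would be invoked with something like pyscenic ctx ... --num_workers 4 (keeping the rest of your arguments unchanged).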

For the prune2df function (the cistarget step), the parameter name is num_workers. For grnboost, I kindly refer you to the arboreto package documentation: https://github.com/tmoerman/arboreto . Briefly, you need to use a construct like this:

from pyscenic.prune import _prepare_client
from arboreto.algo import grnboost2

# Create a local dask client with a fixed number of workers.
client, shutdown_callback = _prepare_client('local_host', num_workers=12)
# Run GRN inference against that client.
network = grnboost2(expression_data=ex_mtx, tf_names=tf_names, verbose=True, client_or_address=client)
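
The cistarget step takes the same parameter directly; a minimal sketch, reusing the dbs, modules and MOTIF_ANNOTATIONS_FNAME_HS objects from earlier in this thread:

from pyscenic.prune import prune2df

# Limit the pruning step to 4 worker processes.
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME_HS, num_workers=4)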
Matthias3033 commented 4 years ago

> How much memory do you have available on your machine? You could try reducing the number of processes that pyscenic is using...

I have 120 GB of RAM available, so memory should not normally be a problem.