aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

KeyError 'Field <GENE> does not exist in schema' at `prune2df` step for tutorial #589

Open tuanpham96 opened 1 month ago

tuanpham96 commented 1 month ago

Description

I'm running the tutorial and keep getting errors like these at the prune2df step:

Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'
[...]
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

Steps to reproduce the behavior

  1. Command run when the error occurred:
Import & Define resources

```python
# import
import os
import sys
import glob
import re
import numpy as np
import pandas as pd

from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2, genie3

from ctxcore.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell

# define paths
OUTPUT_DATA_FOLDER = "data/grn"
INPUT_EXPR_FILE = 'data/external/geo/GSE60361_C1-3005-Expression.txt'

RESOURCES_DIRECTORY = "data/external/aertslab/resources.aertslab.org/cistarget"
DATABASES_GLOB = os.path.join(
    RESOURCES_DIRECTORY,
    "databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/",
    "mm9-*.mc9nr.genes_vs_motifs.rankings.feather"
)
MOTIF_ANNOTATIONS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl"
)
MM_TFS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "tf_lists/allTFs_mm.txt"
)

REGULONS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "motifs.csv")
```

Here's what the resource directory looks like:

```
data/external/aertslab/resources.aertslab.org/cistarget
├── databases
│   └── mus_musculus
│       ├── mm10
│       │   ├── refseq_r80
│       │   │   ├── mc9nr
│       │   │   │   └── gene_based
│       │   │   └── mc_v10_clust
│       │   │       └── gene_based
│       │   └── screen
│       │       └── mc_v10_clust
│       │           └── region_based
│       └── mm9
│           ├── refseq_r45
│           │   └── mc9nr
│           │       └── gene_based
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           └── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           └── refseq_r70
│               └── mc9nr
│                   └── region_based
├── motif2tf
│   ├── motifs-v10nr_clust-nr.chicken-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl
│   ├── motifs-v8-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.hgnc-m0.001-o0.0.tbl
│   └── motifs-v9-nr.mgi-m0.001-o0.0.tbl
└── tf_lists
    ├── allTFs_dmel.txt
    ├── allTFs_hg38.txt
    └── allTFs_mm.txt
```

Load data

```python
ex_matrix = pd.read_csv(INPUT_EXPR_FILE, sep='\t', header=0, index_col=0).T
ex_matrix.shape
---
(3005, 19972)
```

```python
tf_names = load_tf_names(MM_TFS_FNAME)
len(tf_names)
---
1860
```

```python
db_fnames = glob.glob(DATABASES_GLOB)
def name(fname):
    return os.path.splitext(os.path.basename(fname))[0]
dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs
---
[FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings")]
```

Then the steps as in the tutorials:

adjacencies = grnboost2(
    ex_matrix, 
    tf_names=tf_names, 
    verbose=True
)

modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

The steps above worked fine. The next step, prune2df, is where it failed.

Since I'm running on a university HPC, I followed this comment:

with ProgressBar():
    df = prune2df(
        dbs, modules, MOTIF_ANNOTATIONS_FNAME,
        client_or_address=Client(LocalCluster())
    )

  2. Error encountered:

Here's a snippet of the traceback:


KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7f2faeb34430>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7f2faeb34c10>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531, 
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
Full traceback

```pytb
/users//data//conda-env/sequencing/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40923 instead
  warnings.warn(
/users//data//conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3169: UserWarning: Sending large graph of size 67.13 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
2024-10-23 12:20:00,160 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snapc5 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,317 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Zkscan8 could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,375 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snai2 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,558 - pyscenic.transform - WARNING - Less than 80% of the genes in Zscan4f could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,703 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9993)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snai3', gene2weight=frozendict.frozendict({'Nptxr': 3.71858079580093, '4930422G04Rik': 3.4189593778885565, 'Tst': 3.4019598703995646, 'Gm16982': 3.2794141345795023, 'Fam71d': 2.770652596133948, 'Trim65': 2.5165290070963446, '6430571L13Rik': 2.322653157302345, 'Snora68': 2.2677351112931254, 'Mir1983': 2.1049208171740688, '4833412C05Rik': 1.9567400648623148, 'Slfn9': 1.918912519709758, 'Myh14': 1.8155076543676392, 'Adrb3': 1.7655713128799888, 'Psd4': 1.7593517666032166, 'Snora5c': 1.742871015263215, 'Armc2': 1.722555406741925, 'Vmn2r87': 1.697990923076
kwargs:    {}
Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'
2024-10-23 12:20:00,708 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9994)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snapc4', gene2weight=frozendict.frozendict({'C330006A16Rik': 4.349251118259209, 'C130060C02Rik': 4.208884115579275, 'Gm7854': 3.5410158417095827, 'Sarm1': 2.927046278423663, 'Lenep': 2.785681164068822, 'B3galt6': 2.6131502759106913, 'Gm11202': 2.462704280429404, 'Sgcz': 2.3117112215460534, 'Lrpprc': 1.9038514495386707, 'Nsun7': 1.8296346841767743, 'Gpsm2': 1.8010605529771901, 'Pcnxl3': 1.7919756056901275, 'Gm5801_loc2': 1.7491126804317503, 'Wdr31': 1.570558011998353, 'Cxcl9': 1.556944185120764, 'Kctd8': 1.5382108739561857, 'Cad': 1.5234094823248008,
kwargs:    {}
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

[... TRUNCATED DUE TO LIMIT ON GITHUB ISSUE ...]

2024-10-23 12:20:03,485 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9956)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Sfpq', gene2weight=frozendict.frozendict({'Erdr1': 3.0321972987058734, 'Zc3h11a': 2.691669912271372, 'Fnbp4': 2.6088463086281473, '1110037F02Rik': 2.0684340440320277, 'Tdrd7': 2.0345506791813444, 'Polr1a': 2.010575150065573, 'Tnrc6c': 1.960275733025441, 'Nup155': 1.8890378889442363, 'Crebzf': 1.8710276416734972, 'Arid2': 1.8643832712747197, '1810026B05Rik': 1.8462831579510215, 'Snrnp70': 1.844349613916035, 'Pprc1': 1.7901300300003178, 'Cntnap5a': 1.7415507727206572, '0610009O20Rik': 1.707972820514252, 'Chd9': 1.6735837131019469, 'Pkd1l3': 1.661729812
kwargs:    {}
Exception: 'KeyError(\'Field "0610030E20Rik" does not exist in schema\')'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 2
      1 with ProgressBar():
----> 2     df = prune2df(
      3         dbs, modules, MOTIF_ANNOTATIONS_FNAME,
      4         client_or_address=Client(LocalCluster())
      5     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:424, in prune2df(rnkdbs, modules, motif_annotations_fname, rank_threshold, auc_threshold, nes_threshold, motif_similarity_fdr, orthologuous_identity_threshold, weighted_recovery, client_or_address, num_workers, module_chunksize, filter_for_annotation)
    418 # Create a distributed dataframe from individual delayed objects to avoid out of memory problems.
    419 aggregation_func = (
    420     partial(from_delayed, meta=DF_META_DATA)
    421     if client_or_address != "custom_multiprocessing"
    422     else pd.concat
    423 )
--> 424 return _distributed_calc(
    425     rnkdbs,
    426     modules,
    427     motif_annotations_fname,
    428     transformation_func,
    429     aggregation_func,
    430     motif_similarity_fdr,
    431     orthologuous_identity_threshold,
    432     client_or_address,
    433     num_workers,
    434     module_chunksize,
    435 )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:362, in _distributed_calc(rnkdbs, modules, motif_annotations_fname, transform_func, aggregate_func, motif_similarity_fdr, orthologuous_identity_threshold, client_or_address, num_workers, module_chunksize)
    357 client, shutdown_callback = _prepare_client(
    358     client_or_address,
    359     num_workers=num_workers if num_workers else cpu_count(),
    360 )
    361 try:
--> 362     return client.compute(create_graph(client), sync=True)
    363 finally:
    364     shutdown_callback(False)

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3502, in Client.compute(self, collections, sync, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, traverse, **kwargs)
   3499         futures.append(arg)
   3501 if sync:
-> 3502     result = self.gather(futures)
   3503 else:
   3504     result = futures

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:2384, in Client.gather(self, futures, errors, direct, asynchronous)
   2381     local_worker = None
   2383 with shorten_traceback():
-> 2384     return self.sync(
   2385         self._gather,
   2386         futures,
   2387         errors=errors,
   2388         direct=direct,
   2389         local_worker=local_worker,
   2390         asynchronous=asynchronous,
   2391     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:368, in modules2df()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
--> 368         [
    369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:369, in ()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
    368         [
--> 369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:287, in module2df()
    285 # Derive enriched and TF-annotated features for module.
    286 try:
--> 287     df_annotated_features, rccs, rankings, genes, avg2stdrcc = module2features_func(
    288         db, module, motif_annotations, weighted_recovery=weighted_recovery
    289     )
    290 except MemoryError:
    291     LOGGER.error(
    292         'Unable to process "{}" on database "{}" because ran out of memory. Stacktrace:'.format(
    293             module.name, db.name
    294         )
    295     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:176, in module2features_auc1st_impl()
    162 """
    163 Create a dataframe of enriched and annotated features a given ranking database and a co-expression module.
    164
   (...)
    172 :return: A dataframe with enriched and annotated features.
    173 """
    175 # Load rank of genes from database.
--> 176 df = db.load(module)
    177 features, genes, rankings = df.index.values, df.columns.values, df.values
    178 weights = (
    179     np.asarray([module[gene] for gene in genes])
    180     if weighted_recovery
    181     else np.ones(len(genes))
    182 )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/rnkdb.py:132, in load()
    128 def load(self, gs: GeneSignature) -> pd.DataFrame:
    129     # For some genes in the signature there might not be a rank available in the database.
    130     gene_set = self.geneset.intersection(set(gs.genes))
--> 132     return self.ct_db.subset_to_pandas(
    133         region_or_gene_ids=RegionOrGeneIDs(
    134             region_or_gene_ids=gene_set,
    135             regions_or_genes_type=self.ct_db.all_region_or_gene_ids.type,
    136         )
    137     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:789, in subset_to_pandas()
    785 engine = engine if engine else self.engine
    787 # Fetch scores or rankings for input region IDs or gene IDs from cisTarget database file for region IDs or
    788 # gene IDs which were not prefetched in previous calls.
--> 789 self.prefetch(region_or_gene_ids=region_or_gene_ids, engine=engine, sort=True)
    791 if not self.df_cached:
    792     raise RuntimeError(
    793         f"Prefetch failed to retrieve {self.scores_or_rankings} for "
    794         f"{region_or_gene_ids} from cisTarget database "
    795         f'"{self.ct_db_filename}".'
    796     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:739, in prefetch()
    734     self._prefetch_as_polars_dataframe(
    735         region_or_gene_ids=region_or_gene_ids, use_pyarrow=True, sort=sort
    736     )
    737 elif engine == "pyarrow":
    738     # Store prefetched data as pyarrow Table (self.df_cached) and read data with pyarrow's native IPC reader.
--> 739     self._prefetch_as_pyarrow_table(
    740         region_or_gene_ids=region_or_gene_ids, sort=sort
    741     )
    742 else:
    743     raise ValueError(
    744         f'Unsupported engine "{engine}" for reading cisTarget database.'
    745     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:678, in _prefetch_as_pyarrow_table()
    673 self.region_or_gene_ids_loaded = found_region_or_gene_ids.union(
    674     self.region_or_gene_ids_loaded
    675 )
    677 # Store new pyarrow Table with previously and newly loaded region IDs or gene IDs scores/rankings.
--> 678 self.df_cached = pa_table.select(
    679     (
    680         self.region_or_gene_ids_loaded.sort().ids
    681         if sort
    682         else self.region_or_gene_ids_loaded.ids
    683     )
    684     + (self.all_motif_or_track_ids.type.value,)
    685 )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:4207, in pyarrow.lib.Table.select()
   4205
   4206 for idx in columns:
-> 4207     idx = self._ensure_integer_index(idx)
   4208     idx = _normalize_index(idx, self.num_columns)
   4209     c_indices.push_back( idx)

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:1668, in pyarrow.lib._Tabular._ensure_integer_index()
   1666
   1667 if len(field_indices) == 0:
-> 1668     raise KeyError("Field \"{}\" does not exist in schema"
   1669                    .format(i))
   1670 elif len(field_indices) > 1:

KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531,
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
```
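
Both KeyError variants in the traceback are raised by pyarrow's column selection, so one way to narrow them down is to compare the requested gene names against the column names stored in a ranking database. The sketch below is only an illustration: `diagnose_columns` is a hypothetical helper (not part of pySCENIC or ctxcore), the column names are toy values, and it assumes the real column names were obtained separately, for example with `pyarrow.feather.read_table(path).column_names`. It reports which names are affected, not why the mismatch arises.

```python
from collections import Counter
from typing import Dict, Iterable, List


def diagnose_columns(column_names: Iterable[str], genes: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: compare requested genes with a table's column names."""
    counts = Counter(column_names)
    requested = set(genes)
    return {
        # Requested genes stored under more than one column
        # (the "exists 2 times in schema" case).
        "duplicated": sorted(g for g in requested if counts[g] > 1),
        # Requested genes with no column at all
        # (the "does not exist in schema" case).
        "missing": sorted(g for g in requested if counts[g] == 0),
    }


# Toy column names standing in for a real database schema (illustrative only).
columns = ["motifs", "Snora5c", "Snora5c", "Sarm1"]
report = diagnose_columns(columns, ["Snora5c", "Sarm1", "1600002H07Rik"])
print(report)
```

Running the same check against each of the six databases would show whether the problem is confined to particular files or shared by all of them.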

Please complete the following information:

aiohttp                   3.10.0
anndata                   0.10.8
arboreto                  0.1.6
arrow                     1.3.0
attrs                     23.2.0
boltons                   24.0.0
cloudpickle               3.0.0
ctxcore                   0.2.0
cytoolz                   0.12.3
dask                      2024.2.1
dask-expr                 0.5.3
distributed               2024.2.1
feather-format            0.4.1
frozendict                2.4.4
fsspec                    2024.6.1
interlap                  0.2.7
llvmlite                  0.43.0
loompy                    3.0.7
matplotlib                3.9.2
matplotlib-inline         0.1.7
multiprocessing_on_dill   3.5.0a4
networkx                  3.3
numba                     0.60.0
numexpr                   2.10.1
numpy                     1.26.4
numpy-groupies            0.11.2
pandas                    2.2.2
pandas-flavor             0.6.0
pyarrow                   17.0.0
pyarrow-hotfix            0.6
pyscenic                  0.12.1+8.gd2309fe
requests                  2.32.3
scanpy                    1.10.2
scikit-learn              1.5.1
scipy                     1.14.0
seaborn                   0.13.2
setuptools                71.0.4
tqdm                      4.66.4
umap-learn                0.5.6

tuanpham96 commented 1 month ago

Update: also tested with singularity and had the same error

# build image & bind path
singularity build pyscenic.sif docker://aertslab/pyscenic_scanpy:0.12.1_1.9.1
export SINGULARITY_BINDPATH="/oscar/home/$USER,/oscar/scratch/$USER,/oscar/data" # this is from our HPC's guide for binding path
# create a shell inside
singularity shell utils/pyscenic.sif

Then inside the shell I just started an ipython kernel, copied and pasted that same code. The same issues occurred.

Am I defining the right resources? Some pages under the resources URL are marked as deprecated, but I'm not sure which ones to use instead.

ghuls commented 1 month ago

Run the command line version and not the notebook version: https://pyscenic.readthedocs.io/en/latest/installation.html#docker-podman-and-singularity-apptainer-images

tuanpham96 commented 4 weeks ago

I'm using the singularity image with the CLI, and it seems to be stuck at the ctx step for more than 2 hours without finishing. I'm using --mode "custom_multiprocessing" --num_workers 40. Is that typical?

tuanpham96 commented 4 weeks ago

Never mind. From reading other issues, it seems I need more RAM and fewer cores. With 20 cores and 200 GB, it finished within 20 to 25 minutes using the singularity image with "dask_multiprocessing".

Is there a guide on the suggested minimum RAM and number of cores for each step, given the number of genes, cells, and databases?
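
As a rough illustration of the trade-off described above (fewer workers, more memory per worker), the sketch below caps the worker count by available memory. The 10 GB-per-worker default is only an assumption taken from the single run reported here (20 cores with 200 GB), and `suggest_num_workers` is a hypothetical helper, not a pySCENIC function. Actual requirements will depend on the number of genes, cells, and databases.

```python
def suggest_num_workers(cores: int, ram_gb: float, gb_per_worker: float = 10.0) -> int:
    """Hypothetical heuristic: limit workers so each gets about gb_per_worker of RAM."""
    if cores < 1 or ram_gb <= 0 or gb_per_worker <= 0:
        raise ValueError("cores, ram_gb and gb_per_worker must be positive")
    by_memory = int(ram_gb // gb_per_worker)
    # Keep at least one worker and never exceed the available cores.
    return max(1, min(cores, by_memory))


# With 40 cores but 200 GB, memory is the limiting factor.
print(suggest_num_workers(cores=40, ram_gb=200))
# With 20 cores and 200 GB, every core can be used.
print(suggest_num_workers(cores=20, ram_gb=200))
```

The result could then be passed as `--num_workers` on the command line or as `num_workers` to `prune2df`.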