KeyError 'Field <GENE> does not exist in schema' at `prune2df` step for tutorial

tuanpham96 commented 1 day ago

Description

I'm running the tutorial and keep getting errors like these at the `prune2df` step:

Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'
[...]
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

Steps to reproduce the behavior

Command run when the error occurred:

Import & Define resources

```python
# import
import os
import sys
import glob
import re

import numpy as np
import pandas as pd

from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2, genie3

from ctxcore.rnkdb import FeatherRankingDatabase as RankingDatabase

from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell

# define paths
OUTPUT_DATA_FOLDER = "data/grn"
INPUT_EXPR_FILE = 'data/external/geo/GSE60361_C1-3005-Expression.txt'

RESOURCES_DIRECTORY = "data/external/aertslab/resources.aertslab.org/cistarget"
DATABASES_GLOB = os.path.join(
    RESOURCES_DIRECTORY,
    "databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/",
    "mm9-*.mc9nr.genes_vs_motifs.rankings.feather"
)
MOTIF_ANNOTATIONS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl"
)
MM_TFS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "tf_lists/allTFs_mm.txt"
)

REGULONS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "motifs.csv")
```

Here's what the resource directory looks like:

```
data/external/aertslab/resources.aertslab.org/cistarget
├── databases
│   └── mus_musculus
│       ├── mm10
│       │   ├── refseq_r80
│       │   │   ├── mc9nr
│       │   │   │   └── gene_based
│       │   │   └── mc_v10_clust
│       │   │       └── gene_based
│       │   └── screen
│       │       └── mc_v10_clust
│       │           └── region_based
│       └── mm9
│           ├── refseq_r45
│           │   └── mc9nr
│           │       └── gene_based
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           └── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           └── refseq_r70
│               └── mc9nr
│                   └── region_based
├── motif2tf
│   ├── motifs-v10nr_clust-nr.chicken-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl
│   ├── motifs-v8-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.hgnc-m0.001-o0.0.tbl
│   └── motifs-v9-nr.mgi-m0.001-o0.0.tbl
└── tf_lists
    ├── allTFs_dmel.txt
    ├── allTFs_hg38.txt
    └── allTFs_mm.txt
```

Load data

```python
ex_matrix = pd.read_csv(INPUT_EXPR_FILE, sep='\t', header=0, index_col=0).T
ex_matrix.shape
---
(3005, 19972)
```

```python
tf_names = load_tf_names(MM_TFS_FNAME)
len(tf_names)
---
1860
```

```python
db_fnames = glob.glob(DATABASES_GLOB)

def name(fname):
    return os.path.splitext(os.path.basename(fname))[0]

dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs
---
[FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings")]
```

Then the steps as in the tutorials:

adjacencies = grnboost2(
    ex_matrix, 
    tf_names=tf_names, 
    verbose=True
)

modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

The steps above worked fine. The next step, `prune2df`, did not:

Since I'm running on a university HPC cluster, I followed this comment:

with ProgressBar():
    df = prune2df(
        dbs, modules, MOTIF_ANNOTATIONS_FNAME,
        client_or_address=Client(LocalCluster())
    )

Error encountered:

Here's a snippet of the trace back:


KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7f2faeb34430>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7f2faeb34c10>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531, 
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
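The two messages above ("exists 2 times" and "does not exist") both concern gene names looked up as column names. A quick sanity check, sketched here with plain Python, is to look for repeated gene names and for names that a reference list lacks. The helper names `find_duplicates` and `missing_from` and the toy gene lists are hypothetical, and this check does not by itself establish the cause of the error.

```python
from collections import Counter
from typing import Iterable, List, Set


def find_duplicates(names: Iterable[str]) -> List[str]:
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


def missing_from(names: Iterable[str], reference: Iterable[str]) -> Set[str]:
    """Return the names that are absent from the reference collection."""
    return set(names) - set(reference)


# Toy data standing in for real gene lists (hypothetical values).
module_genes = ["Snora5c", "Nptxr", "Snora5c", "1600002H07Rik"]
database_genes = ["Snora5c", "Nptxr", "Tst"]

duplicates = find_duplicates(module_genes)
missing = missing_from(module_genes, database_genes)
print("duplicated:", duplicates)
print("missing:", missing)
```

With the objects from this report, the same helpers could be pointed at `ex_matrix.columns` to look for repeated gene symbols in the expression matrix. The traceback shows the ranking database intersecting `self.geneset` with `gs.genes`, so comparing a module's `genes` against a database's `geneset` may also work, though treating those attributes as a public interface is an assumption.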

Full traceback

```pytb
/users//data//conda-env/sequencing/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40923 instead
  warnings.warn(
/users//data//conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3169: UserWarning: Sending large graph of size 67.13 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
2024-10-23 12:20:00,160 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snapc5 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,317 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Zkscan8 could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,375 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snai2 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,558 - pyscenic.transform - WARNING - Less than 80% of the genes in Zscan4f could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,703 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9993)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snai3', gene2weight=frozendict.frozendict({'Nptxr': 3.71858079580093, '4930422G04Rik': 3.4189593778885565, 'Tst': 3.4019598703995646, 'Gm16982': 3.2794141345795023, 'Fam71d': 2.770652596133948, 'Trim65': 2.5165290070963446, '6430571L13Rik': 2.322653157302345, 'Snora68': 2.2677351112931254, 'Mir1983': 2.1049208171740688, '4833412C05Rik': 1.9567400648623148, 'Slfn9': 1.918912519709758, 'Myh14': 1.8155076543676392, 'Adrb3': 1.7655713128799888, 'Psd4': 1.7593517666032166, 'Snora5c': 1.742871015263215, 'Armc2': 1.722555406741925, 'Vmn2r87': 1.697990923076
kwargs:    {}
Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'

2024-10-23 12:20:00,708 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9994)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snapc4', gene2weight=frozendict.frozendict({'C330006A16Rik': 4.349251118259209, 'C130060C02Rik': 4.208884115579275, 'Gm7854': 3.5410158417095827, 'Sarm1': 2.927046278423663, 'Lenep': 2.785681164068822, 'B3galt6': 2.6131502759106913, 'Gm11202': 2.462704280429404, 'Sgcz': 2.3117112215460534, 'Lrpprc': 1.9038514495386707, 'Nsun7': 1.8296346841767743, 'Gpsm2': 1.8010605529771901, 'Pcnxl3': 1.7919756056901275, 'Gm5801_loc2': 1.7491126804317503, 'Wdr31': 1.570558011998353, 'Cxcl9': 1.556944185120764, 'Kctd8': 1.5382108739561857, 'Cad': 1.5234094823248008,
kwargs:    {}
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

[... TRUNCATED DUE TO LIMIT ON GITHUB ISSUE ...]

2024-10-23 12:20:03,485 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9956)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Sfpq', gene2weight=frozendict.frozendict({'Erdr1': 3.0321972987058734, 'Zc3h11a': 2.691669912271372, 'Fnbp4': 2.6088463086281473, '1110037F02Rik': 2.0684340440320277, 'Tdrd7': 2.0345506791813444, 'Polr1a': 2.010575150065573, 'Tnrc6c': 1.960275733025441, 'Nup155': 1.8890378889442363, 'Crebzf': 1.8710276416734972, 'Arid2': 1.8643832712747197, '1810026B05Rik': 1.8462831579510215, 'Snrnp70': 1.844349613916035, 'Pprc1': 1.7901300300003178, 'Cntnap5a': 1.7415507727206572, '0610009O20Rik': 1.707972820514252, 'Chd9': 1.6735837131019469, 'Pkd1l3': 1.661729812
kwargs:    {}
Exception: 'KeyError(\'Field "0610030E20Rik" does not exist in schema\')'

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 2
      1 with ProgressBar():
----> 2     df = prune2df(
      3         dbs, modules, MOTIF_ANNOTATIONS_FNAME,
      4         client_or_address=Client(LocalCluster())
      5     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:424, in prune2df(rnkdbs, modules, motif_annotations_fname, rank_threshold, auc_threshold, nes_threshold, motif_similarity_fdr, orthologuous_identity_threshold, weighted_recovery, client_or_address, num_workers, module_chunksize, filter_for_annotation)
    418 # Create a distributed dataframe from individual delayed objects to avoid out of memory problems.
    419 aggregation_func = (
    420     partial(from_delayed, meta=DF_META_DATA)
    421     if client_or_address != "custom_multiprocessing"
    422     else pd.concat
    423 )
--> 424 return _distributed_calc(
    425     rnkdbs,
    426     modules,
    427     motif_annotations_fname,
    428     transformation_func,
    429     aggregation_func,
    430     motif_similarity_fdr,
    431     orthologuous_identity_threshold,
    432     client_or_address,
    433     num_workers,
    434     module_chunksize,
    435 )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:362, in _distributed_calc(rnkdbs, modules, motif_annotations_fname, transform_func, aggregate_func, motif_similarity_fdr, orthologuous_identity_threshold, client_or_address, num_workers, module_chunksize)
    357 client, shutdown_callback = _prepare_client(
    358     client_or_address,
    359     num_workers=num_workers if num_workers else cpu_count(),
    360 )
    361 try:
--> 362     return client.compute(create_graph(client), sync=True)
    363 finally:
    364     shutdown_callback(False)

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3502, in Client.compute(self, collections, sync, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, traverse, **kwargs)
   3499     futures.append(arg)
   3501 if sync:
-> 3502     result = self.gather(futures)
   3503 else:
   3504     result = futures

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:2384, in Client.gather(self, futures, errors, direct, asynchronous)
   2381     local_worker = None
   2383 with shorten_traceback():
-> 2384     return self.sync(
   2385         self._gather,
   2386         futures,
   2387         errors=errors,
   2388         direct=direct,
   2389         local_worker=local_worker,
   2390         asynchronous=asynchronous,
   2391     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:368, in modules2df()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
--> 368         [
    369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:369, in ()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
    368         [
--> 369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:287, in module2df()
    285 # Derive enriched and TF-annotated features for module.
    286 try:
--> 287     df_annotated_features, rccs, rankings, genes, avg2stdrcc = module2features_func(
    288         db, module, motif_annotations, weighted_recovery=weighted_recovery
    289     )
    290 except MemoryError:
    291     LOGGER.error(
    292         'Unable to process "{}" on database "{}" because ran out of memory. Stacktrace:'.format(
    293             module.name, db.name
    294         )
    295     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:176, in module2features_auc1st_impl()
    162 """
    163 Create a dataframe of enriched and annotated features a given ranking database and a co-expression module.
    164
   (...)
    172 :return: A dataframe with enriched and annotated features.
    173 """
    175 # Load rank of genes from database.
--> 176 df = db.load(module)
    177 features, genes, rankings = df.index.values, df.columns.values, df.values
    178 weights = (
    179     np.asarray([module[gene] for gene in genes])
    180     if weighted_recovery
    181     else np.ones(len(genes))
    182 )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/rnkdb.py:132, in load()
    128 def load(self, gs: GeneSignature) -> pd.DataFrame:
    129     # For some genes in the signature there might not be a rank available in the database.
    130     gene_set = self.geneset.intersection(set(gs.genes))
--> 132     return self.ct_db.subset_to_pandas(
    133         region_or_gene_ids=RegionOrGeneIDs(
    134             region_or_gene_ids=gene_set,
    135             regions_or_genes_type=self.ct_db.all_region_or_gene_ids.type,
    136         )
    137     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:789, in subset_to_pandas()
    785 engine = engine if engine else self.engine
    787 # Fetch scores or rankings for input region IDs or gene IDs from cisTarget database file for region IDs or
    788 # gene IDs which were not prefetched in previous calls.
--> 789 self.prefetch(region_or_gene_ids=region_or_gene_ids, engine=engine, sort=True)
    791 if not self.df_cached:
    792     raise RuntimeError(
    793         f"Prefetch failed to retrieve {self.scores_or_rankings} for "
    794         f"{region_or_gene_ids} from cisTarget database "
    795         f'"{self.ct_db_filename}".'
    796     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:739, in prefetch()
    734     self._prefetch_as_polars_dataframe(
    735         region_or_gene_ids=region_or_gene_ids, use_pyarrow=True, sort=sort
    736     )
    737 elif engine == "pyarrow":
    738     # Store prefetched data as pyarrow Table (self.df_cached) and read data with pyarrow's native IPC reader.
--> 739     self._prefetch_as_pyarrow_table(
    740         region_or_gene_ids=region_or_gene_ids, sort=sort
    741     )
    742 else:
    743     raise ValueError(
    744         f'Unsupported engine "{engine}" for reading cisTarget database.'
    745     )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:678, in _prefetch_as_pyarrow_table()
    673 self.region_or_gene_ids_loaded = found_region_or_gene_ids.union(
    674     self.region_or_gene_ids_loaded
    675 )
    677 # Store new pyarrow Table with previously and newly loaded region IDs or gene IDs scores/rankings.
--> 678 self.df_cached = pa_table.select(
    679     (
    680         self.region_or_gene_ids_loaded.sort().ids
    681         if sort
    682         else self.region_or_gene_ids_loaded.ids
    683     )
    684     + (self.all_motif_or_track_ids.type.value,)
    685 )

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:4207, in pyarrow.lib.Table.select()
   4205
   4206 for idx in columns:
-> 4207     idx = self._ensure_integer_index(idx)
   4208     idx = _normalize_index(idx, self.num_columns)
   4209     c_indices.push_back( idx)

File ~/data//conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:1668, in pyarrow.lib._Tabular._ensure_integer_index()
   1666
   1667 if len(field_indices) == 0:
-> 1668     raise KeyError("Field \"{}\" does not exist in schema"
   1669                    .format(i))
   1670 elif len(field_indices) > 1:

KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(, module2features_func=functools.partial(, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531,
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
```

Please complete the following information:

pySCENIC version: due to the current numpy issue, I installed via pip git+...
Installation method: first created a conda environment (Python 3.10.14), then pip git+...
Run environment: Jupyter Notebook on university HPC (1 node, 40 cores, 120g)
OS: Linux (I believe the HPC uses RHEL/9.2)
Package versions:

aiohttp                   3.10.0
anndata                   0.10.8
arboreto                  0.1.6
arrow                     1.3.0
attrs                     23.2.0
boltons                   24.0.0
cloudpickle               3.0.0
ctxcore                   0.2.0
cytoolz                   0.12.3
dask                      2024.2.1
dask-expr                 0.5.3
distributed               2024.2.1
feather-format            0.4.1
frozendict                2.4.4
fsspec                    2024.6.1
interlap                  0.2.7
llvmlite                  0.43.0
loompy                    3.0.7
matplotlib                3.9.2
matplotlib-inline         0.1.7
multiprocessing_on_dill   3.5.0a4
networkx                  3.3
numba                     0.60.0
numexpr                   2.10.1
numpy                     1.26.4
numpy-groupies            0.11.2
pandas                    2.2.2
pandas-flavor             0.6.0
pyarrow                   17.0.0
pyarrow-hotfix            0.6
pyscenic                  0.12.1+8.gd2309fe
requests                  2.32.3
scanpy                    1.10.2
scikit-learn              1.5.1
scipy                     1.14.0
seaborn                   0.13.2
setuptools                71.0.4
tqdm                      4.66.4
umap-learn                0.5.6

tuanpham96 commented 1 day ago

Update: I also tested with Singularity and got the same error.

# build image & bind path
singularity build pyscenic.sif docker://aertslab/pyscenic_scanpy:0.12.1_1.9.1
export SINGULARITY_BINDPATH="/oscar/home/$USER,/oscar/scratch/$USER,/oscar/data" # this is from our HPC's guide for binding path
# create a shell inside
singularity shell utils/pyscenic.sif

Then inside the shell I just started an ipython kernel, copied and pasted that same code. The same issues occurred.

Am I pointing at the right resources? Some pages under the resources URL are marked as deprecated, but I'm not sure which ones should replace them.

ghuls commented 23 hours ago

Run the command line version and not the notebook version: https://pyscenic.readthedocs.io/en/latest/installation.html#docker-podman-and-singularity-apptainer-images

tuanpham96 commented 13 hours ago

I'm using the Singularity image with the CLI, and it seems to be stuck at the ctx step: it has been running for more than 2 hours without finishing. I'm using --mode "custom_multiprocessing" --num_workers 40. Is that typical?
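One way to gauge how long the pruning step takes before committing to a full run is to time it on a handful of modules first. The sketch below uses the Python API rather than the CLI. The helper `prune_subset` and its defaults are hypothetical, while `client_or_address="custom_multiprocessing"` and `num_workers` are taken from the `prune2df` signature visible in the traceback. Whether this mode avoids the KeyError or runs faster is not established here.

```python
from typing import Any, Sequence


def prune_subset(
    dbs: Sequence[Any],
    modules: Sequence[Any],
    annotations_fname: str,
    max_modules: int = 20,
    num_workers: int = 4,
):
    """Run prune2df on the first `max_modules` modules only (hypothetical helper)."""
    # Imported lazily so the helper can be defined without pySCENIC installed.
    from pyscenic.prune import prune2df

    # A small slice keeps the trial run short; scale up once the timing is known.
    subset = list(modules)[:max_modules]
    return prune2df(
        dbs,
        subset,
        annotations_fname,
        client_or_address="custom_multiprocessing",
        num_workers=num_workers,
    )
```

With the variables from the original post, a trial call might look like `prune_subset(dbs, modules, MOTIF_ANNOTATIONS_FNAME)`. According to the `prune.py` lines quoted in the traceback, this mode aggregates results with `pd.concat` instead of building a Dask dataframe.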

aertslab / pySCENIC

KeyError 'Field <GENE> does not exist in schema' at `prune2df` step for tutorial #589