NotImplementedError: The Cell2location model currently does not support minified data.

forrwill commented 1 year ago

With Stereoseq and single-cell raw count data, I got the error "NotImplementedError: The Cell2location model currently does not support minified data.", but the data I use are raw counts.

vitkl commented 1 year ago

Please provide more details, exact error message, package versions.

forrwill commented 1 year ago

Package Version

absl-py 1.4.0
aiohttp 3.8.4
aiosignal 1.3.1
anndata 0.8.0
async-timeout 4.0.2
attrs 22.2.0
brotlipy 0.7.0
cached-property 1.5.2
cell2location 0.1.3
certifi 2022.12.7
cffi 1.15.0
charset-normalizer 2.0.4
chex 0.1.6
colorama 0.4.4
conda 22.11.1
conda-content-trust 0+unknown
conda-package-handling 1.8.1
contextlib2 21.6.0
contourpy 1.0.7
cryptography 36.0.0
cycler 0.11.0
dm-tree 0.1.8
docrep 0.3.2
et-xmlfile 1.1.0
etils 1.0.0
flax 0.6.4
fonttools 4.38.0
frozenlist 1.3.3
fsspec 2023.1.0
h5py 3.8.0
idna 3.3
igraph 0.10.4
importlib-resources 5.12.0
jax 0.4.4
jaxlib 0.4.4
joblib 1.2.0
kiwisolver 1.4.4
leidenalg 0.9.1
lightning-utilities 0.7.0
llvmlite 0.39.1
markdown-it-py 2.1.0
matplotlib 3.7.0
mdurl 0.1.2
ml-collections 0.1.1
msgpack 1.0.4
mudata 0.2.1
multidict 6.0.4
multipledispatch 0.6.0
natsort 8.2.0
networkx 3.0
numba 0.56.4
numpy 1.23.5
numpyro 0.11.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
opencv-python 4.7.0.68
openpyxl 3.1.1
opt-einsum 3.3.0
optax 0.1.4
orbax 0.1.2
packaging 23.0
pandas 1.5.3
patsy 0.5.3
Pillow 9.4.0
pip 21.2.4
pluggy 1.0.0
pycosat 0.6.3
pycparser 2.21
Pygments 2.14.0
pynndescent 0.5.8
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyro-api 0.1.2
pyro-ppl 1.8.4
PySocks 1.7.1
python-dateutil 2.8.2
python-igraph 0.10.4
pytorch-lightning 1.9.2
pytz 2022.7.1
PyYAML 6.0
requests 2.27.1
rich 13.3.1
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.7
ruamel-yaml-conda 0.15.100
scanpy 1.9.2
scikit-learn 1.2.1
scipy 1.10.1
scvi-tools 0.20.1
seaborn 0.12.2
session-info 1.0.0
setuptools 61.2.0
six 1.16.0
statsmodels 0.13.5
stdlib-list 0.8.0
tensorstore 0.1.32
texttable 1.6.7
threadpoolctl 3.1.0
toolz 0.12.0
torch 1.13.1
torchmetrics 0.11.1
tqdm 4.63.0
typing_extensions 4.5.0
umap-learn 0.5.3
urllib3 1.26.8
wheel 0.37.1
yarl 1.8.2
zipp 3.14.0

adamgayoso commented 1 year ago

Can you provide more info on where you downloaded this data?

forrwill commented 1 year ago

My adata was created from a GEM file, which I converted to an AnnData object; adata.X is the raw count matrix. I don't know what is wrong. Another question: what does "minified data" mean?

Traceback (most recent call last):
  File "/cell2loc/cell2loc_mapping.py", line 130, in <module>
    mod = cell2location.models.Cell2location(
  File "/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/cell2location/models/_cell2location_model.py", line 75, in __init__
    super().__init__(adata)
  File "/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/model/base/_b
    raise NotImplementedError(
NotImplementedError: The Cell2location model currently does not support minified data.

adamgayoso commented 1 year ago

Can you please provide a reproducible example of your code and the full traceback?

forrwill commented 1 year ago

My code follows the tutorial pipeline at https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html#Cell2location:-spatial-mapping. The difference is that the tutorial data is 10x spatial data and my data is Stereoseq. The code fails with:

```python
# Failing call
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    # the expected average cell abundance: tissue-dependent
    # hyper-prior which can be estimated from paired histology:
    N_cells_per_location=10,
    # hyperparameter controlling normalisation of
    # within-experiment variation in RNA detection:
    detection_alpha=20
)

# Full script
import sys,os
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from collections import defaultdict as dt

import cell2location
import scvi

from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text for PDFs

if len(sys.argv) !=7:
    print(f"python %s " % sys.argv[0])
    sys.exit(0)

results_folder = sys.argv[1]
adata_vis = sc.read_h5ad(sys.argv[2]) ## spatial h5ad
adata_ref = sc.read_h5ad(sys.argv[3]) ## single cell h5ad
target = sys.argv[4]
adata_ref.__dict__['_raw'].__dict__['_var'] = adata_ref.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})
adata_ref.raw.var.index = adata_ref.raw.var["features"]
adata_ref = adata_ref.raw.to_adata()
adata_ref.obs["region"] = adata_ref.obs["region"].astype("category")
adata_ref.obs["batch"] = adata_ref.obs["batch"].astype("category")
adata_ref.obs["Time"] = adata_ref.obs["Time"].astype("category")
adata_ref.obs[target] = adata_ref.obs[target].astype("category")
sample = sys.argv[5] ##sample_name
mod_name = sys.argv[6] ##mod name saved

# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'
if not os.path.exists(ref_run_name):
    os.makedirs(ref_run_name)
if not os.path.exists(run_name):
    os.makedirs(run_name)

adata_vis.uns = dt(dict)
adata_vis.uns['spatial']["sample"] = sample

# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('mt-') for gene in adata_vis.var_names]

# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]

adata_ref.var['SYMBOL'] = adata_ref.var_names
# rename 'GeneID-2' as necessary for your data
#adata_ref.var.set_index('GeneID-2', drop=True, inplace=True)

from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
print(selected)

# filter the object
#adata_ref = adata_ref[:, selected].copy()
adata_ref = adata_ref[:, selected]
#adata_ref.write("adata_ref.h5ad")

# prepare anndata for the regression model
cell2location.models.RegressionModel.setup_anndata(
    adata=adata_ref,
    # 10X reaction / sample / batch
    batch_key='batch',
    # cell type, covariate used for constructing signatures
    labels_key=target,
    # multiplicative technical effects (platform, 3' vs 5', donor effect)
    categorical_covariate_keys=['region']
)

# create the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)

# view anndata_setup as a sanity check
mod.view_anndata_setup()

mod.train(max_epochs=250, use_gpu=True)
mod.plot_history(20)
plt.savefig(f"{results_folder}/mod.plot_history.pdf")

# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
    adata_ref, sample_kwargs={'num_samples': 2000, 'batch_size': 2000, 'use_gpu': True}
)

# Save model
mod.save(f"{mod_name}", overwrite=True)

adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
mod.plot_QC()
plt.savefig(f"{results_folder}/mod.plot_QC.pdf")

# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
    inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
    inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.to_csv(f"{results_folder}/inf_aver.txt", index=True, header=True)

## Cell2location: spatial mapping
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
print(intersect[1:10])
adata_vis = adata_vis[:, intersect].copy()
adata_vis.write("sp.h5ad")
inf_aver = inf_aver.loc[intersect, :].copy()

# prepare anndata for cell2location model
cell2location.models.Cell2location.setup_anndata(adata=adata_vis, batch_key="orig.ident")

# create and train the model
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    # the expected average cell abundance: tissue-dependent
    # hyper-prior which can be estimated from paired histology:
    N_cells_per_location=10,
    # hyperparameter controlling normalisation of
    # within-experiment variation in RNA detection:
    detection_alpha=20
)
```

vitkl commented 1 year ago

What is minified data?

@forrwill I recommend making sure that adata.X (or adata.layers["whatever slot you are using"]) is a scipy.sparse.csr_matrix with data type "float32":

adata.X = scipy.sparse.csr_matrix(adata.X, dtype="float32")

forrwill commented 1 year ago

Thank you, I will check it.

adamgayoso commented 1 year ago

@forrwill can you provide a full traceback of the error?

forrwill commented 1 year ago

Do you mean the log file or the input data I use? I set adata.X = scipy.sparse.csr_matrix(adata.X, dtype="float32"), but the error still exists.

forrwill commented 1 year ago

          extra_categorical_covs State Registry
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃  Source Location   ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['batch'] │   W   │          0          │
│                    │  XC   │          1          │
│                    │            │                     │
└────────────────────┴────────────┴─────────────────────┘
/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning_fabric/plugins/e
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning_fabric/plugins/e
  rank_zero_warn(
/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pytorch_lightning/trainer/
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pytorch_lightning/trainer/
  rank_zero_warn(
Training:   0%|                                                                                                       | 0/250 [00:00<?, ?it/s
Epoch 250/250: 100%|███████████████████████████████████████████████████████████| 250/250 [03:39<00:00,  1.14it/s, v_num=1, elbo_train=1.62e+8
Sampling local variables, batch:   0%|                                                                                 | 0/75 [00:00<?, ?it/s
Sampling global variables, sample:   0%|                                                                              | 0/199 [00:00<?, ?it/s
Traceback (most recent call last):
  File "./cell2loc_mapping.py", line 135, in <module>
    mod = cell2location.models.Cell2location(
  File "/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/cell2location/mode
    super().__init__(adata)
  File "/soft/Miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/model/base/_b
    raise NotImplementedError(
NotImplementedError: The Cell2location model currently does not support minified data.

vitkl commented 1 year ago

Which model are you attempting to train? The error message suggests cell2location.models.Cell2location but the number of epochs is for the regression model.

We strongly recommend against minibatch training (batch_size=number) for cell2location.models.Cell2location because it gives lower accuracy and requires extremely long training to achieve decent results. You have a fairly large GPU, so please try full data training (batch_size=None).

If you really need to limit batch_size, you have to use our experimental amortised inference approach, which uses a neural network to approximate cell abundance. This approach is generally less sensitive (especially on low-count data such as Stereoseq), but on good-quality data (such as the human lymph node and mouse brain used in the cell2location paper) it can give very similar results to our preferred approach. You can try aggregating proximal Stereoseq locations to get higher data quality. See here for the required settings https://github.com/BayraktarLab/cell2location/discussions/264#discussioncomment-5341068 and please post the exact code you will use here to make sure this approach is used correctly.
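As a rough illustration of what aggregating proximal locations could look like, here is a minimal pure-Python sketch that sums counts of spots falling into the same square bin. The function name `bin_spots`, the `(x, y, counts)` input layout, and the gene names are hypothetical and chosen only for this example; this is not a cell2location or Stereoseq API, and real data would normally be binned on the AnnData object instead.

```python
from collections import defaultdict


def bin_spots(spots, bin_size):
    """Sum per-gene counts of all spots that fall into the same square bin.

    spots: iterable of (x, y, {gene: count}) tuples (hypothetical layout).
    bin_size: side length of a bin, in the same units as x and y.
    Returns {(bin_x, bin_y): {gene: summed_count}}.
    """
    binned = defaultdict(lambda: defaultdict(int))
    for x, y, counts in spots:
        # Integer division maps every coordinate to the index of its bin.
        key = (x // bin_size, y // bin_size)
        for gene, count in counts.items():
            binned[key][gene] += count
    # Convert nested defaultdicts to plain dicts before returning.
    return {key: dict(genes) for key, genes in binned.items()}


# Toy example: three spots, two of which share the 2x2 bin at (0, 0).
spots = [
    (0, 0, {"Gad1": 1}),
    (1, 1, {"Gad1": 2, "Slc17a7": 1}),
    (2, 0, {"Slc17a7": 3}),
]
binned = bin_spots(spots, bin_size=2)
```

Larger bins give more counts per location at the cost of spatial resolution, so the bin size is a trade-off to tune for the tissue.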

It is possible that cell2location.models.Cell2location doesn't support minified data, but I don't know what minified data is.
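One possible explanation, not confirmed in this thread: the posted script replaces `adata_vis.uns` with a `defaultdict` (`adata_vis.uns = dt(dict)`). If a library looks up an optional flag in `uns` through a mapping wrapper whose `.get` falls back on `__getitem__`, a `defaultdict` creates and returns an empty `{}` for the missing key instead of the default `None`, so an "is not None" check could fire even though no flag was ever set. The sketch below shows only this general Python behaviour; `UnsWrapper`, `read_flag`, and the key `minify_flag` are hypothetical names and do not reproduce anndata or scvi-tools internals.

```python
from collections import defaultdict
from collections.abc import MutableMapping


class UnsWrapper(MutableMapping):
    """Hypothetical stand-in for a mapping wrapper around a user-supplied dict."""

    def __init__(self, data):
        self.data = data

    def __getitem__(self, key):
        # Delegates to the wrapped mapping; a defaultdict will create the key here.
        return self.data[key]

    def __setitem__(self, key, value):
        self.data[key] = value

    def __delitem__(self, key):
        del self.data[key]

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)


def read_flag(uns, key="minify_flag"):
    # Mapping.get() calls __getitem__ and returns the default only on KeyError.
    return UnsWrapper(uns).get(key, None)


plain_result = read_flag({})                        # missing key -> None
defaultdict_result = read_flag(defaultdict(dict))   # missing key -> {} (not None)
```

If this is what happens here, assigning a plain dict (for example `adata_vis.uns['spatial'] = {'sample': sample}` without replacing `uns` by a `defaultdict`) would be worth testing; treat that as a guess to check, not a confirmed fix.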

Just a tip: wrap your code into backticks to display it nicely: "```python"

"```"
