Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Error running celltypist #86

Closed malonzm1 closed 10 months ago

malonzm1 commented 1 year ago

Hi,

I tried using celltypist with the following code:

adata = sc.read_h5ad(filename='%s/GSE137029.h5ad'%infolder)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

But it returned the following error:

raise ValueError(
ValueError: Invalid expression matrix in both .X and .raw.X, expect log1p normalized expression to 10000 counts per cell

It worked with a smaller dataset but fails with the larger one. Please advise.

Thanks and good day.

ChuanXu1 commented 1 year ago

@malonzm1, can you show the result of adata.X.data.max()?

ChuanXu1 commented 1 year ago

New version (1.6.1) should have fixed this.

malonzm1 commented 1 year ago

Thanks!

malonzm1 commented 1 year ago

It still says the following: WARNING:celltypist.logger:⚠️ Warning: invalid expression matrix, expect all genes and log1p normalized expression to 10000 counts per cell. The prediction result may not be accurate

ChuanXu1 commented 12 months ago

@malonzm1, can you show the shape of the data (adata.shape), and the results of adata.X.expm1().sum(axis=1).min() and adata.X.expm1().sum(axis=1).max()?

malonzm1 commented 12 months ago

adata.shape
(3535249, 19494)
adata.X.expm1().sum(axis=1).min()
9999.994
adata.X.expm1().sum(axis=1).max()
10000.007
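For readers following along, the check requested above can be sketched on toy data. This is a minimal illustration using dense NumPy arrays (the thread uses a sparse adata.X); the helper names and the tolerance `tol` are hypothetical and are not taken from CellTypist.

```python
import numpy as np

def per_cell_totals(log1p_matrix):
    """Undo log1p and return the summed expression of each cell (row)."""
    return np.expm1(np.asarray(log1p_matrix, dtype=float)).sum(axis=1)

def looks_log1p_normalized(log1p_matrix, target_sum=1e4, tol=1.0):
    """True if every cell sums to roughly target_sum after expm1.

    tol is an illustrative threshold, not a value confirmed by the thread.
    """
    totals = per_cell_totals(log1p_matrix)
    return bool(np.all(np.abs(totals - target_sum) <= tol))

# Toy data: 3 cells x 4 genes of raw counts
counts = np.array([[1, 0, 3, 6],
                   [2, 2, 0, 0],
                   [0, 5, 5, 10]], dtype=float)

# Scale each cell to 10000 total counts, then apply log1p
normalized = counts / counts.sum(axis=1, keepdims=True) * 1e4
logged = np.log1p(normalized)

totals = per_cell_totals(logged)
print(totals.min(), totals.max())
print(looks_log1p_normalized(logged))
```

Totals close to 10000 for every cell, as reported in the comment above, indicate that the normalization itself is fine and that the warning has another cause.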

ChuanXu1 commented 11 months ago

@malonzm1, that's weird. Did you slice the data (genes) before prediction? Could you put all code here reproducing the warning message above?

malonzm1 commented 11 months ago

The warning message is: WARNING:celltypist.logger:⚠️ Warning: invalid expression matrix, expect all genes and log1p normalized expression to 10000 counts per cell. The prediction result may not be accurate

The code is:

import scanpy as sc
import pandas as pd
import scvi
from glob import glob
import os
import celltypist
from celltypist import models

infolder = '/scratch/cs/pan-autoimmune/data/scvi/10x'
os.chdir(infolder)
adata = sc.read_h5ad(filename='%s/10x.h5ad'%infolder)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars = ['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.pct_counts_mt < 15]
sc.pp.filter_genes(adata, min_counts=3)
sc.pp.filter_genes(adata, min_cells = 3)
sc.pp.filter_cells(adata, min_genes = 200)
sc.pp.filter_cells(adata, min_counts = 200)
sc.pp.normalize_total(adata, target_sum=1e4)
adata.layers["counts"] = adata.X.copy()
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="gse",
)
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["gse"],
    continuous_covariate_keys=['pct_counts_mt', 'total_counts']
    #continuous_covariate_keys=["percent_mito", "percent_ribo"],
)
models.download_models(force_update = True)
predictions = celltypist.annotate(adata, model = 'Immune_All_High.pkl', majority_voting = True)
adata = predictions.to_adata()

ChuanXu1 commented 11 months ago

@malonzm1, you specified subset=True in sc.pp.highly_variable_genes, which means only a subset of genes (here 1200) remains in adata.X. That is why the warning is raised: CellTypist expects all genes (to maximise the overlap between the model and the query data) rather than only a few.
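A small sketch of why gene subsetting can matter, on synthetic data with NumPy only. It assumes that the validity check looks at per-cell totals after expm1, as the maintainer's earlier diagnostic suggests; the thread does not show the actual check, and the array sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic raw counts: 5 cells x 50 genes
counts = rng.poisson(2.0, size=(5, 50)).astype(float)

# Normalize each cell to 10000 counts and log1p-transform (all genes kept)
logged = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)
full_totals = np.expm1(logged).sum(axis=1)

# Keep only the first 10 genes, mimicking a subset taken AFTER normalization
subset = logged[:, :10]
subset_totals = np.expm1(subset).sum(axis=1)

print("all genes:   ", full_totals.round(2))
print("subset genes:", subset_totals.round(2))
```

With all genes present each cell sums to about 10000; after subsetting, the totals fall well short of that, so a check based on per-cell totals would no longer pass. If that is the mechanism, running the annotation on the full gene set would avoid the warning, although the thread does not say which workaround was used.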

By the way, I think you need to put adata.layers["counts"] = adata.X.copy() before sc.pp.normalize_total(adata, target_sum=1e4), so that the layer stores raw counts rather than normalized values.
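The effect of the ordering can be illustrated with plain NumPy arrays standing in for adata.X and adata.layers["counts"]. The `normalize_total` function below is a hypothetical stand-in written for this sketch, not the scanpy function.

```python
import numpy as np

def normalize_total(matrix, target_sum=1e4):
    """Hypothetical stand-in: scale each row so that it sums to target_sum."""
    return matrix / matrix.sum(axis=1, keepdims=True) * target_sum

# Toy raw counts: 2 cells x 3 genes
raw = np.array([[1, 0, 2],
                [3, 3, 1]], dtype=float)

# Order used in the posted code: normalize first, then copy.
# The "counts" copy ends up holding scaled values, not raw counts.
X = normalize_total(raw)
counts_layer_late = X.copy()

# Suggested order: copy first, then normalize and log-transform X.
X = raw.copy()
counts_layer_early = X.copy()
X = np.log1p(normalize_total(X))

print(counts_layer_late)
print(counts_layer_early)
```

This matters for the posted pipeline because both the seurat_v3 flavor of sc.pp.highly_variable_genes and scVI are documented to expect raw counts in the layer they are given.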

ChuanXu1 commented 10 months ago

Will close this issue. Please re-open it if you have further questions.