Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Invalid expression matrix in `.X`, expect log1p normalized expression to 10000 counts per cell; will try the `.raw` attribute #83

Closed hyjforesight closed 9 months ago

hyjforesight commented 10 months ago

Hi Celltypist, Thanks for developing this amazing package!

First, I processed the data with Scanpy and Harmony, including log-normalization.

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
sc.pl.highly_variable_genes(adata)
adata.raw=adata
adata = adata[:, adata.var.highly_variable]

sc.pp.regress_out(adata, keys=['total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)

sc.tl.pca(adata, svd_solver='arpack')
sce.pp.harmony_integrate(adata, key='batch', basis='X_pca', adjusted_basis='X_pca_harmony')
adata.obsm['X_pca']=adata.obsm['X_pca_harmony']
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=1)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='right margin', frameon=False, title='', use_raw=True, save='1.pdf')

adata.write('C:/Users/hyjfo/Documents/integration_fx_adata.h5ad', compression='gzip')

Then I loaded the saved h5ad file and ran CellTypist, but got this message: `Invalid expression matrix in .X, expect log1p normalized expression to 10000 counts per cell; will try the .raw attribute`.

models.download_models(force_update = True)
adata = sc.read('C:/Users/hyjfo/Documents/integration_fx_adata.h5ad')
adata
AnnData object with n_obs × n_vars = 56908 × 5000
    obs: 'batch', 'type', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps', 'leiden'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'batch_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'type_colors', 'umap'
    obsm: 'X_pca', 'X_pca_harmony', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

model= models.Model.load(model = 'Immune_All_High.pkl')
model.cell_types
predictions= celltypist.annotate(adata, model = 'Immune_All_High.pkl', mode = 'best match', p_thres = 0.5, majority_voting = True)
👀 Invalid expression matrix in `.X`, expect log1p normalized expression to 10000 counts per cell; will try the `.raw` attribute
🔬 Input data has 56908 cells and 20466 genes
🔗 Matching reference genes in the model
🧬 6 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 20
🗳️ Majority voting the predictions
✅ Majority voting done!

The data has already been log-normalized with Scanpy, right? Why does this message appear?

Thanks in advance for all the kind help. Best, Yuanjian

ChuanXu1 commented 10 months ago

@hyjforesight, CellTypist needs the all-cell-by-all-gene matrix in a log-normalised format in either .X or .raw.X. Since your .X is scaled data (with negative values), CellTypist falls back to .raw.X for prediction. Btw, it seems your .var_names are not gene symbols, as only 6 features overlap with the model.
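The gene-name point above can be checked before running a prediction. The following is a minimal sketch in plain Python, assuming the input gene names and the model's feature names are both available as lists of strings. The function name `gene_overlap` and every gene name below are illustrative values, not taken from this issue.

```python
# Sketch: count how many input gene names also appear among a model's features.
# All names below are made-up illustration values.

def gene_overlap(var_names, model_features):
    """Return the sorted list of names shared by the input and the model."""
    return sorted(set(var_names) & set(model_features))

# Hypothetical model features, given as gene symbols.
model_features = ["CD3E", "CD19", "MS4A1", "NKG7", "LYZ"]

# Input indexed by Ensembl-style IDs: nothing matches the symbols.
ensembl_names = ["ENSG00000198851", "ENSG00000177455", "ENSG00000156738"]
# Input indexed by gene symbols: most names match.
symbol_names = ["CD3E", "CD19", "MS4A1", "GAPDH"]

print(len(gene_overlap(ensembl_names, model_features)))
print(len(gene_overlap(symbol_names, model_features)))
```

With a real object one would pass `adata.var_names` (or `adata.raw.var_names`) and the model's feature list. A very small overlap, such as the 6 features in the log above, suggests the two use different naming schemes.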

cmf1997 commented 10 months ago

It seems that using sc.pp.regress_out may cause an error:

adataConcat = sc.read_h5ad('***')
sc.pp.normalize_total(adataConcat, target_sum=1e4)
sc.pp.log1p(adataConcat)
sc.pp.regress_out(adataConcat, ['total_counts', 'pct_counts_mt'])
predictions = celltypist.annotate(adataConcat, model = 'Healthy_COVID19_PBMC.pkl', majority_voting = True)
predictions.predicted_labels
adata = predictions.to_adata()

Error:

Invalid expression matrix in .X, expect log1p normalized expression to 10000 counts per cell; will try the .raw attribute
Traceback (most recent call last):
  File "/lustre/home/acct-medzy/medzy-cai/.conda/envs/scRNA-cmf/lib/python3.9/site-packages/celltypist/classifier.py", line 307, in __init__
    self.indata = self.adata.raw.X
AttributeError: 'NoneType' object has no attribute 'X'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/lustre/home/acct-medzy/medzy-cai/.conda/envs/scRNA-cmf/lib/python3.9/site-packages/celltypist/annotate.py", line 79, in annotate
    clf = classifier.Classifier(filename = filename, model = lr_classifier, transpose = transpose_input, gene_file = gene_file, cell_file = cell_file)
  File "/lustre/home/acct-medzy/medzy-cai/.conda/envs/scRNA-cmf/lib/python3.9/site-packages/celltypist/classifier.py", line 311, in __init__
    raise Exception(
Exception: Fail to use the .raw attribute in the input object. 'NoneType' object has no attribute 'X'

ChuanXu1 commented 10 months ago

@cmf1997, you can skip the command sc.pp.regress_out(adataConcat, ['total_counts', 'pct_counts_mt']) for CellTypist prediction, as it yields negative values.
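To illustrate why regressing out a covariate yields negative values, here is a small NumPy sketch on synthetic data. It fits an ordinary least-squares line per gene and keeps the residuals. This is a simplified stand-in chosen for illustration, not scanpy's actual implementation of `regress_out`, and all data and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts: 50 cells x 20 genes (illustration only).
counts = rng.poisson(2.0, size=(50, 20)).astype(float)
total_counts = counts.sum(axis=1)

# Log-normalise to 10,000 counts per cell.
lognorm = np.log1p(counts / total_counts[:, None] * 1e4)

# Regress one covariate out of every gene with ordinary least squares
# and keep the residuals (simplified stand-in, not scanpy's code).
design = np.column_stack([np.ones_like(total_counts), total_counts])
beta, *_ = np.linalg.lstsq(design, lognorm, rcond=None)
residuals = lognorm - design @ beta

print(lognorm.min() >= 0)    # log-normalised values are never negative
print(residuals.min() < 0)   # residuals are centred on zero, so some are negative
```

Because the fit includes an intercept, the residuals of each gene average to zero, so the result can no longer be a log1p-normalised matrix.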

cmf1997 commented 9 months ago

@ChuanXu1 Exactly. Do you recommend running regress_out after cell type prediction? For now I skip regress_out and use bbknn to integrate the multiple datasets.

ChuanXu1 commented 9 months ago

@cmf1997, CellTypist relies on log-normalised expression (to 10,000 counts per cell). If you regress out some covariates, the data will no longer be in that format. After CellTypist prediction using the log-normalised data, you will get additional prediction-related columns in the .obs of the AnnData. Then you can do whatever you want for the downstream analyses, such as regressing out batches, highly variable gene selection, etc.
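A rough way to test whether a matrix is still in the expected format is sketched below with NumPy. It is a heuristic written for this thread, not CellTypist's internal check. The function name `looks_log1p_normalized` and the tolerance `tol` are assumptions, and the input is assumed to be a dense cells-by-genes array (a sparse matrix would need converting first).

```python
import numpy as np

def looks_log1p_normalized(X, target_sum=1e4, tol=1.0):
    """Heuristic: no negative values, and undoing log1p gives ~target_sum per cell."""
    X = np.asarray(X, dtype=float)
    if X.min() < 0:
        return False
    per_cell = np.expm1(X).sum(axis=1)
    return bool(np.all(np.abs(per_cell - target_sum) <= tol))

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(30, 40)).astype(float)
lognorm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

# Scaling each gene to zero mean and unit variance introduces negatives.
scaled = (lognorm - lognorm.mean(axis=0)) / lognorm.std(axis=0)
# Keeping only a subset of genes breaks the per-cell total.
subset = lognorm[:, :10]

print(looks_log1p_normalized(lognorm))
print(looks_log1p_normalized(scaled))
print(looks_log1p_normalized(subset))
```

The last case shows why the matrix has to contain all genes: after subsetting to highly variable genes, the per-cell totals no longer come back to 10,000 even if no scaling was applied.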

hyjforesight commented 9 months ago

Hello Chuan, Thanks for the response. So the best input data for CellTypist is the log-transformed raw matrix before scaling? So I only need to load the 10X matrix, do some simple quality control, remove the unwanted cells, then run only sc.pp.normalize_total(adata, target_sum=1e4) from Scanpy, and then run CellTypist, right?

Thank you! Best, Yuanjian

ChuanXu1 commented 9 months ago

Yes. sc.pp.normalize_total(adata, target_sum=1e4) -> sc.pp.log1p(adata) -> CellTypist run
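For reference, the arithmetic behind those two preprocessing steps can be written out in NumPy. This is a sketch of the computation under default settings, not scanpy's code. The helper name `normalize_log1p` is hypothetical and the counts are a toy example.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    per_cell_total = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / per_cell_total * target_sum)

# Two toy cells with 10 counts each across 4 genes.
toy_counts = [[1, 3, 0, 6],
              [0, 0, 5, 5]]
lognorm = normalize_log1p(toy_counts)

# Undoing the log recovers the normalised values, which sum to 10,000 per cell.
print(np.expm1(lognorm).round())
print(np.expm1(lognorm).sum(axis=1))
```

The first cell, for example, becomes 1000, 3000, 0 and 6000 after normalisation, and the natural log of one plus each value is what ends up in the matrix.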

hyjforesight commented 9 months ago

Thank you @ChuanXu1. I'll close this issue.

woloorn commented 6 months ago

Same issue. I've fixed it by converting adata.X from np.float32 to np.float64. It seems CellTypist doesn't accept float32?

ChuanXu1 commented 6 months ago

@woloorn, I don't think float32 will cause any problem for CellTypist. Does this happen for the newest version of CellTypist?