biomap-research / scFoundation

Apache License 2.0

Mapping embeddings #31

Open lamasJose opened 1 week ago

lamasJose commented 1 week ago

Hi, I'd like to replicate your mapping results, but the line that assigns the embeddings to 'obsm' in the AnnData object fails with the following error:

Traceback (most recent call last):
  File "/home/jmlamas/integrationTests.py", line 43, in <module>
    merged.obsm['scf']=scfemb
  File "/home/jmlamas/miniforge3/envs/scgptGPU/lib/python3.10/site-packages/anndata/_core/aligned_mapping.py", line 199, in __setitem__
    value = self._validate_value(value, key)
  File "/home/jmlamas/miniforge3/envs/scgptGPU/lib/python3.10/site-packages/anndata/_core/aligned_mapping.py", line 268, in _validate_value
    return super()._validate_value(val, key)
  File "/home/jmlamas/miniforge3/envs/scgptGPU/lib/python3.10/site-packages/anndata/_core/aligned_mapping.py", line 89, in _validate_value
    raise ValueError(msg)
ValueError: Value passed for key 'scf' is of incorrect shape. Values of obsm must match dimensions ('obs',) of parent. Value had shape (10,) while it should have had (13999,).

And my script is the following:

from matplotlib import rcParams
import numpy as np
import scanpy as sc
import pandas as pd
import scib
import os
from datetime import datetime

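def subsample(adata, target_cells=1000, cluster_key='cell_type_group'):
    # Split the data into one AnnData subset per cluster.
    adatas = [adata[adata.obs[cluster_key].isin([clust])] for clust in adata.obs[cluster_key].cat.categories]

    # Downsample every cluster that has more than target_cells cells.
    for dat in adatas:
        if dat.n_obs > target_cells:
            sc.pp.subsample(dat, n_obs=target_cells)

    # Concatenate the per-cluster subsets back into a single object.
    adata_downsampled = adatas[0].concatenate(*adatas[1:])
    return adata_downsampled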

rawmerged = sc.read_h5ad('/data/rawmerged.h5ad')
merged = rawmerged.copy()

output_dir = 'scFoundation/' + datetime.now().strftime('%Y%m%d%H%M%S')
os.makedirs(output_dir, exist_ok=True)
sc.settings.figdir = output_dir

sc.pp.highly_variable_genes(merged)
sc.pp.scale(merged)
sc.pp.pca(merged)

sc.external.pp.bbknn(merged, batch_key='batch_id')  # running bbknn 1.3.6
sc.tl.umap(merged)

sc.settings.figdir = "./figure/"
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['pdf.fonttype'] = 42
rcParams['ps.fonttype'] = 42

sc.pl.umap(merged,color=['cell_type_group','batch_id'],save='raw',ncols=1)

scfemb = np.load('./data/organoid_01B-resolution_singlecell_cell_embedding_t4.5_resolution.npy')

merged.obsm['scf'] = scfemb
scfadata = merged.copy()
sc.external.pp.bbknn(scfadata, batch_key='batch_id', use_rep='scf', n_pcs=scfemb.shape[1])  # running bbknn 1.3.6
sc.tl.umap(scfadata)
sc.pl.umap(scfadata, color=['cell_type_group', 'batch_id'], save='scfoundation', ncols=1)

print(scib.metrics.clisi_graph(merged, label_key='cell_type_group', type_='knn'))

print(scib.metrics.clisi_graph(scfadata, label_key='cell_type_group', type_='knn'))

print(scib.metrics.ilisi_graph(merged, batch_key='batch_id', type_='knn'))

print(scib.metrics.ilisi_graph(scfadata, batch_key='batch_id', type_='knn'))

sc.pl.umap(rawmerged,color=['cell_type_group','batch_id'],ncols=1,save='scvi')

Any ideas?

Thanks!

WhirlFirst commented 1 week ago

Hi,

We have provided all the processed data at this link: https://figshare.com/articles/dataset/scFoundation_Large_Scale_Foundation_Model_on_Single-cell_Transcriptomics_-_processed_datasets/24049200?file=42171594

You can download it and compare it against your own files.

As for your error, it seems that you didn't generate scFoundation embeddings for all cells.

ValueError: Value passed for key 'scf' is of incorrect shape. Values of obsm must match dimensions ('obs',) of parent. Value had shape (10,) while it should have had (13999,).
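A shape check before the assignment makes this kind of mismatch easier to diagnose. The following is only a minimal sketch, not code from the scFoundation repository: `check_embedding` is a hypothetical helper name, and it assumes the embedding is meant to be a 2-D array with one row per cell.

```python
import numpy as np

def check_embedding(emb, n_obs):
    """Hypothetical helper: verify that emb has one row per cell.

    Returns the embedding as a NumPy array, or raises ValueError with a
    message describing the mismatch.
    """
    emb = np.asarray(emb)
    # obsm entries are expected to be (cells x features) here.
    if emb.ndim != 2:
        raise ValueError(
            f"expected a 2-D array (cells x features), got shape {emb.shape}"
        )
    # The first dimension has to match the number of observations.
    if emb.shape[0] != n_obs:
        raise ValueError(
            f"embedding has {emb.shape[0]} rows but the data has {n_obs} cells"
        )
    return emb
```

In the script above it could be used as `merged.obsm['scf'] = check_embedding(scfemb, merged.n_obs)`. With the reported array of shape (10,) it would fail on the first check, which suggests the loaded file does not hold a per-cell embedding matrix at all.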

lamasJose commented 1 week ago

Thanks for the response.

My question now is: how can this be caused by the embeddings not being generated correctly? What I did was take your mapping ipynb file almost unchanged and run it with the data you provide. Did I miss some necessary lines when copying it?

WhirlFirst commented 1 week ago

Hi,

I suspect you loaded the wrong embedding file. The ./data/organoid_01B-resolution_singlecell_cell_embedding_t4.5_resolution.npy file needs to be generated on your own from the raw data. As described in the README, the ipynb file here is the one that generated the results stored on Figshare. So if you want to run the ipynb yourself, you first need to generate the embedding with the inference script we provide: https://github.com/biomap-research/scFoundation/tree/main/model#2-inference

I suggest double-checking the ./data/organoid_01B-resolution_singlecell_cell_embedding_t4.5_resolution.npy file you generated. Alternatively, you can load the generated embedding directly from the scfadata.h5ad in the Figshare files.
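For the second option, a rough sketch of how the precomputed file could be inspected is shown below. The helper names `embedding_candidates` and `load_precomputed` are made up for illustration, and the sketch does not assume any particular 'obsm' key inside scfadata.h5ad; it only lists the entries that have one row per cell so the right one can be picked by hand.

```python
import numpy as np

def embedding_candidates(obsm, n_obs):
    """Hypothetical helper: map each obsm-like key to its shape,
    keeping only 2-D entries that have one row per cell."""
    found = {}
    for key, value in obsm.items():
        arr = np.asarray(value)
        if arr.ndim == 2 and arr.shape[0] == n_obs:
            found[key] = arr.shape
    return found

def load_precomputed(path):
    """Hypothetical helper: read an .h5ad file and report which obsm
    entries could hold the embedding."""
    # Imported lazily so embedding_candidates works without anndata installed.
    import anndata as ad
    adata = ad.read_h5ad(path)
    return adata, embedding_candidates(adata.obsm, adata.n_obs)
```

Calling `load_precomputed('scfadata.h5ad')` on the downloaded file would return the AnnData object together with the candidate keys and their shapes. The path is whatever location the Figshare file was saved to.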

lamasJose commented 1 week ago

Thank you! That worked