Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Preparing custom reference files #85

Closed Tripfantasy closed 9 months ago

Tripfantasy commented 9 months ago

Hello! Thank you for the work, I appreciate the inclusion of olfactory models. This is not an issue with the tool itself, but I want to annotate using a custom dataset as the reference, and I was wondering how best to format the files for use as training data. I am working with this dataset:

https://assets.nemoarchive.org/dat-jb2f34y

It contains a CSV of metadata/label information and an h5 file with the count matrix, barcodes, genes, etc. Is there a standard way to consolidate the metadata and the h5 file for use as a reference, especially considering the size of the file?

Thanks!

ChuanXu1 commented 9 months ago

@Tripfantasy, a model for this dataset is already available in CellTypist as "Mouse_Isocortex_Hippocampus.pkl".
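
As a rough illustration of using that built-in model, the sketch below downloads it and annotates a query dataset. It assumes the query `.h5ad` file holds raw counts in `adata.X` (CellTypist expects log1p-transformed expression normalized to 10,000 counts per cell). The function name `annotate_with_builtin_model` and the query file name in the usage note are placeholders, not part of CellTypist.

```python
MODEL_NAME = "Mouse_Isocortex_Hippocampus.pkl"  # model name quoted in this thread


def annotate_with_builtin_model(h5ad_path, model_name=MODEL_NAME):
    """Annotate a query dataset with a built-in CellTypist model (sketch)."""
    # Imported lazily so this sketch can be loaded without the heavy dependencies.
    import scanpy as sc
    import celltypist
    from celltypist import models

    adata = sc.read_h5ad(h5ad_path)

    # Assumption: adata.X holds raw counts, so normalize to 10,000 counts
    # per cell and log1p-transform before prediction.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Fetch the model into the local CellTypist model folder, then predict.
    models.download_models(model=model_name)
    predictions = celltypist.annotate(adata, model=model_name, majority_voting=True)

    # Return an AnnData object with the prediction columns added to .obs.
    return predictions.to_adata()
```

Calling `annotate_with_builtin_model("my_query.h5ad")` (hypothetical file name) should return an AnnData object whose `.obs` includes columns such as `predicted_labels` and `majority_voting`.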

I am not aware of a standard way to process these files into an h5ad object, but below is the code I usually use:

import pandas as pd
import h5py
import scanpy as sc
from scipy.sparse import csr_matrix

# Load the count matrix and transpose it so that rows are cells and columns are genes.
# Note: f['data']['counts'][()] reads the full dense matrix into memory before conversion.
with h5py.File('expression_matrix.hdf5', 'r') as f:
    adata = sc.AnnData(csr_matrix(f['data']['counts'][()])).T
    adata.var_names = [g.decode('utf-8') for g in f['data']['gene'][()]]
    adata.obs_names = [c.decode('utf-8') for c in f['data']['samples'][()]]

# Attach the t-SNE coordinates; every cell must have exactly one coordinate row.
coor = pd.read_csv('tsne.csv', index_col = 0)
assert coor.shape[0] == coor.index.intersection(adata.obs_names).size
assert coor.shape[0] == len(adata.obs_names)
adata.obsm['X_tsne'] = coor.loc[adata.obs_names].values

# Keep only the cells described in the metadata, then attach the metadata as adata.obs.
meta = pd.read_csv('metadata.csv', index_col = 0)
assert meta.shape[0] == meta.index.intersection(adata.obs_names).size
assert meta.shape[0] <= len(adata.obs_names)
adata = adata[meta.index].copy()
adata.obs = meta

adata.write('Yao_2021.h5ad')

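
To use the resulting file as a custom reference, a model can then be trained from it with `celltypist.train`. The following is only a sketch under stated assumptions: it assumes `Yao_2021.h5ad` stores raw counts in `adata.X`, and the function name, the label column, and the output path are hypothetical and should be replaced with the actual ones for your metadata.

```python
def train_custom_model(h5ad_path, label_column, model_path, n_jobs=None):
    """Train a custom CellTypist model from a labelled h5ad file (sketch)."""
    # Imported lazily so this sketch can be loaded without the heavy dependencies.
    import scanpy as sc
    import celltypist

    adata = sc.read_h5ad(h5ad_path)

    # Assumption: adata.X holds raw counts. CellTypist expects
    # log1p-transformed expression normalized to 10,000 counts per cell.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # label_column must name a column of adata.obs holding the cell type labels.
    model = celltypist.train(
        adata,
        labels=label_column,
        n_jobs=n_jobs,
        feature_selection=True,  # restrict training to informative genes
    )

    # Save the model so it can later be passed to celltypist.annotate.
    model.write(model_path)
    return model
```

For example, `train_custom_model('Yao_2021.h5ad', 'cell_type', 'Yao_2021_custom.pkl')`, where `'cell_type'` is a placeholder for whichever metadata column holds the labels. The saved `.pkl` path can then be given as the `model` argument of `celltypist.annotate`.
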
Tripfantasy commented 9 months ago

Oh neat! Thank you for the clarification, this helps a lot.