JackieHanLab / TOSICA

Transformer for One-Stop Interpretable Cell-type Annotation
MIT License

Could you share the detailed preprocessing code for the other 5 datasets? #9

Closed rushrush2022 closed 1 year ago

rushrush2022 commented 1 year ago

Dear JackieHanLab,

Could you share the full preprocessing code for the other 5 datasets, or detailed instructions on how to preprocess each one?

I hope to reproduce your paper on all 6 datasets, but for now only the preprocessed "hPancreas" data is available at https://figshare.com/projects/TOSICA_demo/158489

I have already downloaded GSE152805_RAW.tar for hBone and GSE132042 for mAtlas, but my attempts to preprocess these data failed. Thanks very much in advance!

JackieHanLab commented 1 year ago

You can download the cell type annotations or reproduce the labels as follows:

For hArtery and hBone, you can reproduce the labels by following the methods and cell type markers described in Liang et al., 2021 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8685327/ , Fig. 2B and Table S1) and Chou et al., 2020 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7331607/ , Table 1), respectively.

For mBrain, you can download the data from: https://drive.google.com/drive/folders/1QQXDuUjKG8CTnwWW_u83MDtdrBXr8Kpq (we use ‘cell_type’ as the label).

For mAtlas, you can download the data from: https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102/2 (we use ‘cell_ontology_class’ as the label).

For mPancreas, you can download the data from: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132188/suppl/GSE132188_adata.h5ad.h5 (we use the sub-types from 'clusters_fig6_broad_final' and 'clusters_fig6_fine_final' as labels).

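We use Scanpy to normalize the count data. The code is shown below:

    import scanpy as sc
    # adata is the AnnData object holding the raw counts for one dataset
    sc.pp.normalize_total(adata, target_sum=1e4)  # scale each cell to 10,000 total counts
    sc.pp.log1p(adata)  # natural log of (1 + x)
    sc.pp.highly_variable_genes(adata)  # flag highly variable genes (default settings)
    adata = adata[:, adata.var.highly_variable]  # keep only the flagged genes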
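For readers unfamiliar with the first two Scanpy calls, here is a minimal pure-Python sketch of the arithmetic behind normalize_total (with target_sum=1e4) and log1p on a toy count matrix. It is an illustration only, not the authors' pipeline: the helper names and the toy counts are hypothetical, the handling of all-zero cells is an assumption of this sketch, and the highly variable gene selection step is not reproduced here.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Assumption of this sketch: leave all-zero cells unchanged
        # to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 1.0
        normalized.append([value * scale for value in row])
    return normalized


def log1p_matrix(matrix):
    """Apply the natural log of (1 + x) to every entry."""
    return [[math.log1p(value) for value in row] for row in matrix]


# Hypothetical toy data: 3 cells (rows) x 4 genes (columns) of raw counts.
toy_counts = [
    [10, 0, 30, 60],
    [1, 1, 0, 2],
    [0, 5, 5, 0],
]

normalized = normalize_total(toy_counts)
logged = log1p_matrix(normalized)

for cell_index, row in enumerate(normalized):
    print(f"cell {cell_index}: total after normalization = {sum(row):.1f}")
```

After normalization every cell has the same total, so cells with different sequencing depths become comparable; the log transform then compresses the range of highly expressed genes while leaving zeros at zero.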