JackieHanLab / TOSICA

Transformer for One-Stop Interpretable Cell-type Annotation
MIT License

use TOSICA for cell classification #13

Open suhuanhou opened 1 year ago

suhuanhou commented 1 year ago

I would like to use TOSICA for cell classification. Can you provide specific examples, especially of how to construct a training set?

JiaweiChenGo commented 10 months ago

Thanks for your interest. A running demo is here. To construct a training set, choose a well-annotated dataset that suits your research needs, preprocess it with sc.pp.normalize_total, sc.pp.log1p, and sc.pp.highly_variable_genes, and save it as an AnnData object.

zclecle2 commented 7 months ago

Thanks for developing the tool for automatic cell type annotation!

I also want to ask about how to prepare the training set. Is the following code enough for preparation, supposing that train_adata originally contains 35699 cells with 18010 genes?

sc.pp.normalize_total(train_adata, target_sum=1e4)
sc.pp.log1p(train_adata)
sc.pp.highly_variable_genes(train_adata)

Or do I need to filter train_adata so that it contains only the highly variable genes? Is it OK if my train_adata is already normalized, for example exported from the data layer of a Seurat object, and I still run it through the above 3 lines of code?

Do you have any suggestions on how to choose epochs and gmt_path to get better training and prediction results? Which value should I watch to assess whether training is going well if I don't know the true cell types of the query data? Should I stop increasing the number of epochs once the accu value nearly flattens?

When I tried to train on my own reference dataset, I found that the initial accu value is quite low (shown in the following image). Is this normal? (train_adata originally contains 35699 cells with 13295 genes, after running the above 3 lines.)

[image]

I would appreciate your reply!

SteGruener commented 4 months ago

I would also be interested in answers to questions raised above.

JiaweiChenGo commented 4 months ago
  1. I use HVGs to train the model. If all genes are used, there are more parameters to train and training takes longer.
  2. The 3 lines are used to normalize the data. Normalized data can be used as input for the model.
  3. Yes, you can stop increasing the number of epochs when the accu value nearly flattens.
  4. As described in Supplementary Figure 8 of the paper, you can choose any knowledge mask depending on the biological context or your research interests.
  5. For unknown query data, you can use UMAP to visualize the query and reference data in the TOSICA attention latent space and check whether the result is reasonable. You can also find the marker genes of each predicted cell type group to check the annotation.
  6. When you use all genes, the model is much larger and the accu value increases slowly.
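The stopping rule in point 3 (stop adding epochs once accu flattens) can be made concrete with a small check on the accuracy values logged per epoch. This is a sketch under stated assumptions: the function `accuracy_has_plateaued`, the `patience` and `min_delta` thresholds, and the accuracy values are invented for illustration and are not part of TOSICA.

```python
def accuracy_has_plateaued(history, patience=3, min_delta=0.002):
    """Return True when the last `patience` epochs improved the best
    accuracy seen before them by less than `min_delta`."""
    if len(history) <= patience:
        return False                          # not enough epochs to judge
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent - best_before < min_delta


# Illustrative per-epoch accuracy values (made up, not real TOSICA output).
accu = [0.41, 0.63, 0.78, 0.86, 0.90, 0.912, 0.913, 0.9135, 0.913]

# First epoch (1-based) at which the curve counts as flat, or None.
stop_epoch = next(
    (n for n in range(1, len(accu) + 1) if accuracy_has_plateaued(accu[:n])),
    None,
)
print(stop_epoch)
```

A low accuracy in the first epochs, as in the question above, does not trigger the rule. Only a sustained lack of improvement does.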