Closed xqiu625 closed 1 week ago
Hi, thank you for the question! We recommend using our custom GeneVocab
class, which is defined here: https://github.com/bowang-lab/scGPT/blob/7301b51a72f5db321fccebb51bc4dd1380d99023/scgpt/tokenizer/gene_tokenizer.py#L20
One use case can be found in cell_emb.py; the vocab is also loaded in a similar fashion in the tutorial notebooks.
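For readers following along, here is a minimal sketch of that loading pattern. It assumes scGPT is installed and that the path passed in points to the `vocab.json` shipped with a pretrained checkpoint. The helper names `load_gene_vocab` and `map_genes_to_ids` and the example path are hypothetical, and the special-token list is an assumption based on the tutorials, so adjust it to your checkpoint.

```python
from pathlib import Path

# Special tokens as used in the tutorials (assumption: adjust to your checkpoint).
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eoc>"]


def load_gene_vocab(vocab_file):
    """Load a GeneVocab from a JSON file instead of passing a raw dict."""
    # Imported lazily so the rest of this sketch works without scGPT installed.
    from scgpt.tokenizer.gene_tokenizer import GeneVocab

    vocab = GeneVocab.from_file(Path(vocab_file))
    # Make sure the special tokens exist before tokenizing.
    for token in SPECIAL_TOKENS:
        if token not in vocab:
            vocab.append_token(token)
    # Unknown genes fall back to the padding index instead of raising.
    vocab.set_default_index(vocab["<pad>"])
    return vocab


def map_genes_to_ids(gene_names, vocab):
    """Map gene names to vocab ids, using -1 for genes missing from the vocab."""
    return [vocab[gene] if gene in vocab else -1 for gene in gene_names]


# Hypothetical usage (requires scGPT and a real checkpoint directory):
# vocab = load_gene_vocab("path/to/model_dir/vocab.json")
# ids = map_genes_to_ids(adata.var["gene_name"], vocab)
```

Genes mapped to -1 can then be filtered out of the AnnData object before tokenization, so only genes known to the model are passed on.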
I'm trying to generate embeddings from scGPT for my single-cell data but encountering tokenization issues. Here's my scenario and the errors I'm facing:
Initial Setup:
I've tried different approaches to handle the vocabulary:
Approach 1: Using raw JSON dictionary
Approach 2: Using custom vocabulary class
The error occurs during tokenization:
at this line in gene_tokenizer.py:
I'm using this embedding generation function, based on the example in Tutorial_Reference_Mapping_dataset.ipynb:
Based on these, here are my questions:
Here is the environment I am using:
Let me know if you'd like me to provide any additional information or test any specific solutions.