bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

74% of the tokens in adata.var[feature_name] are not in vocab. Please check if using the correct vocab and token_col. #225

Open Jasmine710-lab opened 4 months ago

Jasmine710-lab commented 4 months ago

Hello!

First, I would like to express my appreciation for your impressive work on the project. While working with the pretraining data, I ran into the same problem reported in issue #139, which does not appear to have received a response yet.

I am encountering a ValueError indicating that a significant number of tokens in adata.var[feature_name] are not present in the vocabulary. This seems to be a common issue. Reviewing the scg.scbank.databank code, I noticed a validation step where tokens are checked against the vocabulary:

# validate matching between tokens and vocab
tokens = adata.var[token_col].tolist()
match_ratio = sum([1 for t in tokens if t in self.gene_vocab]) / len(tokens)
if match_ratio < 0.9:
    raise ValueError(
        f"{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab. Please check if using the correct vocab and token_col."
    )

According to this, if match_ratio is less than 0.9, the process raises the error "{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab" and seems to skip processing those files.

Could you please advise on how to resolve this issue? Is there an updated vocabulary that I should be using, or perhaps a different token_col setting that aligns better with the available data?
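One way to narrow this down before conversion is to measure, for each candidate column, how many of its entries appear in the vocabulary. The following is a minimal sketch and not part of scGPT: `match_report` is a hypothetical helper, the vocabulary is modeled as a plain set of gene symbols, and the toy columns stand in for what would presumably come from something like adata.var[col].tolist() on real data.

```python
from typing import Dict, Iterable, List, Tuple


def match_report(
    columns: Dict[str, List[str]], vocab: Iterable[str]
) -> Dict[str, Tuple[float, List[str]]]:
    """For each candidate column, return (match_ratio, missing_tokens)."""
    vocab_set = set(vocab)
    report = {}
    for name, tokens in columns.items():
        # Tokens that the vocabulary does not contain, in original order.
        missing = [t for t in tokens if t not in vocab_set]
        ratio = 1.0 - len(missing) / len(tokens) if tokens else 0.0
        report[name] = (ratio, missing)
    return report


# Toy data standing in for adata.var columns (assumed names, not real output).
toy_vocab = {"TP53", "BRCA1", "GAPDH", "ACTB"}
toy_columns = {
    "feature_name": ["TP53", "BRCA1", "GAPDH", "ACTB"],
    "feature_id": ["ENSG00000141510", "ENSG00000012048", "GAPDH", "ACTB"],
}

report = match_report(toy_columns, toy_vocab)
for name, (ratio, missing) in report.items():
    print(f"{name}: {ratio:.0%} matched, missing {missing}")
```

If the missing entries in a column look like Ensembl IDs rather than gene symbols, the token_col is probably the wrong one. If the column holds symbols and many are still missing, the vocabulary itself is the more likely mismatch.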

Thank you very much for your time and assistance. I look forward to your guidance on resolving this challenge.

q225yang commented 3 months ago

I am not the author, but I encountered a similar issue and resolved it by updating the vocabulary. I think you can use expand_gene_list.py to update the vocabulary to your version (the same as specified in data_config.py). After the update, there shouldn't be many genes missing from the vocabulary when you convert .h5ad to .scb. Separately, there seems to be a typo in the error message: it prints the matched percentage rather than the missing one, so it should read "{(1-match_ratio)*100:.0f}% of the tokens ...".
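To make the message typo concrete, here is a small sketch of the same check with the percentage computed from 1 - match_ratio. It is an illustration only, not a patch from the scGPT repository: `validate_tokens` is a hypothetical standalone function, and the 0.9 threshold and message wording are copied from the snippet quoted in the question.

```python
from typing import Collection, List


def validate_tokens(
    tokens: List[str],
    gene_vocab: Collection[str],
    token_col: str = "feature_name",
    min_match: float = 0.9,
) -> float:
    """Return the match ratio, raising if too few tokens are in the vocab."""
    if not tokens:
        raise ValueError(f"adata.var[{token_col}] is empty.")
    match_ratio = sum(t in gene_vocab for t in tokens) / len(tokens)
    if match_ratio < min_match:
        # Report the share of tokens that are NOT in the vocab.
        raise ValueError(
            f"{(1 - match_ratio) * 100:.0f}% of the tokens in "
            f"adata.var[{token_col}] are not in vocab. Please check if "
            "using the correct vocab and token_col."
        )
    return match_ratio


# Toy example: 3 of 4 tokens match, so 25% are reported as missing.
try:
    validate_tokens(["TP53", "BRCA1", "GAPDH", "FAKE1"], {"TP53", "BRCA1", "GAPDH"})
except ValueError as err:
    message = str(err)
    print(message)
```

Read this way, the "74%" in the issue title is the share of tokens that did match, which means roughly 26% were actually missing from the vocabulary.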