74% of the tokens in adata.var[feature_name] are not in vocab. Please check if using the correct vocab and token_col.

Hello!

First, I would like to express my appreciation for your impressive work on the project. I've been working with the pretraining data and have run into an issue similar to the one reported in issue #139, which does not appear to have received a response yet.

I am encountering a ValueError indicating that a significant number of tokens in adata.var[feature_name] are not present in the vocabulary. This may be a common issue: upon reviewing the scg.scbank.databank code, I noticed that there's a validation step where tokens are checked against the vocabulary:

# validate matching between tokens and vocab
tokens = adata.var[token_col].tolist()
match_ratio = sum([1 for t in tokens if t in self.gene_vocab]) / len(tokens)
if match_ratio < 0.9:
    raise ValueError(
        f"{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab. Please check if using the correct vocab and token_col."
    )

According to this, if match_ratio is less than 0.9, the process raises the error "{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab" and seems to skip processing those files.
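For anyone debugging the same error, here is a minimal sketch of how the check could be reproduced outside the library to see which tokens are missing and which column matches best. The helper names vocab_match_report and rank_token_cols are hypothetical (they are not part of scGPT), and the toy vocabulary and columns merely stand in for the real gene vocab and adata.var. One thing worth noting: as the quoted snippet is written, match_ratio is the fraction of tokens that ARE in the vocab, so a message reporting 74% would correspond to roughly 26% missing.

```python
from typing import Collection, Dict, Iterable, List, Tuple


def vocab_match_report(
    tokens: Iterable[str], vocab: Collection[str]
) -> Tuple[float, List[str]]:
    """Return (fraction of tokens found in vocab, sorted unique missing tokens)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("no tokens to check")
    missing = sorted({t for t in tokens if t not in vocab})
    matched = sum(1 for t in tokens if t in vocab)
    return matched / len(tokens), missing


def rank_token_cols(
    columns: Dict[str, Iterable[str]], vocab: Collection[str]
) -> List[Tuple[str, float]]:
    """Rank candidate token columns by match ratio, best first."""
    scored = [
        (name, vocab_match_report(values, vocab)[0])
        for name, values in columns.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (assumption): stands in for adata.var columns and the gene vocab.
toy_vocab = {"TP53", "BRCA1", "EGFR", "MYC"}
toy_columns = {
    "feature_name": ["TP53", "BRCA1", "EGFR", "GENE_X"],
    "feature_id": ["ID_1", "ID_2", "ID_3", "ID_4"],
}

ratio, missing = vocab_match_report(toy_columns["feature_name"], toy_vocab)
print(f"{ratio:.0%} matched; missing: {missing}")
print(rank_token_cols(toy_columns, toy_vocab))
```

With real data, the token lists would come from something like adata.var[col].tolist() for each candidate column, and the vocab from whatever vocabulary object is being passed to the data bank; the list of missing tokens usually makes it clear whether the mismatch is a naming convention problem (symbols vs. IDs) or a genuinely different gene set.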

Could you please advise on how to resolve this issue? Is there an updated vocabulary that I should be using, or perhaps a different token_col setting that aligns better with the available data?

Thank you very much for your time and assistance. I look forward to your guidance on resolving this challenge.

bowang-lab / scGPT, issue #225