Open Jasmine710-lab opened 4 months ago
I am not the author, but I encountered a similar issue and resolved it by updating the vocabulary. I think you can use expand_gene_list.py to update the vocabulary to your version (the same one specified in data_config.py). After the update, few genes should be missing from the vocabulary when you convert .h5ad to .scb. There also seems to be a typo in the error message: it should report the unmatched share, "{(1-match_ratio)*100:.0f}% of the tokens ...", rather than match_ratio itself.
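Before converting, it can help to see which gene names would be missing. Below is a minimal sketch in plain Python, assuming the gene names from adata.var[feature_name] and the vocabulary tokens have already been loaded into ordinary Python collections. The function name missing_tokens and the example gene names are hypothetical and not part of the project's code.

```python
def missing_tokens(tokens, vocab):
    """Return the sorted unique tokens that are absent from vocab."""
    vocab_set = set(vocab)  # works for a dict (keys) or any iterable of tokens
    return sorted({t for t in tokens if t not in vocab_set})


# Hypothetical example data, for illustration only.
genes = ["TP53", "BRCA1", "MYC", "NEWGENE1"]
vocab = {"TP53": 0, "BRCA1": 1, "MYC": 2}

missing = missing_tokens(genes, vocab)
print(f"{len(missing)} of {len(set(genes))} genes missing: {missing}")
```

If the list of missing genes is long, that suggests the vocabulary needs to be expanded before conversion.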
Hello!
First, I would like to express my appreciation for your impressive work on the project. While working with the pretraining data, I ran into an issue similar to the one reported in issue #139, which does not appear to have received a response yet.
I am encountering a ValueError indicating that a significant number of tokens in adata.var[feature_name] are not present in the vocabulary. This seems to be a common issue. Reviewing the scg.scbank.databank code, I noticed a validation step where tokens are checked against the vocabulary.
According to this, if the match_ratio is less than 0.9, the process raises an error, "{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab", and seems to skip processing those files.
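For reference, the check described above can be approximated as follows. This is a sketch based only on the description in this issue, not the actual scg.scbank.databank code. The function name check_vocab_match and its threshold parameter are hypothetical, and the message here reports the unmatched share, (1 - match_ratio).

```python
def check_vocab_match(tokens, vocab, token_col="feature_name", threshold=0.9):
    """Raise ValueError if too few tokens are found in vocab (hypothetical helper)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("no tokens to validate")

    vocab_set = set(vocab)
    matched = sum(1 for t in tokens if t in vocab_set)
    match_ratio = matched / len(tokens)

    if match_ratio < threshold:
        # Report the share of tokens that are NOT in the vocabulary.
        raise ValueError(
            f"{(1 - match_ratio) * 100:.0f}% of the tokens in "
            f"adata.var[{token_col}] are not in vocab."
        )
    return match_ratio


# Hypothetical example: two of four gene names are unknown to the vocabulary.
example_vocab = {"TP53": 0, "BRCA1": 1}
try:
    check_vocab_match(["TP53", "BRCA1", "GENE_X", "GENE_Y"], example_vocab)
except ValueError as err:
    print(err)
```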
Could you please advise on how to resolve this issue? Is there an updated vocabulary that I should be using, or perhaps a different token_col setting that aligns better with the available data?
Thank you very much for your time and assistance. I look forward to your guidance on resolving this challenge.