Closed: LiuJinzhe-Keepgoing closed this issue 8 months ago.
Thanks for your interest in our work and for bringing this issue to my attention.

The error you encountered with the Hugging Face biot5-base-text2mol tokenizer is indeed caused by a mismatch between transformers library versions. When I initially uploaded the model and tokenizer to the Hugging Face repository, I used a newer version of transformers (4.34.x) than the compatible one (<=4.33.3) specified in the requirements.txt of the GitHub repository. The incompatibility stems from changes to the token-addition logic introduced in transformers 4.34.0, which lead to inconsistencies in the vocabulary when new tokens are added.

To address this, I have updated the Hugging Face repository using transformers 4.33.3, which should resolve the issue. Please try loading the tokenizer again with this updated version.
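If it helps, the version constraint can be checked programmatically before attempting to load the tokenizer. A minimal sketch (the helper below is hypothetical, not part of the BioT5 repository; it only compares numeric version components):

```python
def parse_version(v: str) -> tuple:
    # Convert a version string like "4.33.3" into (4, 33, 3) for comparison.
    # Non-numeric components (e.g. "dev0") are ignored in this simple sketch.
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def is_compatible(installed: str, maximum: str = "4.33.3") -> bool:
    # BioT5 checkpoints were saved with transformers <= 4.33.3; the
    # added-token logic changed in 4.34.0 and breaks loading them.
    return parse_version(installed) <= parse_version(maximum)
```

In practice you would pass `transformers.__version__` as `installed`, and downgrade with `pip install "transformers<=4.33.3"` if the check fails.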
Regarding your second question about applying the biot5-base model to text2mol: note that biot5-base, as it currently stands, has not undergone task-specific instruction tuning, and it is also influenced by the other pre-training tasks. Consequently, using the text2mol template with plain text input may not directly yield the corresponding molecule outputs. This limitation is inherent to the model's current training and tuning status.
Once again, thank you for pointing out this issue.
Very exciting work, but I ran into an error when loading the Hugging Face biot5-base-text2mol tokenizer. I tried both transformers==4.28.1 and 4.33.3.
code:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base-text2mol", model_max_length=512)
ValueError: Non-consecutive added token '' found. Should have index 32100 but has index 0 in saved vocabulary.
I have another question. When I load the biot5-base model for text2mol, following the text2mol code, the results are as follows:
output_selfies:
output_smiles:
My code:
Thank you for your help!!!