QizhiPei / BioT5

BioT5 (EMNLP 2023) and BioT5+ (ACL 2024 Findings)
https://arxiv.org/abs/2310.07276
MIT License

ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary. #5

Closed · LiuJinzhe-Keepgoing closed this issue 8 months ago

LiuJinzhe-Keepgoing commented 8 months ago

Very exciting work! However, I hit an error when loading the biot5-base-text2mol tokenizer from Hugging Face. I tried transformers==4.28.1 and 4.33.3.

Code:

tokenizer = T5Tokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base-text2mol", model_max_length=512)

ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.

I have another question. When I load the biot5-base model and follow the text2mol code, the results are as follows:

output_selfies:

<p>M<p>T<p>T<p>P<p>T<p>P<p>S<p>P<p>S<p>P<p>A<p>P<p>S<p>P<p>A<p>P<p>S<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P ... (the <p>A<p>P pattern repeats for the remainder of the generated sequence)

output_smiles:

My code:

from transformers import AutoTokenizer, T5Tokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base")
model = T5ForConditionalGeneration.from_pretrained('/workspace/LLM_ckpt/QizhiPei/biot5-base')

# tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base-text2mol", model_max_length=512)
# model = T5ForConditionalGeneration.from_pretrained('QizhiPei/biot5-base-text2mol')

task_definition = 'Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.\n\n'
text_input = 'The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid.'
task_input = f'Now complete the following example -\nInput: {text_input}\nOutput: '

model_input = task_definition + task_input
input_ids = tokenizer(model_input, return_tensors="pt").input_ids

generation_config = model.generation_config
generation_config.max_length = 512
generation_config.num_beams = 1

outputs = model.generate(input_ids, generation_config=generation_config)
output_selfies = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(' ', '')
print("output_selfies:",output_selfies)

import selfies as sf
output_smiles = sf.decoder(output_selfies)
print("output_smiles:",output_smiles)

Thank you for your help!!!

QizhiPei commented 8 months ago

Thanks for your interest in our work and for bringing this issue to my attention.

The error you encountered with the Hugging Face biot5-base-text2mol tokenizer is indeed caused by a mismatch between transformers versions. When I initially uploaded the model and tokenizer to the Hugging Face repository, I used a newer transformers version (4.34.x) than the compatible one (<=4.33.3, as specified in the requirements.txt of the GitHub repository). The transformers 4.34.0 update changed the token-addition logic, so tokenizer files saved with 4.34.x assign added-token indices inconsistently when reloaded with earlier versions.

To address this, I have updated the files in the Hugging Face repository using transformers 4.33.3, which should resolve the issue. Please try loading the tokenizer again.
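For reference, a minimal sketch of the workaround on the user side, assuming a pip-based environment (the Hub id is the one from your snippet):

# Assumption: pip-based environment. Pin transformers to a version whose
# added-token handling matches the saved tokenizer files:
#   pip install "transformers==4.33.3"

from transformers import T5Tokenizer

# Loading from the Hub id; a local checkpoint path works the same way.
tokenizer = T5Tokenizer.from_pretrained(
    "QizhiPei/biot5-base-text2mol", model_max_length=512
)
print(len(tokenizer))  # should load without the non-consecutive added-token error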

Regarding your second question about using biot5-base for text2mol: note that biot5-base has not undergone task-specific instruction tuning, and its outputs are influenced by the other pre-training tasks. Consequently, prompting it with the text2mol template and a plain text input will not reliably yield the corresponding molecule output. This limitation is inherent to the model's current training status.
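In practice, generating SELFIES from a description requires the fine-tuned checkpoint rather than the base model. A minimal sketch, assuming the QizhiPei/biot5-base-text2mol checkpoint on the Hub and reusing the prompt template from your snippet:

import selfies as sf
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Use the task-specific fine-tuned checkpoint, not the base pre-trained model.
tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base-text2mol", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("QizhiPei/biot5-base-text2mol")

task_definition = 'Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.\n\n'
text_input = '...'  # your molecule description goes here
task_input = f'Now complete the following example -\nInput: {text_input}\nOutput: '

input_ids = tokenizer(task_definition + task_input, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=512)
output_selfies = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(' ', '')
print(sf.decoder(output_selfies))  # decode the generated SELFIES to SMILES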

Once again, thank you for pointing out this issue.