AMontgomerie / question_generator

An NLP system for generating reading comprehension questions
MIT License

Different answer and context tokens described in the documentation #17

Closed dzkb closed 2 years ago

dzkb commented 2 years ago

Hi! Thank you for the great work on this model and accompanying code!

I've noticed that the model's page on huggingface contains instructions on preparing the input text. The description says that two special tokens, answer_token and context_token, have to be placed before the answer and the context respectively, but after browsing the code I noticed that the question generation logic uses <answer> and <context> tokens instead. After some initial tests, the <answer>/<context> tokens do appear to work correctly in generation.

What are the correct tokens for the pretrained model available on huggingface?

AMontgomerie commented 2 years ago

Hi, the values of the context token and answer token are <context> and <answer> respectively, as shown in the source code. You can check the special token values with:

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("iarfmoose/t5-base-question-generator")
print(tokenizer.get_added_vocab())

which should print the tokens and their ids like this:

{'<answer>': 32100, '<context>': 32101}
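For reference, here is a minimal sketch of how an input string for the model could be assembled with these tokens. The exact segment order and spacing are an assumption based on the discussion above and the source code, and build_qg_input is a hypothetical helper, not part of this repository:

from typing import List

def build_qg_input(answer: str, context: str) -> str:
    """Assemble a question-generator input using the <answer>/<context>
    special tokens (format assumed from the source code)."""
    return f"<answer> {answer} <context> {context}"

example = build_qg_input(
    "Paris",
    "Paris is the capital and most populous city of France.",
)
print(example)
# The resulting string can then be passed to the tokenizer and
# model.generate() as usual.
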

I'll update the model card on the huggingface hub to make this clearer.

dzkb commented 2 years ago

Thank you!