[Open] kabachuha opened this issue 3 months ago
Nice! It is odd that it differs. How do you instantiate the tokenizer? Maybe there is a special token that's missing or something similar.
```python
from gritlm import GritLM  # the model itself is instantiated elsewhere in the notebook
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")
tokenizer_max_length = 300

# the part with docs
...

# Pad every document to the same fixed length so they can be encoded in batches
tokenizer_output_x = tokenizer(
    documents,
    padding='max_length',
    truncation=True,
    max_length=tokenizer_max_length,
    return_tensors="pt",
)
```
Nothing unusual, but I do set the max length to enable batch encoding.
Can you try without the max length and see if you get the same results? I think the results should be exactly the same.
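A quick way to check that, assuming the `tokenizer`, `documents`, and `tokenizer_max_length` from the snippet above (a sketch, not GritLM's own test; the truncation cutoff is kept identical so the comparison is fair):

```python
import torch

# Dynamic padding: pad only to the longest sequence in this batch
dynamic = tokenizer(
    documents,
    padding=True,
    truncation=True,
    max_length=tokenizer_max_length,
    return_tensors="pt",
)

# Fixed-length padding, as in the snippet above
fixed = tokenizer(
    documents,
    padding='max_length',
    truncation=True,
    max_length=tokenizer_max_length,
    return_tensors="pt",
)

# Compare only the real (non-padding) tokens of each document; the padded
# positions are masked out, so they should not affect the embeddings.
for i in range(len(documents)):
    dyn_ids = dynamic["input_ids"][i][dynamic["attention_mask"][i].bool()]
    fix_ids = fixed["input_ids"][i][fixed["attention_mask"][i].bool()]
    assert torch.equal(dyn_ids, fix_ids)
```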
Alright, thank you for noticing! I've found the problem: I had run a generation-only test earlier in the notebook, and it did set `pad_token_id` to `eos_token_id`:2 for open-end generation (the usual `transformers` warning). Now, without launching a generation cell first, the results with the dictionary and the tokenizer output class are exactly the same.
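If you want to guard a notebook against this kind of silent state change, a small sanity check can catch it. This is just a suggestion of mine, not something from GritLM:

```python
# Snapshot the pad token right after creating the tokenizer...
pad_before = (tokenizer.pad_token, tokenizer.pad_token_id)

# ... run the generation / other cells here ...

# ...and verify before encoding that no cell has silently rebound it.
pad_now = (tokenizer.pad_token, tokenizer.pad_token_id)
assert pad_now == pad_before, f"pad token changed: {pad_before} -> {pad_now}"
```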
Useful for batch processing and for building an embedding cache of numerous documents with dataloaders.
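For reference, a minimal sketch of that pattern, assuming the `GritLM` wrapper and its `encode()` method from the repo's README (the batch size, file name, and `documents` list are placeholders):

```python
import numpy as np
from torch.utils.data import DataLoader
from gritlm import GritLM

model = GritLM("GritLM/GritLM-7B", torch_dtype="auto", mode="embedding")

documents = ["first document ...", "second document ..."]  # placeholder corpus
loader = DataLoader(documents, batch_size=32, shuffle=False)

chunks = []
for batch in loader:  # default collate yields a list of strings per batch
    chunks.append(model.encode(list(batch)))

embeddings = np.concatenate(chunks, axis=0)  # one row per document
np.save("doc_embeddings.npy", embeddings)
```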
The results for the dict and the vanilla list of strings are identical; the raw tokenized `transformers` encoding differs a bit, but I think that's just the behavior of that library.