ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

Add support for encoding pretokenized sequences #42

Open kabachuha opened 3 months ago

kabachuha commented 3 months ago

Useful for batch processing and making embeddings cache of numerous documents with dataloaders.

The results for the dict input and the vanilla list of strings are identical; the raw tokenized `transformers` encoding differs slightly, but I think that's just that library's behavior.

(Screenshots, 2024-06-16: results for the dict input, the string list, and the raw tokenized encoding)
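For context, a minimal sketch of the workflow this would enable. The dict-input form of encode is what this PR proposes, not the released API, and the document texts below are placeholders:

from transformers import AutoTokenizer
from gritlm import GritLM

model = GritLM("GritLM/GritLM-7B", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")

documents = ["first document ...", "second document ..."]  # placeholder texts

# Pre-tokenize once (e.g. inside dataloader workers) so the token tensors
# can be cached to disk and reused across runs.
batch = tokenizer(
    documents,
    padding=True,
    truncation=True,
    max_length=300,
    return_tensors="pt",
)

# Proposed usage: pass the pretokenized dict instead of raw strings.
doc_embeddings = model.encode(
    {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]}
)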

Muennighoff commented 3 months ago

Nice! It is odd that it differs. How do you instantiate the tokenizer? Maybe a special token is missing or something similar.

kabachuha commented 3 months ago
from gritlm import GritLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")

tokenizer_max_length = 300

# the part with docs
...

# Tokenize the whole batch with fixed-length padding so the tensors stack.
tokenizer_output_x = tokenizer(
    documents,
    padding='max_length',
    truncation=True,
    max_length=tokenizer_max_length,
    return_tensors="pt",
)

Nothing unusual, but I do set the max length to enable batch encoding.

Muennighoff commented 3 months ago

Can you try without the max length and see if you get the same results? I think the results should be exactly the same.
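As a quick way to check this, a sketch of a comparison, reusing documents, tokenizer, and tokenizer_max_length from the snippet above. With identical truncation settings, fixed-length padding and dynamic padding should agree on every attended position; only the pad columns, which the attention mask excludes from pooling, differ:

import torch

dynamic = tokenizer(documents, padding=True, truncation=True,
                    max_length=tokenizer_max_length, return_tensors="pt")
fixed = tokenizer(documents, padding='max_length', truncation=True,
                  max_length=tokenizer_max_length, return_tensors="pt")

# Token ids at attended positions must match regardless of padding strategy.
assert torch.equal(
    dynamic["input_ids"][dynamic["attention_mask"].bool()],
    fixed["input_ids"][fixed["attention_mask"].bool()],
)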

kabachuha commented 3 months ago

Alright, thank you for noticing! I've found the problem:

I ran a generation-only test earlier in the notebook, and it printed

Setting pad_token_id to eos_token_id:2 for open-end generation.

Now, without running a generation cell first, the results for the dictionary and the tokenizer output class are exactly the same.

(Screenshots: matching results for both input formats)
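For readers hitting the same discrepancy, a hedged sketch of a guard. The warning above is emitted when generate() is called with no pad token configured, so it falls back to the EOS token; if the pad token then resolves differently between the generation and encoding paths, padded positions get different ids and the embeddings can shift slightly (this mechanism is my reading of the warning, not confirmed in the thread). It also assumes model.generate forwards to the underlying Hugging Face generate, as in the gritlm README:

# Pin the pad token before any generate() or encode() call so both paths
# pad with the same id. (Assumption: the discrepancy came from the pad
# token being resolved differently after generation, per the warning.)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Passing pad_token_id explicitly also silences the open-end generation warning.
inputs = tokenizer(["Hello"], return_tensors="pt")
gen = model.generate(**inputs, max_new_tokens=16,
                     pad_token_id=tokenizer.eos_token_id)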