Longcite-8b's Tokenizer

THUDM / LongCite

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Apache License 2.0


Longcite-8b's Tokenizer #2

Closed by knowledge-llz 2 months ago

knowledge-llz commented 2 months ago

I noticed that Longcite-8B uses TikTokenizer instead of Llama 3's tokenizer. Could you explain what modifications Longcite-8B made to its vocabulary? Why were these changes made?

Neo-Zhangjiajie commented 2 months ago

The tokenizer.tiktoken file used by LongCite-8B is actually the tokenizer.model from the original Llama 3.1 release provided by Meta (not the Hugging Face version). We added some special tokens, such as "<|user|>" and "<|assistant|>", to the tokenizer (as shown in tiktoken_tokenizer.py) so that it uses the same chat format as LongCite-9B, which is convenient for experiments. You can also add these tokens by modifying the Hugging Face Llama 3 tokenizer.
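
For readers who want to try the Hugging Face route mentioned above, here is a minimal sketch, not the code from tiktoken_tokenizer.py. The helper names `assign_special_token_ids` and `add_chat_tokens_hf` are hypothetical, and placing new IDs directly after the base vocabulary is an assumption made for illustration; the actual IDs used by LongCite-8B may differ.

```python
from typing import Dict, Iterable


def assign_special_token_ids(num_base_tokens: int,
                             special_tokens: Iterable[str]) -> Dict[str, int]:
    """Give each special token an ID after the base vocabulary.

    Assumption for illustration: IDs start at num_base_tokens and
    increase in list order. Duplicates keep their first ID.
    """
    ids: Dict[str, int] = {}
    for token in special_tokens:
        if token not in ids:
            ids[token] = num_base_tokens + len(ids)
    return ids


def add_chat_tokens_hf(model_name_or_path: str, new_tokens: Iterable[str]):
    """Add chat-format tokens to a Hugging Face tokenizer.

    Hypothetical helper: transformers is imported lazily so the rest of
    this sketch runs without it. Returns the tokenizer and the number of
    tokens that were actually new to its vocabulary.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    # special_tokens=True marks the tokens as special, so they are kept
    # as single units instead of being split by the BPE merges.
    num_added = tokenizer.add_tokens(list(new_tokens), special_tokens=True)
    return tokenizer, num_added


if __name__ == "__main__":
    chat_tokens = ["<|user|>", "<|assistant|>"]
    # Toy base vocabulary size, chosen only to keep the output readable.
    print(assign_special_token_ids(10, chat_tokens))
```

If the added tokens end up with IDs beyond the model's embedding table, the embeddings would also need to be resized (for example with `model.resize_token_embeddings(len(tokenizer))`) before those tokens can be used.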