Closed (knowledge-llz closed 2 months ago)
The tokenizer.tiktoken used by LongCite-8B is actually tokenizer.model from the original Llama3.1 provided by Meta (not the Hugging Face version). We added some special tokens such as "<|user|>" and "<|assistant|>" to the tokenizer (as shown in tiktoken_tokenizer.py) so that it uses the same chat format as LongCite-9B, which is convenient for experiments. You can also add these tokens by modifying the Hugging Face Llama 3 tokenizer.
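For the last option, a minimal sketch of one way to register the chat markers on a Hugging Face tokenizer is shown below. It is not the code from tiktoken_tokenizer.py: the helper name `add_chat_tokens` and the `CHAT_TOKENS` list are hypothetical, only the two tokens named above are included, and the resulting token ids may differ from the ones LongCite-8B uses, so check them against the released checkpoint before relying on this.

```python
# Hypothetical helper: registers chat markers as special tokens on a
# Hugging Face tokenizer and, if a model is given, grows its embeddings.
CHAT_TOKENS = ["<|user|>", "<|assistant|>"]  # only the tokens named in this thread


def add_chat_tokens(tokenizer, model=None, tokens=CHAT_TOKENS):
    """Return how many of `tokens` were newly added to the vocabulary."""
    # add_special_tokens returns the number of tokens that were not
    # already in the vocabulary.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": list(tokens)}
    )
    # New ids need matching embedding rows, otherwise lookups go out of range.
    if model is not None and num_added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return num_added


# Usage (assumes access to a Llama 3.1 checkpoint on the Hugging Face Hub;
# the repo id below is an example, not taken from this thread):
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
#   add_chat_tokens(tok)
#   print(tok.convert_tokens_to_ids("<|user|>"))
```

Tokens added this way are appended after the existing vocabulary, which is why the embedding matrix has to be resized when a model is attached.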
I noticed that LongCite-8B uses TikTokenizer instead of Llama 3's tokenizer. Could you explain what modifications LongCite-8B made to its vocabulary, and why these changes were made?