I tried adding some special tokens to the vocab of a pretrained model (I made a PR for a minor code fix). When I try to encode strings, these new tokens are sometimes broken into multiple tokens instead of being encoded as a single token.
How do I make sure my special tokens always map to the same ID?
Code to reproduce what I am seeing:
import tokenmonster

vocab = tokenmonster.load("englishcode-32000-consistent-v1")
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text and print the token IDs for each string
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
for s in text:
    print(vocab.tokenize(s))
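For reference, a minimal sanity check (continuing from the snippet above, and assuming vocab.tokenize accepts a single string and returns a sequence of token IDs) would be to encode each special token on its own and confirm it comes back as exactly one ID:

# Encode each special token alone; each should come back as exactly one ID.
special_tokens = ["<|im_start|>", "<|im_end|>", "<s>"]
for tok in special_tokens:
    ids = vocab.tokenize(tok)
    status = "single token" if len(ids) == 1 else "split into %d tokens" % len(ids)
    print(tok, "->", list(ids), status)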
It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please provide the output you get and a description of what you expected to get.