alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Special tokens not showing up correctly when tokenized. #29

Open amazingvince opened 8 months ago

amazingvince commented 8 months ago

I tried adding some special tokens to the vocabulary of a pretrained model (I made a PR for a minor code fix). When I encode strings, these new tokens are sometimes broken into several tokens instead of being encoded as a single token.

How do I make sure my special tokens always map to the same ID? Code to reproduce what I am seeing:

import tokenmonster

# Load a pretrained vocabulary, add the special tokens, and resize back to 32000
vocab = tokenmonster.load("englishcode-32000-consistent-v1")
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
tokens = [vocab.tokenize(t) for t in text]
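A quick way to see the behavior (a minimal sketch, assuming the vocab and text defined above and only the standard vocab.tokenize / vocab.decode calls) is to tokenize each special token on its own and check whether it comes back as a single ID:

# Sketch: assumes vocab.tokenize(str) returns the token IDs for a string
# and vocab.decode(ids) reverses them.
for special in ["<|im_start|>", "<|im_end|>", "<s>"]:
    ids = vocab.tokenize(special)
    note = "single token" if len(ids) == 1 else "split into %d tokens" % len(ids)
    print(special, list(ids), note)

# Round-trip one of the test strings to see how it was segmented.
print(vocab.decode(vocab.tokenize(text[0])))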
alasdairforsythe commented 8 months ago

It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please provide the output you get and a description of the output you expected.