alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Special tokens not showing up correctly when tokenized. #29

Open amazingvince opened 8 months ago

amazingvince commented 8 months ago

I tried adding some special tokens to the vocabulary of a pretrained model (I made a PR for a minor code fix). When I encode strings, these new tokens are sometimes broken into several tokens instead of being encoded as a single token.

How do I make sure my special tokens always map to the same ID? Code to reproduce what I am seeing:

import tokenmonster

# Load a pretrained vocabulary, add the special tokens, and resize back to 32000
vocab = tokenmonster.load("englishcode-32000-consistent-v1")
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
tokens = [vocab.tokenize(t) for t in text]
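A quick way to see the behavior (a minimal sketch, assuming the vocab and text defined above and only the standard vocab.tokenize / vocab.decode calls) is to tokenize each special token on its own and check whether it comes back as a single ID:

# Sketch: assumes vocab.tokenize(str) returns the token IDs for a string
# and vocab.decode(ids) reverses them.
for special in ["<|im_start|>", "<|im_end|>", "<s>"]:
    ids = vocab.tokenize(special)
    note = "single token" if len(ids) == 1 else "split into %d tokens" % len(ids)
    print(special, list(ids), note)

# Round-trip one of the test strings to see how it was segmented.
print(vocab.decode(vocab.tokenize(text[0])))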
alasdairforsythe commented 8 months ago

It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please provide the output you get and a description of the output you expected.