alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

code-65536 models cannot decode #23

Closed gautierdag closed 10 months ago

gautierdag commented 10 months ago

Hi, I was just trying out the code tokenizers, and it seems that none of the code-65536-* models can decode:

import tokenmonster

tokenizer = tokenmonster.load("code-65536-balanced-nocapcode-v1")
tokens = tokenizer.tokenize("hello world") # [  127 51042]
decoded_string = tokenizer.decode(tokens)
print(decoded_string)
> ''

The 100k and 32k models work.

alasdairforsythe commented 10 months ago

Thanks for the bug report.

It was caused by the maximum token ID overflowing when checking for valid token IDs: with a 65536-token vocabulary, the value 65536 does not fit in a uint16 (maximum 65535) and wraps around to 0, so every token ID was treated as invalid.
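To illustrate the kind of wraparound described above, here is a minimal Python sketch. It is not tokenmonster's actual code: the helper names (`as_uint16`, `is_valid_buggy`, `is_valid_fixed`) are hypothetical, and the 16-bit storage is emulated with a bit mask because Python integers do not overflow on their own.

```python
# Hypothetical illustration of a uint16 wraparound in a validity check.
# These functions are NOT part of tokenmonster; they only mimic the bug class.

UINT16_MASK = 0xFFFF  # largest value a uint16 can hold (65535)


def as_uint16(n: int) -> int:
    """Emulate storing n in an unsigned 16-bit integer (wraps modulo 65536)."""
    return n & UINT16_MASK


def is_valid_buggy(token_id: int, vocab_size: int) -> bool:
    # Assumed bug shape: the upper bound is held in a uint16,
    # so a vocabulary size of 65536 wraps to 0 and nothing is below it.
    max_id = as_uint16(vocab_size)
    return token_id < max_id


def is_valid_fixed(token_id: int, vocab_size: int) -> bool:
    # Keep the bound in a type wide enough to represent 65536.
    return 0 <= token_id < vocab_size


vocab_size = 65536
tokens = [127, 51042]  # the token IDs from the report above

buggy = [is_valid_buggy(t, vocab_size) for t in tokens]
fixed = [is_valid_fixed(t, vocab_size) for t in tokens]

print(as_uint16(vocab_size))  # the wrapped upper bound
print(buggy)
print(fixed)
```

With the bound wrapped to 0, no token ID can pass the check, which would be consistent with decode returning an empty string. A smaller vocabulary such as 32000 fits in a uint16, so the same check would pass there.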

I've fixed this and pushed the change. It requires updating to the latest version (1.1.11):

pip install --upgrade --no-cache-dir tokenmonster

Please let me know if you encounter any issues.