alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License
528 stars, 20 forks

What is the difference between `50256-consistent-oneword` and `50256-consistent`? #24

Open Calvinnncy97 opened 10 months ago

Calvinnncy97 commented 10 months ago

Discussed in https://github.com/alasdairforsythe/tokenmonster/discussions/22

Originally posted by **Calvinnncy97** September 4, 2023

Hey guys,

Specifically, I would like to ask which flags were set to train these two tokenizers. I can't find any flag that forces the tokenizer to have only one-word tokens.

Thank you.

alasdairforsythe commented 10 months ago

The oneword vocabularies (none of which are released) have the `-words-per-token 1` parameter set in `getalltokens`. Currently the `-words-per-token` parameter is only implemented for the strict and consistent modes.

I'd also recommend the flags `-mode consistent -charset UTF8 -only-latin -only-valid` during `getalltokens`.
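
To make the difference concrete, here is a minimal sketch that assembles the two `getalltokens` command lines using only the flags named in this thread. The binary path `./getalltokens` and the helper `getalltokens_args` are assumptions made for illustration. It is also an assumption that the plain consistent vocabulary was trained with exactly the recommended flags. Dataset and output arguments are deliberately left to the caller via `extra`, since they are not mentioned above.

```python
import shlex


def getalltokens_args(oneword, extra=()):
    """Build an argv list for getalltokens from the flags named in this thread.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    args = [
        "./getalltokens",        # assumed location of the compiled binary
        "-mode", "consistent",   # -words-per-token needs strict or consistent mode
        "-charset", "UTF8",
        "-only-latin",
        "-only-valid",
    ]
    if oneword:
        # The one setting that distinguishes a "oneword" vocabulary.
        args += ["-words-per-token", "1"]
    # Anything else (dataset, output, etc.) is passed through unchanged.
    args += list(extra)
    return args


consistent = getalltokens_args(oneword=False)
oneword = getalltokens_args(oneword=True)

print(shlex.join(consistent))
print(shlex.join(oneword))
```

The two argument lists differ only by the trailing `-words-per-token 1`, which matches the explanation above: the oneword variant is the consistent-mode setup with that single extra parameter.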