alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License
548 stars, 19 forks

Continuous training: Deleted 0 of 0 tokens; Remaining 0 tokens; reachedMidway withinVocabX2 reachedVocab #7

Closed by ianderrington 1 year ago

ianderrington commented 1 year ago

Splendid to see this algorithm and the name is stellar. I've been excited to test it out since it was shared!

I finally processed my files and got a vocab file. I ran the command on a very small text sample for testing purposes:

./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2

The output shows no apparent problems:

...
2023/06/08 09:55:34 Tokens before trimming: 350634
2023/06/08 09:55:34 Trimming final tokens for min 100
2023/06/08 09:55:34 Tokens after trimming: 14770
2023/06/08 09:55:34 Saving tokens list
2023/06/08 09:55:34 Done

I then ran the next step:

./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 1048575

It looks like I needed to set -vocab to less than the total number of tokens after trimming (14770 here, versus the 1048575 I passed). Perhaps that should be documented in the notes.
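
A quick way to catch this before starting a long training run is to read the "Tokens after trimming" count from the getalltokens log and compare it with the intended -vocab value. The Python below is only a rough sketch: the helper names are hypothetical, the log text is copied from the output above, the value 8000 is an arbitrary example, and the "must be smaller" rule is my inference from this run rather than documented tokenmonster behaviour.

```python
import re

# Log lines copied from the getalltokens output quoted in this issue.
LOG = """\
2023/06/08 09:55:34 Tokens before trimming: 350634
2023/06/08 09:55:34 Trimming final tokens for min 100
2023/06/08 09:55:34 Tokens after trimming: 14770
"""


def tokens_after_trimming(log_text):
    """Return the count from the 'Tokens after trimming' line, or None if absent."""
    match = re.search(r"Tokens after trimming:\s*(\d+)", log_text)
    return int(match.group(1)) if match else None


def vocab_fits(requested_vocab, log_text):
    """True if requested_vocab is smaller than the trimmed token count.

    Assumption: the strict '<' mirrors the inference in this issue,
    not a rule confirmed by the tokenmonster documentation.
    """
    available = tokens_after_trimming(log_text)
    return available is not None and requested_vocab < available


print(vocab_fits(1048575, LOG))  # False: larger than the 14770 tokens available
print(vocab_fits(8000, LOG))     # True: 8000 is an arbitrary smaller example
```

With the numbers from this run, the requested 1048575 is far above the 14770 tokens left after trimming, which matches the "Deleted 0 of 0 tokens" output in the title.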

ianderrington commented 1 year ago

Oh wow, I can read. More steps to do...