Splendid to see this algorithm and the name is stellar. I've been excited to test it out since it was shared!
I finally processed my files and got a vocab file. For testing, I ran the command on a very small text sample:
./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2
The output shows no apparent problems:
...
2023/06/08 09:55:34 Tokens before trimming: 350634
2023/06/08 09:55:34 Trimming final tokens for min 100
2023/06/08 09:55:34 Tokens after trimming: 14770
2023/06/08 09:55:34 Saving tokens list
2023/06/08 09:55:34 Done
I then executed the next part of the pipeline:
./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 1048575
It looks like I needed to set -vocab to a value below the total number of tokens after trimming (14,770 here). Perhaps that should be documented in the notes.
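Until that's documented, one workaround is to derive the -vocab value from the getalltokens log itself. This is just a sketch: it assumes the log was saved to a file named getalltokens.log and that the "Tokens after trimming" line keeps the format shown above.

```shell
# Hypothetical helper: read the trimmed token count from a saved
# getalltokens log and pick a -vocab value strictly below it.
# "getalltokens.log" and the exact log-line format are assumptions
# based on the output pasted above.
trimmed=$(grep -o 'Tokens after trimming: [0-9]*' getalltokens.log | awk '{print $NF}')
vocab=$((trimmed - 1))
echo "use -vocab $vocab (must be below $trimmed)"
# Then pass it along, e.g.:
# ./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir \
#   -dictionary vocab.vc -max-token-length 64 -vocab "$vocab"
```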