alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

"data is required error" #5

Closed by enpassanty 1 year ago

enpassanty commented 1 year ago

I'm getting a "dataset is required" error with this command:

./getalltokens -capcode true -charset UTF-8 -chunk-size 100000 -dataset /Users/me/wikitext-103-raw/wikitest.raw -max-token-length 1000 -micro-chunks 10 -min-occur 30 -min-occur-chunk 3 -output string enwiki_dict.txt -workers 1

alasdairforsythe commented 1 year ago

The command-line flags are parsed using the Go standard library, so it should work. Try putting the file path in "quotes".
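Since the flags go through Go's standard `flag` package, two of its documented rules may be relevant to the command above: parsing stops at the first non-flag argument, and a boolean flag only takes a value in the `-flag=value` form. The sketch below is not the getalltokens source. It assumes a hypothetical tool where `-capcode` is declared as a boolean and `-dataset` as a string, and shows that `-capcode true` then leaves `-dataset` unparsed. If that assumption holds for getalltokens, writing `-capcode=true` would be worth trying. The literal word `string` after `-output` also looks like a placeholder copied from the help text, though that is a guess.

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs mimics a hypothetical tool with a boolean -capcode flag and a
// string -dataset flag. These declarations are assumptions for illustration,
// not the actual getalltokens flag definitions.
func parseArgs(args []string) (capcode bool, dataset string, rest []string, err error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.BoolVar(&capcode, "capcode", false, "hypothetical boolean flag")
	fs.StringVar(&dataset, "dataset", "", "hypothetical dataset path")
	err = fs.Parse(args)
	// fs.Args() returns whatever was left over once parsing stopped.
	return capcode, dataset, fs.Args(), err
}

func main() {
	// "-capcode true": the bare word "true" is treated as the first non-flag
	// argument, so parsing stops and -dataset is never read.
	_, dataset, rest, _ := parseArgs([]string{"-capcode", "true", "-dataset", "corpus.txt"})
	fmt.Printf("space form:  dataset=%q leftover=%q\n", dataset, rest)

	// "-capcode=true": the boolean gets its value and parsing continues.
	_, dataset, rest, _ = parseArgs([]string{"-capcode=true", "-dataset", "corpus.txt"})
	fmt.Printf("equals form: dataset=%q leftover=%q\n", dataset, rest)
}
```

In the first call the dataset value stays empty, which is the condition a "dataset is required" check would trip on. In the second call it is set to `corpus.txt`.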