Open Calvinnncy97 opened 1 year ago
The oneword vocabularies (none of which are released) have the -words-per-token 1 parameter set in getalltokens. Currently the words-per-token parameter is only implemented for strict and consistent modes.
I'd also recommend the flags -mode consistent -charset UTF8 -only-latin -only-valid during getalltokens.
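For illustration, the flags mentioned above could be combined into one getalltokens call roughly as follows. This is only a sketch: the dataset and output file names are hypothetical placeholders, and the -dataset and -output flags are assumed here rather than taken from this comment, so check them against your build before running.

```shell
# Hypothetical file names; replace with your own dataset and output path.
DATASET="dataset.txt"
OUTPUT="oneword_tokens.bin"

# -mode, -charset, -only-latin, -only-valid and -words-per-token come from the
# comment above; -dataset and -output are assumptions made for this sketch.
CMD="./getalltokens -dataset $DATASET -output $OUTPUT -mode consistent -charset UTF8 -only-latin -only-valid -words-per-token 1"

# Print the assembled command for review instead of executing it directly.
echo "$CMD"
```

Once the printed command looks right for your setup, it can be run with eval "$CMD".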
Discussed in https://github.com/alasdairforsythe/tokenmonster/discussions/22