This PR adds a CLI to the vocabulary filtering scripts and makes it possible to create raw (unfiltered) and Unicode-filtered vocabularies from a number of subword-segmented text files.
Additionally, two CLI arguments, --min_delta and --keep_special_tokens, were added to the training and inference scripts, respectively.
Changes:
Add the --min_delta CLI argument to the training script.
Add the --keep_special_tokens CLI argument to the inference script.
Fix an import and speed up the membership test used during vocabulary creation.
Add a CLI for creating vocabularies of all kinds (raw, Unicode-filtered and frequency-filtered).
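
For reviewers, here is a rough sketch of the three vocabulary kinds and of why a set helps the membership test. It is not the code in this PR: the function names (count_tokens, is_allowed, build_vocab), the min_count parameter and the Unicode rule (keep tokens made only of letters, numbers and punctuation) are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated subword tokens over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def is_allowed(token, allowed_prefixes=("L", "N", "P")):
    """Hypothetical Unicode filter: every character must be a letter (L),
    number (N) or punctuation (P) according to its Unicode category."""
    return all(unicodedata.category(ch)[0] in allowed_prefixes for ch in token)


def build_vocab(lines, unicode_filter=False, min_count=1):
    """Build a raw, Unicode-filtered and/or frequency-filtered vocabulary.

    The result is a set, so `token in vocab` is a hash lookup instead of
    the linear scan a list would need.
    """
    counts = count_tokens(lines)
    vocab = set(counts)  # raw (unfiltered) vocabulary
    if unicode_filter:
        vocab = {tok for tok in vocab if is_allowed(tok)}
    if min_count > 1:
        vocab = {tok for tok in vocab if counts[tok] >= min_count}
    return vocab


# Example with BPE-style "@@" continuation markers (assumed segmentation).
sample = ["he@@ llo wor@@ ld", "he@@ llo \u263a"]
raw_vocab = build_vocab(sample)
unicode_vocab = build_vocab(sample, unicode_filter=True)
frequent_vocab = build_vocab(sample, min_count=2)
```

Note that under this assumed rule a SentencePiece-style "▁" marker would be filtered out, since it is a symbol rather than punctuation; a real filter would have to whitelist whatever marker the segmenter uses.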