a-rios / longmbart

Apache License 2.0

Improve vocab filtering #3

Closed nicolasspring closed 3 years ago

nicolasspring commented 3 years ago

This PR adds a command-line interface (CLI) to the vocabulary filtering scripts and makes it possible to create raw (unfiltered) and Unicode-filtered vocabularies from multiple subword-segmented text files. It also adds two CLI arguments: --min_delta to the training script and --keep_special_tokens to the inference script.
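For reviewers unfamiliar with the idea, here is a minimal sketch of the difference between a raw and a Unicode-filtered vocabulary. It is not the code in this PR: the function names, the in-memory sample data, and the "keep only Latin-script pieces" rule are assumptions made for illustration, and the actual filter criteria in the scripts may differ.

```python
import unicodedata
from collections import Counter
from typing import Iterable

SPM_MARKER = "\u2581"  # SentencePiece word-boundary marker


def build_raw_vocab(lines: Iterable[str]) -> Counter:
    """Count every whitespace-separated subword piece (no filtering)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def is_latin_piece(piece: str) -> bool:
    """Hypothetical filter rule: every letter in the piece is Latin-script."""
    for ch in piece:
        # Ignore the boundary marker, digits, and punctuation.
        if ch == SPM_MARKER or not ch.isalpha():
            continue
        if not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


def filter_vocab(raw: Counter) -> Counter:
    """Keep only the pieces that pass the (assumed) Unicode filter."""
    return Counter({p: c for p, c in raw.items() if is_latin_piece(p)})


# Illustrative stand-in for lines read from subword-segmented text files.
segmented = ["▁Das ▁ist ▁ein ▁Test", "▁Это ▁тест ▁ein"]
raw_vocab = build_raw_vocab(segmented)
filtered_vocab = filter_vocab(raw_vocab)
```

Both functions accept any iterable of lines, so an open file handle, or several chained with `itertools.chain`, can be passed in place of the sample list.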

Changes: