LalitaDeelert / lalita-mt-zhth

Apache License 2.0

Tokenization scripts using `sentencepiece` in `scripts/train_tokenizer.py` and `scripts/tokenize_data.py` #4

Closed cstorm125 closed 3 years ago

cstorm125 commented 3 years ago
  1. `scripts/train_tokenizer.py` should train a sentencepiece tokenizer on both th and zh data to get a tokenizer that can tokenize both languages (see the training sketch below).
  2. `scripts/tokenize_data.py` should use the trained tokenizer to tokenize data that has been cleaned (see the tokenization sketch below).
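
A minimal sketch of what `scripts/train_tokenizer.py` could look like, assuming the cleaned Thai and Chinese corpora live in plain-text files with one sentence per line. The file paths, vocab size, and `character_coverage` value here are assumptions for illustration, not the repo's actual settings:

```python
import sentencepiece as spm

# Train one shared model on both languages so a single tokenizer
# can segment Thai and Chinese text. `input` accepts multiple files.
spm.SentencePieceTrainer.train(
    input=["data/cleaned/train.th", "data/cleaned/train.zh"],  # hypothetical paths
    model_prefix="spm_zhth",    # writes spm_zhth.model and spm_zhth.vocab
    vocab_size=32000,           # assumed size; tune for the corpus
    model_type="unigram",       # sentencepiece's default algorithm
    character_coverage=0.9995,  # high coverage is recommended for CJK-like scripts
)
```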
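And a corresponding sketch for `scripts/tokenize_data.py`, loading the trained model and encoding cleaned text; again, the input/output paths are placeholders:

```python
import sentencepiece as spm

# Load the shared model produced by the training script.
sp = spm.SentencePieceProcessor(model_file="spm_zhth.model")

# The same processor handles both languages.
print(sp.encode("你好，世界", out_type=str))    # subword pieces for Chinese
print(sp.encode("สวัสดีชาวโลก", out_type=str))  # subword pieces for Thai

# Write out a tokenized corpus, one space-joined line per input line.
with open("data/cleaned/train.zh") as fin, open("data/tokenized/train.zh", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```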
cstorm125 commented 3 years ago

Deprecated: implemented instead in `scripts/train_shared_tokenizer.py`.