LalitaDeelert / lalita-mt-zhth

Apache License 2.0

Tokenization scripts using `sentencepiece` in `scripts/train_tokenizer.py` and `scripts/tokenize_data.py` #4

Closed cstorm125 closed 3 years ago

cstorm125 commented 3 years ago
  1. `scripts/train_tokenizer.py` should train a sentencepiece tokenizer on both th and zh data to get a tokenizer that can tokenize both languages (see the training sketch below).
  2. `scripts/tokenize_data.py` should use the trained tokenizer to tokenize data that has been cleaned (see the tokenization sketch below).
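
A minimal sketch of what `scripts/train_tokenizer.py` could look like, assuming the cleaned Thai and Chinese corpora live in plain-text files with one sentence per line. The file paths, vocab size, and `character_coverage` value here are assumptions for illustration, not the repo's actual settings:

```python
import sentencepiece as spm

# Train one shared model on both languages so a single tokenizer
# can segment Thai and Chinese text. `input` accepts multiple files.
spm.SentencePieceTrainer.train(
    input=["data/cleaned/train.th", "data/cleaned/train.zh"],  # hypothetical paths
    model_prefix="spm_zhth",    # writes spm_zhth.model and spm_zhth.vocab
    vocab_size=32000,           # assumed size; tune for the corpus
    model_type="unigram",       # sentencepiece's default algorithm
    character_coverage=0.9995,  # high coverage is recommended for CJK-like scripts
)
```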
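And a corresponding sketch for `scripts/tokenize_data.py`, loading the trained model and encoding cleaned text; again, the input/output paths are placeholders:

```python
import sentencepiece as spm

# Load the shared model produced by the training script.
sp = spm.SentencePieceProcessor(model_file="spm_zhth.model")

# The same processor handles both languages.
print(sp.encode("你好，世界", out_type=str))    # subword pieces for Chinese
print(sp.encode("สวัสดีชาวโลก", out_type=str))  # subword pieces for Thai

# Write out a tokenized corpus, one space-joined line per input line.
with open("data/cleaned/train.zh") as fin, open("data/tokenized/train.zh", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```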
cstorm125 commented 3 years ago

Deprecated: implemented instead in `scripts/train_shared_tokenizer.py`.