Closed by cstorm125 3 years ago
Deprecated, as implemented in scripts/train_shared_tokenizer.py
scripts/train_tokenizer.py should train a sentencepiece tokenizer using both th and zh data, to get a tokenizer that can tokenize both languages.
scripts/tokenize_data.py should use the trained tokenizer to tokenize data that has been cleaned.
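
A minimal sketch of what the two steps could look like, assuming the cleaned th and zh corpora are plain-text files with one sentence per line. The function names, the model prefix, the vocabulary size and the character coverage are illustrative assumptions, not the repository's actual code or settings; only the sentencepiece calls (SentencePieceTrainer.train, SentencePieceProcessor, encode) are real library API.

```python
def train_shared_tokenizer(th_path, zh_path, model_prefix="th_zh_shared", vocab_size=32000):
    """Train one SentencePiece model on Thai and Chinese text together.

    All defaults here are hypothetical placeholders, not project settings.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=f"{th_path},{zh_path}",  # comma-separated list of corpus files
        model_prefix=model_prefix,     # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        character_coverage=0.9995,     # assumed value; tune for the real data
        model_type="unigram",
    )
    return f"{model_prefix}.model"


def tokenize_lines(encode, lines):
    """Yield each non-empty line as space-joined pieces from `encode`."""
    for line in lines:
        line = line.strip()
        if line:
            yield " ".join(encode(line))


def tokenize_file(model_file, in_path, out_path):
    """Tokenize a cleaned text file with the trained shared model."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_file)

    def encode(text):
        # out_type=str returns subword pieces instead of integer ids
        return sp.encode(text, out_type=str)

    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for pieces in tokenize_lines(encode, fin):
            fout.write(pieces + "\n")
```

Training on both files at once gives a single vocabulary shared by the two languages, so the same model file can be passed to tokenize_file for th and for zh data.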