bigscience-workshop / catalogue_data

Scripts to prepare catalogue data
Apache License 2.0

Fix vi sent tokenizer #54

Closed lvwerra closed 2 years ago

lvwerra commented 2 years ago

Adds a dedicated sentence tokenizer for Vietnamese using underthesea.

lvwerra commented 2 years ago

These are the affected files; sentence splitting should not be applied to any of the other vi files:

lm_vi_wiktionary_filtered
lm_vi_wikibooks_filtered
lm_vi_wikiquote_filtered
lm_vi_wikivoyage_filtered
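
For illustration, the rule above could be sketched roughly as follows. This is not the code from this PR: `VI_SENTENCE_SPLIT_DATASETS`, `underthesea_splitter`, and `maybe_split_sentences` are hypothetical names, and the sketch assumes each record carries its dataset name. The only library call used is `sent_tokenize` from underthesea, which is imported lazily so that a different splitter can be passed in when the package is not installed.

```python
from typing import Callable, List, Optional

# The four dataset names listed in this PR as needing sentence splitting.
VI_SENTENCE_SPLIT_DATASETS = frozenset({
    "lm_vi_wiktionary_filtered",
    "lm_vi_wikibooks_filtered",
    "lm_vi_wikiquote_filtered",
    "lm_vi_wikivoyage_filtered",
})


def underthesea_splitter(text: str) -> List[str]:
    """Split Vietnamese text into sentences with underthesea."""
    # Imported lazily so this module loads even without underthesea installed.
    from underthesea import sent_tokenize

    return sent_tokenize(text)


def maybe_split_sentences(
    dataset_name: str,
    text: str,
    splitter: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: split only the vi datasets listed above."""
    if dataset_name not in VI_SENTENCE_SPLIT_DATASETS:
        # Every other dataset keeps its text as a single unit.
        return [text]
    splitter = splitter or underthesea_splitter
    # Drop empty or whitespace-only fragments the splitter may return.
    return [sentence for sentence in splitter(text) if sentence.strip()]
```

Keeping the affected names in one explicit set makes it easy to check that no other vi dataset picks up sentence splitting by accident.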