arahusky / diacritics_restoration

Neural-based model for automatic diacritics restoration.

Empty README in data #2

Open stoianmihail opened 4 years ago

stoianmihail commented 4 years ago

How can I generate the data for the Romanian language? Thanks a lot!

arahusky commented 4 years ago

Hello @stoianmihail, the folder https://github.com/arahusky/diacritics_restoration/tree/master/data/create_corpus_scripts contains a README, so have a look there. It stores scripts that automatically download clean monolingual data.

In case you already have monolingual data, simply run https://github.com/arahusky/diacritics_restoration/blob/master/data/diacritization_stripping.py to remove diacritics from it.
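
For readers who want to see what the stripping step amounts to, here is a minimal sketch of one common way to remove diacritics in Python using only the standard library. It is not the contents of `diacritization_stripping.py`, which may work differently; the function names `strip_diacritics` and `make_pair` are hypothetical and chosen only for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Assumption: this is a generic Unicode approach, not the repository's
    own implementation. Letters with no canonical decomposition
    (for example 'đ' or 'ł') are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks such as
    # the breve in 'ă' or the comma below in 'ș' and 'ț'.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def make_pair(line: str) -> tuple[str, str]:
    """Return (input without diacritics, original target) for one line."""
    return strip_diacritics(line), line


if __name__ == "__main__":
    source, target = make_pair("învățământ științific")
    print(source)  # invatamant stiintific
    print(target)  # învățământ științific
```

Applied line by line to a monolingual Romanian corpus, a function like this yields stripped input paired with the original text as the target, which is the kind of data a restoration model is trained on.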