Open: stoianmihail opened this issue 4 years ago
Hello @stoianmihail, have a look at https://github.com/arahusky/diacritics_restoration/tree/master/data/create_corpus_scripts, which contains a README. That folder stores scripts that automatically download clean monolingual data.
In case you already have monolingual data, simply run https://github.com/arahusky/diacritics_restoration/blob/master/data/diacritization_stripping.py to remove diacritics from it.
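For illustration, here is a minimal sketch of what stripping diacritics can look like in Python. It is not the code from `diacritization_stripping.py`, and the function name `strip_diacritics` is made up for this example. It assumes that decomposing the text with Unicode NFD and dropping the combining marks is good enough, which holds for the Romanian letters ă, â, î, ș, ț.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper, not the repo's API)."""
    # NFD splits each accented letter into a base letter plus combining marks,
    # e.g. "ș" becomes "s" + U+0326 (combining comma below).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    original = "țară și pădure în România"
    # Each training example pairs the stripped input with the original target.
    print(strip_diacritics(original), "\t", original)
```

Letters that have no Unicode decomposition (for example "đ" or "ł") pass through unchanged with this approach, so other languages may need an explicit mapping table on top.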
If I want to generate the data for the Romanian language, how could I do that? Thanks a lot!