Normalizing texts in other languages with lucem_illud_2020

Computational-Content-Analysis-2020 / frequently-asked-questions

Repo to ask questions and see answers


Normalizing texts in other languages with lucem_illud_2020 #29

Closed luisesanmartin closed 4 years ago

luisesanmartin commented 4 years ago

Is it possible to use lucem_illud_2020.normalizeTokens() for normalizing texts in Spanish or other languages?

bhargavvader commented 4 years ago

It is indeed. You would need to download a spaCy language model for that language (https://spacy.io/models/es). I'm going to add an option for users to manually pass their own language model in the next lucem_illud update. If you need it right away: check the code for normalising tokens in HW4, and replace the language model line nlp = spacy.load('en') with one that loads your Spanish model.
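A minimal sketch of that swap, in case it helps. It is not the lucem_illud_2020 implementation: `normalize_tokens` and `load_spanish_model` are hypothetical helper names, and the filtering rules (drop stop words, punctuation, and whitespace; keep lowercased lemmas) are an assumption about what a normalizer of this kind typically does. The model name `es_core_news_sm` is the small Spanish pipeline listed on the spaCy models page. On spaCy v3 and later, shortcut names such as 'en' are no longer accepted, so the full package name is used here.

```python
def normalize_tokens(text, nlp):
    """Return lowercased lemmas of `text`, skipping stop words, punctuation, and whitespace.

    Hypothetical stand-in for the HW4 normalizer; `nlp` is whichever spaCy
    pipeline you loaded, so the language is decided by the caller.
    """
    normalized = []
    for token in nlp(text):
        if token.is_stop or token.is_punct or token.is_space:
            continue
        # Fall back to the surface form if the pipeline produced no lemma.
        lemma = token.lemma_ or token.text
        normalized.append(lemma.lower())
    return normalized


def load_spanish_model(name="es_core_news_sm"):
    """Load a Spanish spaCy pipeline (assumed to be installed already).

    Install it first with:
        pip install spacy
        python -m spacy download es_core_news_sm
    """
    import spacy  # imported here so normalize_tokens works with any pipeline object

    return spacy.load(name)
```

With the model installed, `normalize_tokens("Los gatos corren por las calles.", load_spanish_model())` returns a list of Spanish lemmas. Passing the pipeline in as an argument, rather than loading it inside the function, means the same helper works for any language spaCy has a model for.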

luisesanmartin commented 4 years ago

Awesome. Thanks!