doozan / spanish_data

Spanish to English dictionary, frequency list, and lemma data
Creative Commons Attribution 4.0 International

What are the needed procedures to produce "es_allforms.csv"? #5

Open jarork opened 12 hours ago

jarork commented 12 hours ago

Hi doozan, it is impressive to see every inflected form mapped back to its lemma in the "es_allforms.csv" file (e.g. hablaríamos,v,hablar). I'm in the early stages of designing an AI language-learning app with a vocabulary teaching module, and I've recently been looking for word lists in different languages.

If I wanted to build a vocab list for German, could you suggest a good method for removing inflections (lemmatization) from a word frequency list? Many thanks, Jake.

jarork commented 10 hours ago

I have tried spaCy with Python, but it doesn't lemmatize well. For example, when the input is "hablas", it outputs "habla" instead of "hablar".
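To make concrete what I'm hoping to do with a file like this, here is a minimal sketch of a lookup-based approach. It assumes each row has the three-column form,pos,lemma layout of the example above and that there is no header row; I haven't confirmed either against the real file. The sample rows, the frequency numbers, and the function names `load_allforms` and `lemmatize_counts` are made up for illustration and are not how the repository builds its data.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows in the assumed form,pos,lemma layout.
SAMPLE_ALLFORMS = """\
hablaríamos,v,hablar
hablas,v,hablar
habla,v,hablar
habla,n,habla
"""


def load_allforms(handle):
    """Map each inflected form to a list of (pos, lemma) pairs."""
    table = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        form, pos, lemma = row
        table[form].append((pos, lemma))
    return table


def lemmatize_counts(freqs, table):
    """Fold per-form counts into per-lemma counts.

    Ambiguous forms go to the first lemma listed (a naive choice);
    forms missing from the table are kept unchanged.
    """
    totals = Counter()
    for form, count in freqs.items():
        candidates = table.get(form)
        lemma = candidates[0][1] if candidates else form
        totals[lemma] += count
    return totals


table = load_allforms(io.StringIO(SAMPLE_ALLFORMS))
freqs = {"hablas": 10, "hablaríamos": 2, "habla": 5, "xyz": 1}  # made-up counts
totals = lemmatize_counts(freqs, table)
print(totals.most_common())
```

With a table like this, "hablas" resolves to "hablar" directly, which is the behavior I could not get from spaCy. Ambiguous forms such as "habla" would still need a better rule than "first match wins", for example a preference by part of speech.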