bst-mug / acres

Acronym expansion module based on word embeddings and filtering rules
Apache License 2.0

German Stemmer #123

Open michelole opened 4 years ago

michelole commented 4 years ago

Since plural forms and grammatical case variants are all considered perfect matches in our annotation guidelines, we could apply a stemmer to the data to make our models denser.

However, we might need to annotate the new expansions, because some pairs might drop in the ranking after stemming if the stemmed form is itself considered an abbreviation (e.g. "Vorbefund" -> "Vorbefu", "Vesikuläratmen" -> "Vesikuläratm", "Operation" -> "Operatio").

The CISTEM stemmer seems to improve results over the Porter stemmer and has a Python implementation in NLTK.
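
A rough sketch of how this could look, assuming NLTK is installed and that its `nltk.stem.cistem.Cistem` class is the implementation meant here. The helpers `load_cistem` and `group_by_stem` are hypothetical names for illustration and not part of acres. Grouping the original words by stem shows which forms would be merged, and the resulting stems can be reviewed by hand for forms that look like abbreviations.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def load_cistem() -> Optional[Callable[[str], str]]:
    """Return the stem function of NLTK's CISTEM stemmer, or None if NLTK is missing."""
    try:
        from nltk.stem.cistem import Cistem
    except ImportError:
        return None
    return Cistem().stem


def group_by_stem(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each stem to the original words that collapse onto it (hypothetical helper)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    stem = load_cistem()
    if stem is None:
        print("NLTK is not installed; skipping the CISTEM demo.")
    else:
        # Words taken from the examples in this issue; print stems for manual review.
        for s, originals in group_by_stem(["Vorbefund", "Vesikuläratmen", "Operation"], stem).items():
            print(s, "<-", originals)
```

Passing the stemmer in as a plain callable keeps the grouping logic independent of NLTK, so another stemmer could be swapped in for comparison.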

Relates to #87.