Liebeck / IWNLP.Lemmatizer

IWNLP.Lemmatizer is a dictionary-based lemmatizer for the German language
http://www.iwnlp.com/
MIT License

Incomplete project #1

Open strohhut opened 3 years ago

strohhut commented 3 years ago

I see that Lemmatizer.cs has the statement using IWNLP.Models; but that namespace is defined in a different solution: IWNLP from this repo. I think it would make sense to have a complete, buildable solution here, since the docs say "Clone IWNLP.Lemmatizer and build it".

Liebeck commented 3 years ago

@strohhut Yes, IWNLP.Lemmatizer references IWNLP, but as a direct reference to the built DLL rather than to the project itself. I agree, it would make sense to use only one solution and one Git repo.

To be honest, I've rarely used the C# implementation (only for evaluation and export) in the past. For NLP, I've been using the Python port https://github.com/Liebeck/IWNLP-py and my integration into spaCy https://github.com/Liebeck/spacy-iwnlp

Have you tried out IWNLP with the latest dump? Does it work or does it skip a lot of words? It's been some time for me outside of academia.

strohhut commented 3 years ago

I'm not sure how many words are skipped. During parsing, only a very small number of errors and exceptions are logged. I get the following word counts with a current dump:

AdjectivalDeclensionDeutschAdjektivischUebersichtTotal: 1664
AdjectivesTotal: 12142
Nouns: 80957
NounsDeutschNameUebersichtTotal: 313
NounsDeutschSubstantivUebersichtSchTotal: 456
VerbsConjugationIrregular: 2483
VerbsConjugationRegular: 9747
VerbsConjugationTotal: 12663
VerbsConjugationWeakInseparable: 0
VerbsTotal: 12011

but I'm not sure about the total number of words in Wiktionary. This page says there are 90170 pages for nouns and 11716 pages for verbs, for instance, but I don't know how accurate this is, since it lists fewer verbs than IWNLP.Lemmatizer ends up containing.
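
To put the two sets of numbers side by side, here is a minimal sketch that divides the extracted counts by the quoted page counts. It assumes that Nouns and VerbsTotal from the list above are the right figures to compare against the Wiktionary page counts, which may not hold if the two sources count entries differently. The names `extracted`, `wiktionary_pages` and `coverage` are made up for illustration and are not part of IWNLP.

```python
# Counts reported by the parser run above (Nouns and VerbsTotal).
extracted = {"nouns": 80957, "verbs": 12011}

# Page counts quoted from the Wiktionary statistics page.
wiktionary_pages = {"nouns": 90170, "verbs": 11716}


def coverage(extracted_count, page_count):
    """Return the extracted entries as a fraction of the Wiktionary pages."""
    return extracted_count / page_count


# Assumption: both dictionaries use the same part-of-speech keys.
ratios = {pos: coverage(extracted[pos], wiktionary_pages[pos]) for pos in extracted}

for pos, ratio in ratios.items():
    print(f"{pos}: {ratio:.1%}")
```

A ratio above 100% for verbs would only confirm that the two counts are not measuring the same thing, so these figures are a rough sanity check rather than a measure of how many words were skipped.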

How does the Python port differ from the .NET version? Is it recommended over .NET even for standalone use (e.g. without spaCy)?