Open: strohhut opened 3 years ago
@strohhut Yes, IWNLP.Lemmatizer references IWNLP, but as a direct reference to the built DLL rather than to the project itself. I agree, it would also make sense to use only one solution and one Git repo.
To be honest, I've rarely used the C# implementation (only for evaluation and export) in the past. For NLP, I've been using the Python port https://github.com/Liebeck/IWNLP-py and my integration into spaCy https://github.com/Liebeck/spacy-iwnlp
Have you tried out IWNLP with the latest dump? Does it work or does it skip a lot of words? It's been some time for me outside of academia.
I'm not sure how many words are skipped. During parsing, only a very small number of errors and exceptions are logged. With a current dump I get the following word counts:
AdjectivalDeclensionDeutschAdjektivischUebersichtTotal: 1664
AdjectivesTotal: 12142
Nouns: 80957
NounsDeutschNameUebersichtTotal: 313
NounsDeutschSubstantivUebersichtSchTotal: 456
VerbsConjugationIrregular: 2483
VerbsConjugationRegular: 9747
VerbsConjugationTotal: 12663
VerbsConjugationWeakInseparable: 0
VerbsTotal: 12011
However, I'm not sure about the total number of words in Wiktionary. This page says there are 90170 pages for nouns and 11716 pages for verbs, for instance, but I don't know how accurate this is, since it lists fewer verbs than IWNLP.Lemmatizer ends up containing.
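For a rough sanity check, the counts above can be set against the quoted page counts. This is only a sketch: it assumes that "Nouns" and "VerbsTotal" are the right figures to compare with the noun and verb page counts, and that parsed entries and Wiktionary pages are comparable units at all, which may not hold. The dictionary and function names are made up for illustration.

```python
# Counts reported in this thread for the current dump
# (assumption: "Nouns" and "VerbsTotal" are the comparable figures).
iwnlp_counts = {"nouns": 80957, "verbs": 12011}

# Page counts quoted in this thread from the Wiktionary statistics page.
wiktionary_pages = {"nouns": 90170, "verbs": 11716}


def coverage_ratio(parsed: int, pages: int) -> float:
    """Return the parsed entries as a fraction of the listed pages."""
    return parsed / pages


ratios = {
    pos: coverage_ratio(iwnlp_counts[pos], wiktionary_pages[pos])
    for pos in iwnlp_counts
}

for pos, ratio in ratios.items():
    print(f"{pos}: {ratio:.1%} of listed pages")
```

Under these assumptions, nouns come out at roughly 90% of the listed pages and verbs slightly above 100%, which matches the observation that the page lists fewer verbs than the lemmatizer contains and suggests the two sources are not counting the same thing.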
How is the Python port different from the .NET version? Is it recommended over .NET even for standalone use (e.g. without spaCy)?
I see that Lemmatizer.cs has the statement "using IWNLP.Models;" but that namespace is defined in a different solution: IWNLP from this repo. I think it would make sense to have a complete, buildable solution here, since the docs say "Clone IWNLP.Lemmatizer and build it".