Liebeck / IWNLP.Lemmatizer

IWNLP.Lemmatizer is a dictionary-based lemmatizer for the German language
MIT License

Incomplete project #1

Open strohhut opened 3 years ago

strohhut commented 3 years ago

I see that Lemmatizer.cs has the statement using IWNLP.Models; but that namespace is defined in a different solution: IWNLP from this repo. I think it would make sense to have a complete, buildable solution here, since the docs say "Clone IWNLP.Lemmatizer and build it".

Liebeck commented 3 years ago

@strohhut Yes, IWNLP.Lemmatizer references IWNLP, but as a direct reference to the built DLL rather than to the project itself. I agree, it would make sense to use only one solution and one Git repo.

To be honest, I've rarely used the C# implementation in the past (only for evaluation and export). For NLP, I've been using the Python port and my integration into spaCy.

Have you tried out IWNLP with the latest dump? Does it work or does it skip a lot of words? It's been some time for me outside of academia.

strohhut commented 3 years ago

I'm not sure how many words are skipped. During parsing, only a very small number of errors and exceptions are logged. I get the following number of words with a current dump:

AdjectivalDeclensionDeutschAdjektivischUebersichtTotal: 1664
AdjectivesTotal: 12142
Nouns: 80957
NounsDeutschNameUebersichtTotal: 313
NounsDeutschSubstantivUebersichtSchTotal: 456
VerbsConjugationIrregular: 2483
VerbsConjugationRegular: 9747
VerbsConjugationTotal: 12663
VerbsConjugationWeakInseparable: 0
VerbsTotal: 12011

but I'm not sure about the total number of words in Wiktionary. This page says there are 90170 pages for nouns and 11716 pages for verbs, for instance, but I don't know how accurate this is, since it lists fewer verbs than IWNLP.Lemmatizer produces.
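
To put the two sets of numbers side by side, here is a minimal Python sketch that computes the extracted counts as a fraction of the quoted Wiktionary page counts. It assumes that Nouns and VerbsTotal are the counts comparable to the page counts, which may not hold (a page can produce several entries, or none). The names extracted, wiktionary_pages and coverage are made up for this illustration and are not part of IWNLP.

```python
# Counts reported above for a current dump.
# Assumption: "Nouns" and "VerbsTotal" are the comparable figures.
extracted = {"nouns": 80957, "verbs": 12011}

# Page counts quoted from the Wiktionary page mentioned above.
wiktionary_pages = {"nouns": 90170, "verbs": 11716}


def coverage(extracted_count: int, page_count: int) -> float:
    """Return the extracted entries as a fraction of the Wiktionary pages."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return extracted_count / page_count


# One ratio per part of speech.
ratios = {pos: coverage(extracted[pos], wiktionary_pages[pos]) for pos in extracted}

for pos, ratio in ratios.items():
    print(f"{pos}: {ratio:.1%}")
```

With these numbers the nouns come out at roughly 89.8% and the verbs at roughly 102.5%. A ratio above 100% suggests the two figures are not counting the same thing.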

How is the Python port different from the .NET version? Is it recommended over .NET even for standalone use (e.g. without spaCy)?