danielnaber / jwordsplitter

small Java library for splitting German compound words

Lexicon expansion #15

Open GiPfi opened 6 years ago

GiPfi commented 6 years ago

After testing jwordsplitter on a dataset of German technical vocabulary, a number of words have been extracted which so far had been missing in the languagetool_dict.txt and germanPrefixes.txt lists. These words have been included and the tests have been adjusted accordingly. Further testing may result in more suggestions for words to be added.

danielnaber commented 6 years ago

Thanks. Have you checked the comment at languagetool-dict.txt? It's an export from LanguageTool, so any changes would be lost with the next update. Or have you run a new export?

GiPfi commented 6 years ago

Thanks for the hint! No, I haven't run a new export 😕

If I may ask... since unlike for the other languages, there's no german.dict in org/languagetool/resource/de/, are you using the de_DE.dict in org/languagetool/resource/de/hunspell as the German dictionary? I'd like to check the whole LanguageTool-lexicon for German to not add any duplicates to added.txt

danielnaber commented 6 years ago

No, we've used german.dict and added.txt, which are not used for spelling but contain part-of-speech information.
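For the duplicate check mentioned above, a minimal sketch could look like the following. It only covers the plain-text added.txt, not german.dict. It assumes that each entry is one tab-separated line with the token in the first field and that comment lines start with `#`; both are assumptions, not confirmed in this thread. The function names and the `TAG` placeholder are made up for illustration.

```python
def existing_tokens(added_lines):
    """Collect the token (first field) of every entry line.

    Assumption: fields are tab-separated and comment lines start with '#'.
    """
    tokens = set()
    for line in added_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens.add(line.split("\t")[0])
    return tokens


def new_candidates(candidates, added_lines):
    """Return candidate words not yet listed, in input order, without repeats."""
    known = existing_tokens(added_lines)
    seen = set()
    result = []
    for word in candidates:
        word = word.strip()
        if word and word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result


if __name__ == "__main__":
    # Hypothetical sample data; "TAG" stands in for a real PoS tag.
    added = ["# comment", "Alufelge\tAlufelge\tTAG"]
    print(new_candidates(["Alufelge", "Acryl", "Acryl", "lila"], added))
```

In practice the two lists would be read from the candidate file and from added.txt with `open(...)`; the comparison here is case-sensitive, which matters for German nouns.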

GiPfi commented 6 years ago

Ok great, thanks for the quick response! 😊

GiPfi commented 6 years ago

Do the tags and variants for German words that are unknown to the tagger, e.g. "Alufelge", have to be added manually? I assume that words without tags shouldn't be added to added.txt, and that the format should be token - lemma - PoS-tag 🤔

danielnaber commented 6 years ago

added.txt is part of LanguageTool, so I guess you're talking about that. As a compound, it's decompounded by jwordsplitter. If that doesn't work, adding alu and/or felge to additions.txt in jwordsplitter should help. Or you can add the compound to added.txt in LanguageTool using the format you mentioned.
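To illustrate why adding the parts helps, here is a toy dictionary-based splitter. It is not jwordsplitter's actual algorithm, only a simplified sketch of the general idea: a compound can be split only if every part is in the word list. The function name and the `min_len` parameter are hypothetical.

```python
def split_compound(word, parts, min_len=3):
    """Split `word` into known parts, or return None if no full split exists.

    Toy illustration only: tries the longest known prefix first and
    recurses on the remainder.
    """
    word = word.lower()
    if not word:
        return []  # nothing left to split: success
    for end in range(len(word), min_len - 1, -1):
        head = word[:end]
        if head in parts:
            rest = split_compound(word[end:], parts, min_len)
            if rest is not None:
                return [head] + rest
    return None


if __name__ == "__main__":
    print(split_compound("Alufelge", {"felge"}))         # "alu" is unknown
    print(split_compound("Alufelge", {"alu", "felge"}))  # both parts known
```

With only "felge" in the list the split fails, because "alu" is not covered; once both parts are present the word is split into its two components.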

GiPfi commented 6 years ago

Yes, sorry, as you suggested I moved to LanguageTool to add all missing words there. However, for each missing word the corresponding tag has to be obtained from the tagger, and there are quite a few words the tagger doesn't know. Is there any other automatic way to get the PoS info if the tagger doesn't provide it, or do I have to manually annotate the 700 words on my list?

danielnaber commented 6 years ago

If the tagger doesn't know the words, then there's not much you can do other than add them manually. Could you post some examples of unknown words?

GiPfi commented 6 years ago

Sure, so some examples would be: Acryl, Bändchen, befüllen, Freundlichkeit, Inspektion, lila, Mountainbike, PH-Wert, schließ (as Prefix or short verb form), techno, vertikal, x-fach. The file below contains all words: wordsTaggerUnknown.txt

danielnaber commented 6 years ago

Thanks, I've forwarded this to Julian of korrekturen.de, who helps us maintain the dictionary. Maybe he will add those words. But in any case it will take some time until they end up in LT.

GiPfi commented 6 years ago

Cool 😊 I'll let you know if I find more words when checking the splitter on other data sets