DuyguA / DEMorphy

German Morphological Analyzer

publish on pypi #9

Open lsmith77 opened 1 year ago

lsmith77 commented 1 year ago

it would be awesome to get the project registered on https://pypi.org/

DuyguA commented 1 year ago

Thanks for your comment! I made the library quite some time ago, and I don't remember why I skipped registering it on PyPI. Though the project is Python 3 compatible, I still want to do some revisions. I'll do that when I have time, and after that I can register the new package.

lsmith77 commented 1 year ago

that would be amazing. I was planning to try out this project. We are currently using https://github.com/gambolputty/german-nouns but are hoping to find a single library that can handle nouns, verbs and adjectives for German.

lsmith77 commented 1 year ago

FYI our use case is our inclusive writing assistant https://www.witty.works/ and we are looking for ways to make our alternatives grammatically correct.

So we need to align the alternatives with the word(s) we detected as problematic, i.e. give them the same inflection.

E.g. ambitionierten => engagierten

DuyguA commented 1 year ago

Ah OK, got it, so you need to match the morphological features as well. OK then, I'll update you here when I'm finished.

lsmith77 commented 1 year ago

exactly. thank you so much for your work

lsmith77 commented 1 year ago

Another wrinkle is sicherzustellen, which has the spacy lemma sicherstellen.

So if we have an alternative umstellen, we need to transform it to umzustellen, and an alternative bewirken needs to become zu bewirken.

Not sure if compound word splitting is within the scope here.

DuyguA commented 1 year ago

No, compound splitting is indeed not in scope. However, the case of sicherzustellen should be fairly easy. The lemma is not a substring of the surface form because there's a zu in between. If you cut the zu out of the surface form and join the remaining pieces, you get the lemma, so you can divide this word as sicher + zu + stellen.
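A minimal sketch of that check, assuming the surface form and its lemma are already available as plain strings (for example from spaCy). The function name `split_zu_infinitive` is made up for illustration and is not part of DEMorphy.

```python
def split_zu_infinitive(surface, lemma):
    """Return (first_part, verb_part) if surface is lemma with 'zu' inserted, else None."""
    surface, lemma = surface.lower(), lemma.lower()
    # An infixed "zu" makes the surface form exactly two characters longer.
    if len(surface) != len(lemma) + 2:
        return None
    # Try every inner position; starting at 1 skips a "zu" at the very beginning.
    for i in range(1, len(lemma)):
        if surface[i:i + 2] == "zu" and surface[:i] + surface[i + 2:] == lemma:
            return surface[:i], surface[i + 2:]
    return None


print(split_zu_infinitive("sicherzustellen", "sicherstellen"))
```

This only covers the infixed case; a separately written zu (as in zu bewirken) spans two tokens and would have to be handled at the token level.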

Actually, you can use my German corpus to generate a small model. I believe there are many words with zu, um and be in the corpus; you can feed those words to Phonetisaurus (https://github.com/AdolfVonKleist/Phonetisaurus). Phonetisaurus is originally a g2p (grapheme-to-phoneme) tool, so it can align sequences. So you train an efficient seq2seq model where the input sequences are the words as characters and the output sequences are the surface forms you want to create. I have a community day on 27th Jan; if you want, I can schedule a small consultation to offer some solutions (or better, make a tool for compound analysis; I have wanted to develop one for German for some time).
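As a rough illustration of the data preparation this implies, the sketch below turns (surface form, lemma) pairs into character-level training lines. It assumes such pairs have already been extracted from a corpus. The function name `char_level_pairs` and the tab-separated, space-delimited layout are assumptions; the aligner you actually use may expect a different input format.

```python
def char_level_pairs(tokens):
    """Build 'l e m m a<TAB>s u r f a c e' lines for forms with an infixed 'zu'.

    tokens: iterable of (surface, lemma) string pairs.
    """
    pairs = set()
    for surface, lemma in tokens:
        s, l = surface.lower(), lemma.lower()
        if len(s) != len(l) + 2:
            continue  # not "lemma plus two extra characters"
        for i in range(1, len(l)):
            if s[i:i + 2] == "zu" and s[:i] + s[i + 2:] == l:
                pairs.add((l, s))
                break
    # Input side: the lemma as characters; output side: the wanted surface form.
    return sorted(" ".join(l) + "\t" + " ".join(s) for l, s in pairs)


for line in char_level_pairs([("sicherzustellen", "sicherstellen"),
                              ("bewirken", "bewirken")]):
    print(line)
```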

lsmith77 commented 1 year ago

Thank you.

As noted, it is not too hard to detect that the source word sicherzustellen, with the spacy lemma sicherstellen, has a zu injected.

The hard part is then taking an alternative like bewirken, hochstellen or umstellen and knowing where to place the zu to align the form, i.e. zu bewirken, hochzustellen and umzustellen. Now, be as a prefix is regular (always prepend zu), but um is irregular. Also, hochstellen has the form adjective + verb, but to determine that, one first has to split the word.
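To make the difficulty concrete, here is a rule-based sketch of the placement step. The three word lists are tiny, hand-picked placeholders, not data from DEMorphy or spaCy; a real solution would need a lexicon or a trained model. A first part is only split off when the remainder is a known base verb, which is the "split first" step described above. Because um can be either separable or inseparable, the sketch returns both candidates for it rather than guessing.

```python
# Hypothetical, hand-picked lists for illustration only.
SEPARABLE_FIRST_PARTS = {"hoch", "sicher", "auf", "an", "aus", "ein"}
AMBIGUOUS_FIRST_PARTS = {"um", "durch", "über", "unter"}
KNOWN_BASE_VERBS = {"stellen", "setzen", "machen", "gehen"}


def zu_infinitive_candidates(lemma):
    """Return candidate zu-infinitives for a lemma as a list of strings."""
    lemma = lemma.lower()
    for i in range(1, len(lemma)):
        first, rest = lemma[:i], lemma[i:]
        if rest not in KNOWN_BASE_VERBS:
            continue  # only split when the remainder is a known verb
        if first in SEPARABLE_FIRST_PARTS:
            return [first + "zu" + rest]
        if first in AMBIGUOUS_FIRST_PARTS:
            # Separable and inseparable readings both exist; keep both.
            return [first + "zu" + rest, "zu " + lemma]
    # Inseparable prefixes (be-, ver-, ...) and simple verbs: prepend zu.
    return ["zu " + lemma]


for verb in ("bewirken", "hochstellen", "umstellen"):
    print(verb, "->", zu_infinitive_candidates(verb))
```

Choosing between the two um candidates needs context (the intended meaning of the verb), which is outside what a string rule can decide.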