alpheios-project / morph-client

Morphology Client Library Interface
ISC License

Whitakers: handle alternate spellings of principal parts #17

Closed: balmas closed 5 years ago

balmas commented 6 years ago

See https://github.com/alpheios-project/morphsvc/issues/3

If a parser returns multiple dict elements and a single mean element, should the mean be applied to all of them? Can we recover from this parser error?

balmas commented 6 years ago

Test cases for this:

aberis
adero
adjuvo (adiuvabo, adiuvante,...)
alo ('alitus' vs 'altus' in principal parts)
amicio ('amixi' vs 'amicui' in principal parts)
apta
auxilio
beatricem (trico vs tricor)
blandiatur (blandio vs blandior)
caedo (caecidi vs cacidi)
cape (here we have some garbage in one of the hdwds "capio, capere, additional, forms")
clave, claves, clavis
comedo (comessus vs comestus vs comesus)
como
commoraris (commoro vs commoror)
congredior
contemplur (contemplo vs contemplor)
coque (coquos vs coquus)
criminati (crimino vs criminor)
cunctor (cunctor; cunctari; cunctatus vs cuncto; cunctare; cunctavi; cunctatus)
desino (desino; desinere; desivi; desitus vs desino; desinare; desavi; desatus)
duco (some garbage in one of the hdwds "duco; ducere; additional; forms")
edo (essus vs esus)
emere (emereo vs emereor)
excurro (excurro; excurrere; excucurri; excursus vs excurro; excurrere; excurri; excursus)
felem (felis vs feles)
grammaticae
ibis
imitandum (imito/imitor)
industrius (industriior vs industrior)
inferus
insuper
iocari (joco vs jocor)
itinera (itiner vs itiner)
lacrimante (lacrimo vs lacrimor)
lactis (lac vs lact)
lamentari (lamento/lamentor)
latrina (latrina/latrinum)
lavo (lavatus v lautos v lotus)
merendam (mereo vs mereor)
mille (millis vs milis)
misereror (misereo vs misereror vs miseret)
obsonatum (obsono vs obsonor)
odi (odeo vs odio)
ostendere (ostendo vs ostendeo)
pantheum (vs pantheom)
physicae (a big mess)
poto (potatus vs potus)
pradium (prandii vs prandi(i))
prodito (prodo vs prodeo)
promo (prompsi vs promsi)
pungo (pupugi vs pepgui)
quasi
salit (salo vs saleo)
scio (scivi vs scivi(ii))
scrutari (scruto vs scrutor)
septimia (septim vs septem)
sicut (adv vs conjunction)
spondeo (spopondi vs spepondi)
tueor (tuitus vs tutus)
vello (volsi vs velli)

balmas commented 6 years ago

Looking more closely at the Whitaker's output, it seems that most, if not all, of these are due to differing spellings of the principal parts. So we can apply the same meaning to all of them. I guess we need to allow for multiple variations on spellings of principal parts, aggregated in one entry. E.g. here is how we treated it in V1:

[screenshot from 2018-08-13 09-17-55]

And what we are currently doing in V2:

[screenshot from 2018-08-13 09-10-41]

balmas commented 6 years ago

started work on this. Still to be done: when aggregating lemmas for a lexeme, make sure the lemma that is assigned as the primary lemma is the most frequent one.
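A minimal sketch of that remaining step, not the actual morph-client code. It assumes each lemma variant carries a numeric frequency score where higher means more frequent; the `Lemma` class, its field names, and the scores are hypothetical, and the real wordsxml output may encode frequency differently.

```python
from dataclasses import dataclass


@dataclass
class Lemma:
    """Hypothetical lemma variant; field names are illustrative only."""
    word: str
    principal_parts: tuple
    frequency: int  # assumption: higher means more frequent


def pick_primary(lemmas):
    """Return (primary, alternates) for one lexeme.

    The primary lemma is the one with the highest frequency score.
    max() returns the first maximal element, so ties keep parser order.
    """
    if not lemmas:
        raise ValueError("a lexeme needs at least one lemma")
    primary = max(lemmas, key=lambda lemma: lemma.frequency)
    alternates = [lemma for lemma in lemmas if lemma is not primary]
    return primary, alternates


# Frequency scores below are invented for the example.
variants = [
    Lemma("spondeo", ("spondeo", "spondere", "spepondi", "sponsus"), 1),
    Lemma("spondeo", ("spondeo", "spondere", "spopondi", "sponsus"), 4),
]
primary, alternates = pick_primary(variants)
print(primary.principal_parts[2], len(alternates))
```

Keeping the alternates alongside the primary lemma means the less frequent spellings can still be displayed, as in the V1 treatment.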

balmas commented 6 years ago

@monzug you can test this with the build in https://github.com/alpheios-project/webextension/tree/issues-whitakers-engine

balmas commented 6 years ago

@monzug this can also now be tested in https://github.com/alpheios-project/webextension/tree/qa-2.0.3-3

monzug commented 5 years ago

Tested in Chrome in build 2.0.3-5. Some of the above examples have been fixed, such as aberis or pungo. Others (mille, poto, clavis, coque) could still be merged, as they look like alternative spellings. Others (apta, cape, desino) have different meanings or different conjugations, so they look ok to me.

monzug commented 5 years ago

Bridget, giving back to you.

balmas commented 5 years ago

yeah, this fix only fixes some of the scenarios. I wasn't sure if all of the words listed above fell into this category. Most do; as you have noted, some do not. There are issues on the morphsvc which describe some of the other scenarios I found:
https://github.com/alpheios-project/morphsvc/issues/4
https://github.com/alpheios-project/morphsvc/issues/6
https://github.com/alpheios-project/morphsvc/issues/7
https://github.com/alpheios-project/morphsvc/issues/8
https://github.com/alpheios-project/morphsvc/issues/9
https://github.com/alpheios-project/morphsvc/issues/10

Some of these may be problems with the original Whitaker's source code, and some are problems with our wordsxml wrapper on top of it. This fix addresses the scenario where our wordsxml wrapper puts more than one dict entry in a single lexical entry element, gives a single mean, and the only differences between the dict entries are in the principal parts, source, age and/or frequency.

There is only so much normalization I can (and really should) do on the client side here. We will have to decide if we are going to open up the old Ada code or find a new parser to fix all of them.
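For illustration, the scenario this fix addresses could be sketched roughly as follows. This is not the actual client code: the dictionary keys (`hdwd`, `pofs`, `conj`, and so on), the function names, and the frequency values are assumptions made for the example. Entries are merged only when they agree on everything except principal parts, source, age and frequency; otherwise the sketch declines to merge and returns `None`.

```python
# Fields allowed to differ between spelling variants of one entry.
IGNORABLE = {"principal_parts", "source", "age", "frequency"}


def core(entry):
    """Fields that must match exactly before two entries may be merged."""
    return {k: v for k, v in entry.items() if k not in IGNORABLE}


def merge_spelling_variants(dict_entries, mean):
    """Merge dict entries sharing one mean, or return None if unsafe.

    A merge is only attempted when every entry has the same core fields.
    The merged entry lists each distinct set of principal parts once.
    """
    if not dict_entries:
        return None
    shared = core(dict_entries[0])
    if any(core(e) != shared for e in dict_entries[1:]):
        return None
    variants = []
    for e in dict_entries:
        if e["principal_parts"] not in variants:
            variants.append(e["principal_parts"])
    return {**shared, "principal_parts": variants, "mean": mean}


# Principal parts come from the test cases; other values are illustrative.
excurro = [
    {"hdwd": "excurro", "pofs": "verb", "conj": "3rd", "frequency": "C",
     "principal_parts": ("excurro", "excurrere", "excucurri", "excursus")},
    {"hdwd": "excurro", "pofs": "verb", "conj": "3rd", "frequency": "D",
     "principal_parts": ("excurro", "excurrere", "excurri", "excursus")},
]
desino = [
    {"hdwd": "desino", "pofs": "verb", "conj": "3rd", "frequency": "C",
     "principal_parts": ("desino", "desinere", "desivi", "desitus")},
    {"hdwd": "desino", "pofs": "verb", "conj": "1st", "frequency": "D",
     "principal_parts": ("desino", "desinare", "desavi", "desatus")},
]
merged = merge_spelling_variants(excurro, "run out")
unmerged = merge_spelling_variants(desino, "stop, cease")
print(len(merged["principal_parts"]), unmerged)
```

Returning `None` for the desino case leaves the harder decisions, such as entries that differ in conjugation, to the parser rather than the client.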

monzug commented 5 years ago

Let me know if you want the list of which words have been fixed, which ones don't look like they could be fixed, and which ones might be merged.

balmas commented 5 years ago

yes that would be great. thanks!

monzug commented 5 years ago

Here we are. I added a number next to each word:
1) fixed
2) can be fixed
3) different conjugation or meaning or other, does not need to be fixed

aberis 1
adero 1
adjuvo (adiuvabo, adiuvante,...) 1
alo ('alitus' vs 'altus' in principal parts) 1
amicio ('amixi' vs 'amicui' in principal parts) 1
apta 3
auxilio 3
beatricem (trico vs tricor) 3
blandiatur (blandio vs blandior) 3
caedo (caecidi vs cacidi) 1
cape (here we have some garbage in one of the hdwds "capio, capere, additional, forms") 3
clave, claves, clavis 2
comedo (comessus vs comestus vs comesus) 1
como 1
commoraris (commoro vs commoror) 2
congredior 3
contemplur (contemplo vs contemplor) 3
coque (coquos vs coquus) 2
criminati (crimino vs criminor) 3
cunctor (cuncto, cunctari; cunctatus vs cuncto; cunctare; cunctavi; cunctatus) 3
desino (desino; desinere; desivi; desitus vs desino; desinare; desavi; desatus) 3
duco (some garbage in one of the hdwds "duco; ducere; additional; forms") 1
edo (essus vs esus) 1
emere (emereo vs emereor) 1
excurro (excurro; excurrere; excucurri; excursus vs excurro; excurrere; excurri; excursus) 1
felem (felis vs feles) 2
grammaticae 2
ibis 2
imitandum (imito/imitor) 1
industrius (industriior vs industrior) 2
inferus 3
insuper 2
iocari (joco vs jocor) 1
itinera (itiner vs itiner) 1
lacrimante (lacrimo vs lacrimor) 1
lactis (lac vs lact) 1
lamentari (lamento/lamentor) 1
latrina (latrina/latrinum) 3
lavo (lavatus v lautos v lotus) 1
merendam (mereo vs mereor) 1
mille (millis vs milis) 2
misereror (misereo vs misereror vs miseret) 2
obsonatum (obsono vs obsonor) 1
odi (odeo vs odio) 3
ostendere (ostendo vs ostendeo) 1
pantheum (vs pantheom) 2
physicae (a big mess) 2 ---> still a big mess
poto (potatus vs potus) 2
pradium (prandii vs prandi(i)) 2
prodito (prodo vs prodeo) 2
promo (prompsi vs promsi) 1
pungo (pupugi vs pepgui) 1
quasi 2
salit (salo vs saleo) 2
scio (scivi vs scivi(ii)) 3
scrutari (scruto vs scrutor) 3
septimia (septim vs septem) 3
sicut (adv vs conjunction) 2
spondeo (spopondi vs spepondi) 1
tueor (tuitus vs tutus) 1
vello (volsi vs velli) 1

I am going to add one more:
reperio vs repperio 2

balmas commented 5 years ago

Thank you! Have split the 2s off into a new issue at https://github.com/alpheios-project/morphsvc/issues/12