UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

Addition of FST analysis as part of entries in importjson does not completely reproduce previous behaviour #122

Open fbanados opened 4 days ago

fbanados commented 4 days ago

(Was "Search regression: my cats / my dogs", but that behaviour has been fixed. Keeping the issue for the major source of inconsistencies that caused the previously observable bug. See discussion after https://github.com/UAlbertaALTLab/crk-db/issues/122#issuecomment-2211186590)

There is some issue (likely associated with the English Phrase FST not adding an +A tag) that prevents the dev version from providing an inflected form when searching for "my cats" or "my dogs". However, the FST behaviours are equivalent, so a different cause for the failure must be identified to make the problem reproducible. Needs fixing.

fbanados commented 4 days ago

The cause is that entries in importjson should include an analysis and currently do not. Migrating issue to crk-db.

fbanados commented 4 days ago

e.g. entry for cats should have:

{ "analysis": [ [], "minôs", [ "+N", "+A", "+Sg" ] ], ...

Currently, there are 7247 entries that should have an analysis and do not. Examples: ['oski-kinosêw', 'pwâkamowin', 'iskwâsam', 'ocipwêw', 'kîmîwin', 'pîhtwâkan', 'namêpîsis', 'miskîsik-maskihkiy', 'mihtot', 'matokahp', 'kaskikwâsopaýihcikan', 'macânês', 'wâýicihcêw', 'asinîwiýâkan', 'miskâcis', 'kwayaskosîhowin', 'nawatahikêwin', 'tipahamâkêstamâkêwin', 'côhkâp', 'sâpostawisiwin'].

Also, there are 4834 entries that did not have an analysis before and now do. Examples: ['kîhkâtêyihtâkwan', 'yâyikisâwâtêw', 'kakêhtawêyihtam', 'nîkânipayîstâkêw', 'otamêyihtâkwan', 'pîcicipayiw', 'nanwêyacimiwêw', 'kwayaskopayihêw', 'otânisihkâwêw', 'pahpawipayihow', 'misamêw', 'miyâmâc', 'nôtiniwêw', 'wiyê', 'âyîtahiwêw', 'pakosêyimow', 'iyinito-pahkwêsikan', 'namôya cî', 'kitimâkêyihtowak', 'atâmêyimowin'].
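
The two counts above can be reproduced by diffing an old and a new importjson dump. A minimal sketch, assuming each dump is a JSON list of entry objects with a `head` field and an optional `analysis` field; the `head` key and the helper names are assumptions for illustration and not taken from crk-db, and the second toy entry carries a made-up tag:

```python
import json


def heads_with_analysis(entries):
    """Return the set of heads whose entry carries a non-empty analysis."""
    return {e["head"] for e in entries if e.get("analysis")}


def analysis_diff(old_entries, new_entries):
    """Compare two importjson-style entry lists.

    Returns (lost, gained): heads with an analysis only in the old dump,
    and heads with an analysis only in the new dump.
    """
    old = heads_with_analysis(old_entries)
    new = heads_with_analysis(new_entries)
    return sorted(old - new), sorted(new - old)


# Toy data only: the first analysis is the one quoted in this issue,
# the second uses a placeholder tag and is not real FST output.
old_dump = json.loads("""[
  {"head": "minôs", "analysis": [[], "minôs", ["+N", "+A", "+Sg"]]},
  {"head": "wiyê"}
]""")
new_dump = json.loads("""[
  {"head": "minôs"},
  {"head": "wiyê", "analysis": [[], "wiyê", ["+Placeholder"]]}
]""")

lost, gained = analysis_diff(old_dump, new_dump)
print(len(lost), "lost:", lost)
print(len(gained), "gained:", gained)
```

Because the comparison is keyed on the head alone, homographs collapse into one item; a real comparison would probably need a more specific key.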

fbanados commented 4 days ago

There will be several notes to add about these examples, but for starters, we should add analyses to all entries with a +A suffix tag.

This reduces to an issue with POS tag matching, again.

Therefore, the first step here is to document and implement an appropriate linguist-based approach for matching POS tags between dictionaries and FSTs. There has been considerable discussion about this (some of it in emails), which will be added to the appropriate (new) issue.

fbanados commented 4 days ago

After matching against the referenced https://github.com/giellalt/lang-crk/blob/main/tools/shellscripts/add-explicit-fields-to-crkeng.sh, issues with +A suffix are solved. However, there is still work remaining:

fbanados commented 1 day ago

To fix the regression, after ensuring that the analysis includes +A, the docker container must be restarted; otherwise search results are not properly sorted (that is, cosine_vector_distance is null instead of 0.0 in some cases, leading to incorrect results). That is a separate bug.

fbanados commented 1 day ago

Most missing entries were Ipc, caused by a buggy comparison where IPC != Ipc. The 30 left over are issues with the FST that require linguistic feedback. They are mostly heads that are no longer recognized by the strict FST ('nipâskâkow', 'mac-âyiwiwin', 'mac-âtocikêw', 'oski-ôsi', 'mac-âcimoskiw', 'mac-âcimowin', 'mac-âcimow', 'osk-âya', 'osk-âyi', 'mac-âyiwiw', 'mihtos', 'osk-âyisis', 'mac-âcimiwêw', 'mistiko-mahkahkos', 'waskway-ôsi', 'mistahi-ôsi', 'mac-âyisiwiw', 'nipêskâkow', 'môhkocikêwikamik', 'pîhtawêwayiwinisa', 'wâsitêpimâkanihkêw', 'mac-âcimêw', 'âpihtawakimâw', 'mêstakimâw', 'mac-âyisiw', 'iskotêw-ôsi', 'osk-âyis') or heads where the strict analysis produces multiple equal analyses ('akik', 'okosisimâw'). The entry in CW for kôhkomipaninaw has a different POS than the one produced by the FST analysis. It seems that none of these entry differences have an impact on the presentation in the dictionary, but this should be double-checked by a linguist.
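
The `IPC != Ipc` mismatch is the kind of bug a case-folding comparison avoids. A minimal sketch, assuming the dictionary side stores the word class as a plain string such as `IPC` and the FST side emits tags such as `+Ipc`; `normalize_pos` and `pos_matches` are hypothetical helpers, not the crk-db implementation:

```python
def normalize_pos(label: str) -> str:
    """Strip a leading '+' and fold case so 'IPC', 'Ipc' and '+Ipc' compare equal."""
    return label.lstrip("+").casefold()


def pos_matches(dictionary_pos: str, fst_tags: list[str]) -> bool:
    """Return True if any FST tag equals the dictionary POS after normalisation."""
    wanted = normalize_pos(dictionary_pos)
    return any(normalize_pos(tag) == wanted for tag in fst_tags)


print(pos_matches("IPC", ["+Ipc"]))               # case difference is ignored
print(pos_matches("IPC", ["+N", "+A", "+Sg"]))    # genuinely different class
```

This only covers differences in letter case. Word classes such as `NI-1` do not correspond to a single FST tag, so they still need the linguist-based mapping discussed earlier in this thread.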

aarppe commented 1 day ago

Some analyses:

less crk/Wolvengrey_altlab.toolbox | gawk 'BEGIN { FS="\n"; RS=""; } $0 ~ /ps VTA/ && $0 ~ /gr1[^\n]+(inanimate actor)/ { print $1, $4; }'
\sro akâwêýihtamihikow \def s/he is bothered by a promise s/he made to do s.t.
\sro astâhikow \def it frightens s.o.; it causes s.o. to be wary, it worries s.o.
\sro câhcâmoskâkow \def s/he is made to sneeze by s.t., it makes her sneeze
\sro kipêýihtamiskâkow \def s/he overeats and feels badly, it (e.g. food) has the effect of making him/her feel bad
\sro kisiwaskatêskâkow \def it gives s.o. a stomach ache or indigestion
\sro kîskwêpêskâkow \def it makes s.o. drunk
\sro kîsposkâkow \def it filled s.o. up, it was a filling meal for s.o.
\sro mâýiskâkow \def it affects s.o. badly, it has an adverse effect on s.o.; it makes s.o. ill, it makes s.o. react allergically
\sro miýoskâkow \def it goes through s.o.'s body with good affect, it does s.o. good (e.g. animate food as actor); it fits s.o. well (e.g. pants)
\sro nanâtawiskâkow \def it has a healing effect on him/her
\sro nipâskâkow \def it makes s.o. sleep
\sro nipêskâkow \def it makes s.o. sleep
\sro paspinatikow \def s/he has a narrow escape, s.t. just misses him/her
\sro pêkatêskâkow \def it makes him/her belch, burp
\sro piscipôskâkow \def it poisons him/her
\sro sâposkâkow \def it goes through s.o., it enters s.o.'s body; it purges s.o.
\sro tawipaýihikow \def s/he has time

aarppe commented 1 day ago

As for many of the other elements such as mac-âyiwiwin, they are a case where there is orthographical variation at the preverb/prenoun-stem junction, based on reduction in speech. The full form would be maci-âyiwiwin, but because the stem starts with a vowel the preverb-final -i- is often dropped.
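
To illustrate the junction variation just described, a reduced spelling can be mapped back to a full form with a small lookup. This is a sketch only: the table `FULL_FORMS` is hypothetical and lists just two reductions, `mac-` for `maci-` (stated above) and `osk-` for `oski-` (an assumption inferred from heads listed earlier in this thread). Which variant should count as the standard form remains an open question.

```python
# Hypothetical lookup of reduced preverb/prenoun spellings to full forms.
FULL_FORMS = {"mac": "maci", "osk": "oski"}
VOWELS = "aâeêiîoô"


def restore_junction(head: str) -> str:
    """Restore the dropped final -i- of a known preverb before a vowel-initial stem.

    Heads without a hyphen, with an unknown first element, or with a
    consonant-initial stem are returned unchanged.
    """
    prefix, sep, stem = head.partition("-")
    if sep and stem and stem[0] in VOWELS and prefix in FULL_FORMS:
        return f"{FULL_FORMS[prefix]}-{stem}"
    return head


for head in ["mac-âyiwiwin", "osk-âya", "oski-ôsi", "iskotêw-ôsi"]:
    print(head, "->", restore_junction(head))
```

Restricting the rewrite to an explicit table keeps heads such as iskotêw-ôsi untouched, since their first element is not a reduced preverb.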

We started a discussion with Arok about how to deal with these forms. One would be inclined to choose one variant as the more standard form and then accept the other variants, rather than creating two FST lemmas (which is what happens if one enumerates both in the LEXC file for stems). Currently these are sort of caught by the script, in that the \fststem field is marked, cf.


\sro mac-âyiwiwin
...
\ps NI-1
\def being bad, being mean, being wicked; doing evil; having a bad temper; being a dangerous being
\stm maci-ayiwiwin-
\fststem CHECK:maci-ayiwiwinw?- OLD:mac-âyiwiwin
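
Records flagged this way can be listed by scanning for the `CHECK:` marker. A minimal sketch, assuming records are separated by blank lines and each field sits on its own line starting with a backslash marker, as in the excerpt above; `parse_record` and `flagged_records` are hypothetical helpers, and the second sample record is made up for contrast:

```python
import re


def parse_record(block: str) -> dict:
    """Parse one Toolbox-style record into {marker: value}, keeping the first value per marker."""
    fields = {}
    for line in block.splitlines():
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            fields.setdefault(marker, value.strip())
    return fields


def flagged_records(text: str):
    """Yield (sro, fststem) for every record whose \\fststem value starts with 'CHECK:'."""
    for block in re.split(r"\n\s*\n", text):
        fields = parse_record(block)
        if fields.get("fststem", "").startswith("CHECK:"):
            yield fields.get("sro"), fields["fststem"]


# First record is abridged from the excerpt above; the second is invented.
sample = r"""\sro mac-âyiwiwin
\ps NI-1
\stm maci-ayiwiwin-
\fststem CHECK:maci-ayiwiwinw?- OLD:mac-âyiwiwin

\sro minôs
\fststem minôs"""

for sro, fststem in flagged_records(sample):
    print(sro, fststem)
```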