Content update: inflected word-form entries in dictionaries should not receive independent morphodict entries #119

UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database

https://itwewina.altlab.app/

GNU General Public License v3.0

Open aarppe opened 1 week ago

aarppe commented 1 week ago

Entries that are inflected word-forms of other entries, e.g. nîminâniwan and nitâs, should not get their own independent entries in morphodict; they should instead become formof cases.

This works for nîminâniwan (--> nîmiw) but not for nitâs (--> mitâs).

When creating the importjson version of the dictionary content, this should be recognized either by the analyzing FST or via the \lemma field in the *.toolbox source. See:

Correct behavior
Incorrect behavior (the first two entry blocks) vs. partially correct behavior (the next two entry blocks, though the inflected word-form should show the definition from the dictionary)

Based on the presence of \lemma fields, there are at least 170 cases, and there might be more based on the FST scrutiny.

gawk 'BEGIN { FS="\n"; RS=""; } { for(i=1; i<=NF; i++) if(index($i,"\\lemma")!=0) print $1, $i; }' crk/dicts/Wolvengrey_altlab.toolbox | wc -l
     170
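
For readers who prefer not to use gawk, the same count can be sketched in Python. This is only an illustration of what the one-liner does: it assumes records are separated by blank lines, and the `\hw` marker and the sample record contents are invented for the example (only `\lemma` and `\ps` come from this thread).

```python
import re

def lemma_lines(toolbox_text):
    """Return (first_line, lemma_line) pairs, mirroring the gawk one-liner."""
    pairs = []
    # Records are separated by one or more empty lines (gawk's RS="").
    for record in re.split(r"\n{2,}", toolbox_text.strip("\n")):
        lines = record.split("\n")
        for line in lines:
            # Same test as index($i, "\\lemma") != 0 in the gawk version.
            if "\\lemma" in line:
                pairs.append((lines[0], line))
    return pairs

# Invented sample: the "\hw" marker is hypothetical, not the real field name.
SAMPLE = r"""\hw nitâs
\lemma mitâs

\hw mitâs
\ps NDI-1
"""

print(len(lemma_lines(SAMPLE)))
```

Running `len(lemma_lines(...))` over the real toolbox file should correspond to the `wc -l` figure above, assuming the file follows the layout described.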

fbanados commented 1 week ago

Although the \lemma field should suffice, the preferred approach is to use the FST analyzer.

fbanados commented 1 week ago

Another example: nîpit

fbanados commented 1 week ago

This bug reflects a problem at the crk-db level. Migrating the issue.

fbanados commented 1 week ago

Aggregation is not detecting that the entry provided by the FST matches the entry in the database. The FST generates the analyses mitâs+N+I+D+Px1Sg+Sg and mîpit+N+I+D+Px1Sg+Sg, respectively, while the Wolvengrey entries for mitâs and mîpit both have \ps NDI-1. Because the merging code does direct string comparisons, it fails to detect that NID should be considered equal to NDI.

fbanados commented 1 week ago

I would assume there's a high likelihood that these small ordering differences in word class codes will remain or reappear between sources, so I'm changing the comparison code to check for permutations at the subclass level. Because we are already checking constant-length strings at this juncture, it should not add extra overhead. An alternative approach would be to ensure that all sources always follow the same ordering convention, but I think making the importjson generation more resilient is a better approach.
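
An order-insensitive comparison of this kind could look roughly like the Python sketch below. It is not the crk-db code (which is JavaScript); the function names are hypothetical, and ignoring the part after the hyphen (the `-1` in `NDI-1`) is an assumption made only for the example.

```python
def split_ps(ps):
    """Split a code such as 'NDI-1' into ('NDI', '1'); the suffix may be empty."""
    head, _, suffix = ps.partition("-")
    return head, suffix

def same_word_class(code_a, code_b):
    """True when both codes use the same letters, in any order.

    Assumption: the numeric suffix after '-' is ignored for this comparison.
    """
    head_a, _ = split_ps(code_a)
    head_b, _ = split_ps(code_b)
    # Sorting the letters treats each code as a multiset rather than a sequence.
    return sorted(head_a) == sorted(head_b)

print(same_word_class("NID", "NDI-1"))
print(same_word_class("NDA", "NDI"))
```

Because the codes are only a few characters long, sorting them costs next to nothing compared with the direct string comparison it replaces.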

fbanados commented 1 week ago

Also this requires a new importjson, so I'll restart the import mentioned in UAlbertaALTLab/morphodict#1178, which was about 50% done.

aarppe commented 1 week ago

I was thinking about the same thing: there can be small discrepancies, and while we could fix this in the FST, the morphodict code, or the database, we'd like a language-independent solution that will also work for non-Algonquian languages like Tsuut'ina.

In this respect, what is the current requirement for establishing that an entry is an inflected form of another entry? That is, how is the FST analysis parsed in this respect?

aarppe commented 1 week ago

I'm actually wondering if we should turn this into a linguist problem, but not fully certain. In that we might want to have a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out. I.e./E.g. {+N, +A, +D} --> NDA.

Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.
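
The linguist-defined mapping suggested above could be expressed as a small lookup table keyed by tag sets. The Python sketch below is only an illustration: the table contents beyond `{+N, +A, +D} --> NDA` are assumed from codes mentioned in this thread, the names are hypothetical, and it assumes the lemma comes first in the analysis string.

```python
# Hypothetical linguist-maintained table: unordered FST tag sets -> POS codes.
TAG_SETS_TO_POS = {
    frozenset({"+N", "+A", "+D"}): "NDA",
    frozenset({"+N", "+I", "+D"}): "NDI",
    frozenset({"+N", "+A"}): "NA",
}

# Every tag that takes part in word-class identification.
CLASS_TAGS = frozenset().union(*TAG_SETS_TO_POS)

def pos_from_analysis(analysis):
    """Map an analysis like 'mitâs+N+I+D+Px1Sg+Sg' to a POS code, or None."""
    _lemma, *tags = analysis.split("+")
    tag_set = {"+" + tag for tag in tags}
    # Keep only the word-class tags, so inflectional tags do not interfere.
    return TAG_SETS_TO_POS.get(frozenset(tag_set & CLASS_TAGS))

print(pos_from_analysis("mitâs+N+I+D+Px1Sg+Sg"))
```

Since the keys are sets, the order in which the FST emits `+I` and `+D` no longer matters, and adding a new language only requires a new table rather than new code.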

aarppe commented 1 week ago

Also, this is an artifact of us in the computational modeling considering NA and NDA more similar than NDA and NDI.

fbanados commented 5 days ago

> I'm actually wondering if we should turn this into a linguist problem, but not fully certain. In that we might want to have a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out. I.e./E.g. {+N, +A, +D} --> NDA.

Either would work, but the fundamental problem is whether order is truly necessary for the analysis information (that is, whether it should be a list at all or a set instead).

> Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.

The current comparison is done in the isPOSMatch method in https://github.com/UAlbertaALTLab/crk-db/blob/aecd/lib/aggregate/index.js. Changing how this comparison handles ordering fixes nîpit: [Screenshot 2024-06-28 at 12 28 25 PM]

But it did not fix nitâs. nitâs is a different test case, the key difference being that nitâs has multiple entries in the dictionary (@ndi and @nda). Previously, addFormOf gave up when there were multiple candidates. I've changed the code to attempt to find a unique match depending on the category. The change for nitâs is independent of the decision to make this a linguist problem: it was an issue at the mapping level, which happens in a separate pass after the FST information has been collected and added to all entries.
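
The idea of narrowing several candidates down to one by category can be sketched as follows. This is a Python illustration, not the actual addFormOf implementation; the entry shape (`key`, `pos`), the function names, and the sample keys are all hypothetical.

```python
def same_letters(code_a, code_b):
    """Order-insensitive comparison of the part of each code before any '-'."""
    return sorted(code_a.split("-")[0]) == sorted(code_b.split("-")[0])

def pick_lemma_entry(candidates, analysis_pos):
    """Return the single candidate whose POS matches the analysis, else None.

    Giving up (None) is kept for zero or several matches, so an ambiguous
    case is never resolved by guessing.
    """
    matches = [e for e in candidates if same_letters(e["pos"], analysis_pos)]
    return matches[0] if len(matches) == 1 else None

# Invented sample entries, loosely following the @ndi / @nda case above.
candidates = [
    {"key": "entry@ndi", "pos": "NDI-1"},
    {"key": "entry@nda", "pos": "NDA-1"},
]
print(pick_lemma_entry(candidates, "NID"))
```

With this shape, the earlier "give up on multiple candidates" behaviour only remains for cases where the category does not disambiguate.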

[Screenshot 2024-07-02 at 11 15 09 AM]

fbanados commented 4 days ago

Updated the importjson on the dev branch of itwêwina to compare. For example, see:

https://itwewina.altlab.dev/search?q=nîminâniwan
https://itwewina.altlab.dev/search?q=nîpit
https://itwewina.altlab.dev/search?q=nitâs

fbanados commented 4 days ago

Currently going through the list to ensure that all entries with a lemma are added as word-forms. It seems that this is still not the case.

fbanados commented 4 days ago

There are several (different) observable causes for this behaviour after checking the \lemma cases previously discussed. In general, it looks like crk-db is relying on the strict FST and ignoring annotations from Wolvengrey.

kôhtâwînaw shows that multiple definitions appearing in the same toolbox entry are not merged. This may be expected behaviour, but perhaps multiple \def entries should be merged into a single entry, not just the ones separated by a semicolon (;). That is a linguist decision.

Limitations of the FST are manifesting as well:

Given that ý characters are rejected, entries like aýwêpinâniwan in Wolvengrey are only accepted by the relaxed FST. The solution is either to remove ý before analyzing or to change the FST to accept ý.
Some new Wolvengrey entries are still rejected by the FST, e.g. mêscakâs and mêstakay.

The most likely solution would be to attempt to match first against the toolbox \lemma and, only if that is not available, fall back to the FST. I would also expect a report on the differences (say, that the FST generates a different lemma than the toolbox entry, or that the FST rejects an entry included in the dictionary) to be useful for debugging and for guiding linguist decisions (e.g., deciding whether those are bugs in the toolbox file or at the FST level, limitations of the model that need updating, etc.).
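
A minimal Python sketch of that lookup order and report, under stated assumptions: the entry shape (`head`, optional `lemma`), the function name, and the `analyze` callable standing in for the FST are all hypothetical, and the stub analyzer below returns made-up results rather than real FST output.

```python
def resolve_lemma(entry, analyze):
    """Prefer the toolbox lemma; fall back to the FST; collect discrepancies.

    entry:   dict with 'head' and an optional 'lemma' (from the \\lemma field).
    analyze: callable mapping a word-form to an FST lemma, or None if rejected.
    Returns (chosen_lemma_or_None, list_of_report_notes).
    """
    toolbox_lemma = entry.get("lemma")
    fst_lemma = analyze(entry["head"])
    notes = []
    if fst_lemma is None:
        notes.append("FST rejects entry")
    elif toolbox_lemma and fst_lemma != toolbox_lemma:
        notes.append(
            f"FST lemma {fst_lemma!r} differs from toolbox lemma {toolbox_lemma!r}"
        )
    # The linguist-provided lemma wins whenever it is present.
    return toolbox_lemma or fst_lemma, notes

# Stub standing in for the FST; its contents are invented for illustration.
fake_fst = {"nitâs": "mitâs", "form-x": "lemma-b"}.get

print(resolve_lemma({"head": "nitâs", "lemma": "mitâs"}, fake_fst))
print(resolve_lemma({"head": "form-x", "lemma": "lemma-a"}, fake_fst))
```

Collecting the notes for every entry would give the kind of difference report described above, which linguists could then triage as toolbox errors, FST bugs, or model limitations.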

fbanados commented 4 days ago

Implementing the change to rely on \lemma has the following impact:

86 entries from AECD stop being merged, due to an unrelated bug that must be fixed: currently crk-db gives up when there are multiple candidate mappings to merge. This should definitely be handled in a more regular fashion rather than in an ad hoc way.
Analysis of 5 entries changes from +Px12Pl to +Px1Sg, e.g. kikâwînaw form of nikâwiy
Analysis of ~100 entries changes from +Px1Sg to +PxX, e.g. nacâs form of macâs

fbanados commented 3 days ago

There was an agreement to implement a linguist-provided approach to override the lemma, and use the FST as backup. Ideally, crk-db would also have a way to avoid the FST altogether as an option.

fbanados commented 3 days ago

Issues with ý entries should be handled by the FST, so discussion of those is to be continued at #115.