Open aarppe opened 1 week ago
although \lemma field should suffice, preferred approach is to use the FST analyzer
This bug reflects a problem at the crk-db level. Migrating the issue.
Aggregation is not detecting that the entry provided by the FST matches the entry in the database. This is because the FST generates the analyses mitâs+N+I+D+Px1Sg+Sg and mîpit+N+I+D+Px1Sg+Sg, respectively, while the Wolvengrey entries for mitâs and mîpit both have \ps NDI-1. Because the merging step compares strings directly, it fails to detect that NID should be considered equal to NDI.
I would assume it is highly likely that these small ordering differences in word class codes will remain or reappear between sources, so I'm changing the comparison code to check for permutations at the subclass level. Because we are already comparing constant-length strings at this point, it should not add meaningful overhead. An alternative would be to ensure that all sources always follow the same ordering convention, but I think making the importjson generation more resilient is the better approach.
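As a rough illustration of the permutation check described above, here is a minimal sketch. The function names are hypothetical and this is not the actual isPOSMatch code in crk-db; it also assumes that the word class can be read off the single-letter FST tags and that a subclass suffix such as -1 can be ignored for the comparison.

```javascript
// Hypothetical helpers for illustration only; not the crk-db implementation.

// Strip a subclass suffix from a toolbox \ps value: "NDI-1" -> "NDI".
function baseClass(ps) {
  return ps.split('-')[0];
}

// Build a class code from the single-letter FST tags (an assumption):
// "mitâs+N+I+D+Px1Sg+Sg" -> "NID".
function classFromAnalysis(analysis) {
  return analysis
    .split('+')
    .slice(1) // drop the lemma
    .filter((tag) => /^[A-Z]$/.test(tag))
    .join('');
}

// Compare two class codes letter by letter, ignoring order.
function sameClassIgnoringOrder(a, b) {
  const key = (code) => [...code].sort().join('');
  return a.length === b.length && key(a) === key(b);
}

const fstClass = classFromAnalysis('mitâs+N+I+D+Px1Sg+Sg');
const matches = sameClassIgnoringOrder(fstClass, baseClass('NDI-1'));
```

With this comparison, NID and NDI are treated as equal, while NDA and NDI are still kept apart.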
This also requires a new importjson, so I'll restart the import mentioned in UAlbertaALTLab/morphodict#1178, which was about 50% done.
I was thinking about the same thing: there can be little discrepancies, and while we could fix this in the FST, the morphodict code, or the database, we'd like a language-independent solution that will also work for non-Algonquian languages like Tsuut'ina.
In this respect, what is the current requirement for establishing that an entry is an inflected form of another entry? That is, how is the FST analysis parsed in this respect?
I'm actually wondering if we should turn this into a linguist problem, but I'm not fully certain. That is, we might want a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out, e.g. {+N, +A, +D} --> NDA.
Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.
Also, this is an artifact of our treating NA and NDA as more similar than NDA and NDI in the computational modeling.
> I'm actually wondering if we should turn this into a linguist problem, but I'm not fully certain. That is, we might want a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out, e.g. {+N, +A, +D} --> NDA.
Either would work, but the fundamental problem is whether order is truly necessary for the analysis information (that is, whether it should be a list at all or a set instead).
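To make the set-versus-list point concrete, here is a small sketch of a linguist-defined mapping that treats the FST tags as a set. Only the {+N, +A, +D} --> NDA row comes from the discussion; the other rows, the table name, and the function name are assumptions added for illustration.

```javascript
// Hypothetical linguist-maintained table. Only the NDA row is taken from the
// discussion; the NDI and NA rows are illustrative assumptions.
const TAG_SETS_TO_POS = [
  { tags: ['+N', '+A', '+D'], pos: 'NDA' },
  { tags: ['+N', '+I', '+D'], pos: 'NDI' },
  { tags: ['+N', '+A'], pos: 'NA' },
];

// Return the POS class of the most specific row whose tags all occur in the
// analysis, regardless of the order in which the FST emitted them.
function posFromTags(analysisTags) {
  const present = new Set(analysisTags);
  let best = null;
  for (const row of TAG_SETS_TO_POS) {
    const allPresent = row.tags.every((tag) => present.has(tag));
    if (allPresent && (best === null || row.tags.length > best.tags.length)) {
      best = row;
    }
  }
  return best === null ? null : best.pos;
}

const pos = posFromTags(['+N', '+I', '+D', '+Px1Sg', '+Sg']);
```

Because membership is checked against a set, +N+I+D and +N+D+I resolve to the same class, and the ordering question disappears from the comparison.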
> Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.
The current comparison is done in the isPOSMatch method https://github.com/UAlbertaALTLab/crk-db/blob/aecd/lib/aggregate/index.js. Changing this ordering fixes nîpit.
But it did not fix nitâs. nitâs is a different test case: the key difference is that nitâs has multiple entries in the dictionary (@ndi and @nda). Previously, addFormOf gave up when there were multiple candidates. I've changed the code to try to find a unique match based on the category. The change for nitâs is independent of the decision to make this a linguist problem, because it was an issue at the mapping level, which happens in a separate pass after the FST information has been collected and added to all entries.
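The idea of looking for a unique match among several candidates can be sketched as follows. This is not the actual addFormOf code: the entry shape, the function name, and the example \ps values are assumptions made for illustration.

```javascript
// Hypothetical entry shape: { head: 'mitâs', pos: 'NDI-1' }.

// Order-insensitive key for a class code, ignoring any subclass suffix.
function classKey(code) {
  return [...code.split('-')[0]].sort().join('');
}

// Return the single candidate whose class matches the class derived from the
// FST analysis, or null when there is no match or the match is ambiguous.
function findUniqueCandidate(candidates, analysisClass) {
  const wanted = classKey(analysisClass);
  const matching = candidates.filter((entry) => classKey(entry.pos) === wanted);
  return matching.length === 1 ? matching[0] : null;
}

// Two homographic lemma entries; the \ps values are illustrative.
const candidates = [
  { head: 'mitâs', pos: 'NDI-1' },
  { head: 'mitâs', pos: 'NDA-1' },
];
const chosen = findUniqueCandidate(candidates, 'NID');
```

Returning null for both the empty and the ambiguous case keeps the earlier give-up behaviour as the fallback, so a form is only linked when exactly one candidate fits.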
Updated the importjson on the dev branch of itwêwina to compare. For example, see:
https://itwewina.altlab.dev/search?q=nîminâniwan
https://itwewina.altlab.dev/search?q=nîpit
https://itwewina.altlab.dev/search?q=nitâs
Currently going through the list to ensure that all entries with a lemma are added as wordforms. It seems that this is still not the case.
There are several (different) observable causes for this behaviour after checking the \lemma cases previously discussed.
In general, it looks like crk-db is relying on the strict FST and ignoring annotations from Wolvengrey.
- kôhtâwînaw shows that multiple definitions appearing in the same toolbox entry are not merged. This may be expected behaviour, but perhaps multiple \def entries should be merged into the same entry, not just the ones separated by a semicolon (;). That is a linguist decision.
- Limitations of the FST are manifesting as well:
  - ý characters are rejected: entries like aýwêpinâniwan in Wolvengrey are only accepted by the relaxed FST. The solution is either to remove ý before analyzing, or to change the FST to accept ý.
  - mêscakâs and mêstakay.
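The first of the two options for ý, removing it before analysis, might look like the sketch below. It assumes that "removing ý" means replacing it with a plain y before the form is handed to the strict FST; the function name is hypothetical.

```javascript
// Replace ý with plain y before analysis (an assumed reading of "remove ý").
// NFC normalization first folds a decomposed "y + combining acute" sequence
// into the single precomposed character so that both spellings are caught.
function stripYAcute(form) {
  return form
    .normalize('NFC')
    .replace(/\u00FD/g, 'y') // ý
    .replace(/\u00DD/g, 'Y'); // Ý
}

const analyzable = stripYAcute('aýwêpinâniwan');
```

The original spelling would still need to be kept on the entry itself, since only the string passed to the analyzer is changed here.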
The most likely solution would be to match first against the toolbox \lemma and, only if that is not available, fall back to the FST. I would also expect a report on the differences (say, that the FST generates a different lemma than the toolbox entry, or that the FST rejects an entry included in the dictionary) to be useful for debugging and for guiding linguist decisions (e.g., deciding whether those are bugs in the toolbox file or in the FST, limitations of the model that need updating, etc.).
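A minimal sketch of this lemma-first strategy with an accompanying difference report is given below. The entry shape, the function names, and the discrepancy labels are assumptions, and the analyzer is a stand-in stub rather than the real FST.

```javascript
// `analyze` stands in for the FST: it returns a lemma string, or null when
// the form is rejected. The entry shape { head, lemma } is an assumption.
function resolveLemma(entry, analyze) {
  const fstLemma = analyze(entry.head);
  const hasToolboxLemma = entry.lemma != null;

  // Prefer the toolbox \lemma; fall back to the FST only when it is absent.
  const lemma = hasToolboxLemma ? entry.lemma : fstLemma;
  let source = null;
  if (hasToolboxLemma) source = 'toolbox';
  else if (fstLemma !== null) source = 'fst';

  // Record the two kinds of difference mentioned in the discussion.
  let discrepancy = null;
  if (fstLemma === null) discrepancy = 'fst-rejects-form';
  else if (hasToolboxLemma && entry.lemma !== fstLemma) discrepancy = 'lemma-mismatch';

  return { head: entry.head, lemma, source, discrepancy };
}

// Keep only the entries a linguist would want to review.
function discrepancyReport(entries, analyze) {
  return entries
    .map((entry) => resolveLemma(entry, analyze))
    .filter((result) => result.discrepancy !== null);
}

// Stub analyzer for illustration; anything not listed counts as rejected.
const known = new Map([['nîminâniwan', 'nîmiw'], ['nitâs', 'mitâs']]);
const stubAnalyze = (form) => known.get(form) ?? null;

const report = discrepancyReport(
  [{ head: 'nîminâniwan' }, { head: 'nitâs', lemma: 'mitâs' }, { head: 'aýwêpinâniwan' }],
  stubAnalyze
);
```

Keeping the report separate from the resolution step means the import can proceed with the toolbox value while the differences are collected for later review.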
Implementing the change to rely on \lemma has the following impact:
- +Px12Pl to +Px1Sg, e.g. kikâwînaw form of nikâwiy
- +Px1Sg to +PxX, e.g. nacâs form of macâs
There was an agreement to implement a linguist-provided approach to override the lemma, and to use the FST as a backup. Ideally, crk-db would also have an option to avoid the FST altogether.
Issues with ý entries should be handled by the FST, so discussion about those is to be continued at #115.
Entries that are inflected word-forms of other entries, e.g. nîminâniwan and nitâs, should not get their independent entries in morphodict, but should rather become formof cases. This works for nîminâniwan (--> nîmiw) but not for nitâs (--> mitâs).
When creating the importjson version of the dictionary content, this should either be recognized by the analyzing FST, or then via the \lemma field in the *.toolbox source. See:
Correct behavior
Incorrect behavior (the first two entry blocks) vs. partially correct behavior (the next two entry blocks, though the inflected word-form should show the definition from the dictionary)
Based on the presence of \lemma fields, there are at least 170 cases, and there might be more based on the FST scrutiny.
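A count like the one above could be reproduced with a small script along the following lines. This is only a sketch: it assumes that records in the *.toolbox source are separated by blank lines and that each field starts its line with a backslash marker. The function name and the \lx marker in the sample are placeholders.

```javascript
// Count the records that contain a given field marker (e.g. "lemma").
// Assumes blank-line-separated records and a marker made of plain letters.
function countRecordsWithField(toolboxText, marker) {
  const records = toolboxText
    .split(/\r?\n\s*\r?\n/)
    .filter((record) => record.trim() !== '');
  // Matches a line starting with "\<marker>" followed by whitespace or end of line.
  const fieldLine = new RegExp(`^\\\\${marker}(\\s|$)`, 'm');
  return records.filter((record) => fieldLine.test(record)).length;
}

// Tiny made-up sample; \lx is a placeholder head-word marker.
const sample = [
  '\\lx nitâs',
  '\\ps NDI-1',
  '\\lemma mitâs',
  '',
  '\\lx mitâs',
  '\\ps NDI-1',
].join('\n');

const withLemma = countRecordsWithField(sample, 'lemma');
```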