UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

integrate Maskwacîs dictionary #5

Closed dwhieb closed 3 years ago

dwhieb commented 3 years ago

See Notes on the Maskwacîs Dictionary for more information about this data source.

Katie is working on a manual transcription of the MD entries and creating a canonical SRO form for each. This would help in mapping the MD entries to the CW entries, or at least to a canonical SRO representation.

We're not showing MD entries in itwêwina unless they have a match in CW (and even then, only some of them are shown, depending on how much overlap the definitions have with CW).
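
For illustration only, here is a minimal sketch of one way definition overlap could be measured: a Jaccard score over the lowercase words of two English definitions. The function names and the choice of Jaccard are assumptions for this example, not a description of what itwêwina actually does, and the thread does not state a threshold.

```python
import re


def definition_words(definition: str) -> set[str]:
    """Return the set of lowercase alphabetic tokens in an English definition."""
    return set(re.findall(r"[a-z]+", definition.lower()))


def word_overlap(md_definition: str, cw_definition: str) -> float:
    """Jaccard overlap of the word sets of two definitions (0.0 to 1.0).

    Hypothetical helper: a rough proxy for how much an MD definition
    shares with a CW definition. It only compares surface words.
    """
    md_words = definition_words(md_definition)
    cw_words = definition_words(cw_definition)
    if not md_words or not cw_words:
        return 0.0
    return len(md_words & cw_words) / len(md_words | cw_words)
```

A score like this only reflects shared surface words, so two definitions with the same meaning but different wording would still score low.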

aarppe commented 3 years ago

A few notes about the previous MD vs. CW comparison.

  1. Katie's aggregation considered not just the words in the definitions but also a more semantic comparison of the similarity of the definitions. While we can consider overlaps of the words in the definitions, there are limits to how much this can stand as a proxy for semantic similarity.

  2. Moreover, Katie's manual comparison started from the just under 6000 MD entries that we were able to analyze with the crk FST at the time, and could therefore link with CW (since the crk FST used CW as the basis for its lemmas). This left around 3000 MD entries that the crk FST could not analyze at the time (CW content has been expanded since then). Possible reasons include differences in the MD orthographic form that spelling relaxation could not overcome, lemmas/stems genuinely missing from CW, or, in some rare cases, inflectional forms missing from the crk FST at the time.

  3. In comparing MD with CRK, the combination of the MD entry and the MD pos is not sufficient, because the MD pos classification only goes to the level of verb, noun, etc., which is not enough to distinguish orthographically similar CW entries within these general word classes. Thus, we used the English definitions as unique identifiers (on both sides). This became complicated when Arok changed the English definitions for some of the CW entries: we can no longer automatically use Katie's comparison file as is when comparing against the latest version of CW. This remains a continuing challenge in comparing the content of MD vs. CW: we cannot assume that a previous manual comparison still applies if the CW English definition (i.e. semantic meaning) has been changed.

  4. In the MD TSV source, spaces are just spaces, even when separating preverbs/prefixes from stems.
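
To make the problem in point 3 concrete, here is a rough sketch of how a previous comparison keyed on (form, English definition) goes stale once a CW definition is edited. Everything here is hypothetical: the `Key` type, the function names, and the placeholder data are assumptions for illustration, not the format of Katie's comparison file.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical identifier for an entry: written form plus English definition."""
    form: str
    definition: str


def make_key(form: str, definition: str) -> Key:
    """Build a key, trimming the form and normalizing case/whitespace in the definition."""
    return Key(form.strip(), " ".join(definition.lower().split()))


def split_stale(previous_matches, current_cw_keys):
    """Split earlier (md_key, cw_key) matches into still-valid and stale ones.

    A match is stale when its CW key no longer exists in the current CW data,
    e.g. because the CW English definition was changed after the comparison.
    """
    valid, stale = [], []
    for md_key, cw_key in previous_matches:
        if cw_key in current_cw_keys:
            valid.append((md_key, cw_key))
        else:
            stale.append((md_key, cw_key))
    return valid, stale


# Placeholder data, not real dictionary content.
previous = [
    (make_key("md-form-1", "gloss one"), make_key("cw-form-1", "gloss one")),
    (make_key("md-form-2", "gloss two"), make_key("cw-form-2", "gloss two")),
]
# In the "current" CW, the definition of cw-form-2 has been edited.
current_cw = {
    make_key("cw-form-1", "gloss one"),
    make_key("cw-form-2", "gloss two, revised"),
}
valid, stale = split_stale(previous, current_cw)
```

With this placeholder data, the first match is still valid and the second is stale, so it would need to be reviewed manually again rather than reused automatically.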

dwhieb commented 3 years ago

Closed by cda303578aa6d7cb3e308d51bc2875c199e7c53c.