UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

aggregate data sources into a unified Plains Cree lexical database #1

Closed aarppe closed 3 years ago

aarppe commented 4 years ago

This is a meta-issue for tracking initial aggregation of the current dictionary data sources. Other sources may be added later, but this issue can be considered complete once the following issues are done, and we have an initial aggregation process and ALTLab-specific database in place.

To Do

Notes

(See also these Database Specification Notes on Google Drive.)

Having reviewed a small number of differences between a version of CW from this January and one from 2014, I am increasingly doubtful that manual evaluation of CW vs. MD content is realistic, given that AEW is updating CW continuously and MD content will also be expanded with the "new" words collected in the recordings.

In principle, we would need to go through every new version of CW, determine which CW entries have changed or are new, and then somehow automatically compare these against the entire content of MD to see whether any CW entries have substantial semantic overlap with MD entries. Only when the English words in the definitions are exactly the same, or when the English words of one definition are a complete subset of those in the other dictionary's definition, could we decide automatically that the entries in the two dictionaries can be "merged". In all other cases, this would need to be assessed manually by a linguist. For example, we know of dictionary entries that concern the same sense but have practically no overlap in their definitions:

MD: mêscihew <- meschihew ᒣᐢᒋᐦᐁᐤ vp He kills them all, wipes them out.
CW: mêscihew <- s/he kills s.o. off, s/he annihilates s.o., s/he exterminates s.o. (VTA-1)
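As a rough sketch of the automatic check described above (not an agreed implementation), the "same words or complete subset" test could look like the following. The tokenization, which lowercases and keeps only runs of letters, and the function names are assumptions made for illustration.

```python
import re


def english_words(definition: str) -> set[str]:
    """Return the set of lowercased alphabetic tokens in a definition.

    Assumption: abbreviations such as "s.o." simply split into "s" and "o".
    """
    return set(re.findall(r"[a-z]+", definition.lower()))


def can_merge_automatically(def_a: str, def_b: str) -> bool:
    """True if the word sets are identical or one is a subset of the other."""
    words_a = english_words(def_a)
    words_b = english_words(def_b)
    if not words_a or not words_b:
        return False  # an empty definition gives no basis for merging
    return words_a <= words_b or words_b <= words_a


# The mêscihew definitions quoted above: same sense, but no automatic merge.
md_def = "He kills them all, wipes them out."
cw_def = "s/he kills s.o. off, s/he annihilates s.o., s/he exterminates s.o."
needs_linguist = not can_merge_automatically(md_def, cw_def)
```

With this tokenization the two mêscihew definitions share only "he" and "kills", so the pair would be left for manual assessment.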

An alternative is that 1) we match dictionary entries from multiple sources on the assumption that the combination of the entry head and the inflectional category is unique in every source, so that it can be used to map potentially similar entries 1:1; and 2) we merge entries from multiple sources only when that can be done automatically, i.e. when the content words in the English definitions overlap completely (or almost completely): the English content words in one source's entry can all be found in the other source's entry, and vice versa.
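A minimal sketch of this two-step alternative, under several assumptions: entries are plain dicts with hypothetical `head`, `category` and `definition` fields, inflectional categories have already been assigned in a common scheme, and the stopword list is a placeholder. Only the strict "overlap completely" case is implemented, not "almost completely".

```python
import re

# Placeholder stopword list (assumption); a real one would need tuning.
STOPWORDS = frozenset(
    {"a", "an", "the", "or", "he", "she", "s", "it", "him", "her", "them", "o", "t"}
)


def content_words(definition: str) -> set[str]:
    """Lowercased alphabetic tokens of a definition, minus stopwords."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return {t for t in tokens if t not in STOPWORDS}


def index_by_key(entries: list[dict]) -> dict[tuple[str, str], dict]:
    """Index entries by (head, category), enforcing the uniqueness assumption."""
    index = {}
    for entry in entries:
        key = (entry["head"], entry["category"])
        if key in index:
            raise ValueError(f"duplicate key within one source: {key}")
        index[key] = entry
    return index


def pair_entries(source_a: list[dict], source_b: list[dict]) -> list[tuple]:
    """Pair entries sharing a key; flag pairs whose content words coincide."""
    index_a, index_b = index_by_key(source_a), index_by_key(source_b)
    pairs = []
    for key in sorted(index_a.keys() & index_b.keys()):
        words_a = content_words(index_a[key]["definition"])
        words_b = content_words(index_b[key]["definition"])
        pairs.append((key, bool(words_a) and words_a == words_b))
    return pairs


# Definitions as quoted in this thread; the MD categories are assumed.
md = [
    {"head": "mowew", "category": "VTA", "definition": "He eats him."},
    {"head": "michiw", "category": "VAI", "definition": "He eats it"},
]
ae = [
    {"head": "mowew", "category": "VTA", "definition": "S/he eats them."},
    {"head": "miciw", "category": "VAI",
     "definition": "S/he eats it or consumes it; s/he munches it."},
]
result = pair_entries(md, ae)
```

Here only mowew is paired, and it is flagged as mergeable once pronouns are dropped. michiw and miciw never meet, because their heads differ until the orthography is standardized.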

We thus leave the manual assessment of sense similarity until later. In the interim, what we need to do in any event, and as a necessary step for eventual manual assessment, is a) standardize the orthographical form of the lexical entries in non-CW sources, b) assign an inflectional category to all dictionary entries, and c) assign a stem to all dictionary entries (at least those not in CW).

This strategy would allow us to aggregate not only MD but also AECD. With three sources, the merging criterion would be applied pairwise, since the English definitions of only two sources might be practically identical while the third source has a narrower or broader definition.
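The pairwise consideration could be sketched as follows, assuming the three entries have already been aligned as the same headword and inflectional category. The stopword list and function names are illustrative assumptions. The definitions are the ones quoted in this thread for CW mîcisow, AE micisiw and MD mitsow.

```python
import re
from itertools import combinations

# Placeholder stopword list (assumption).
STOPWORDS = frozenset({"a", "an", "the", "he", "she", "s", "it", "o", "t"})


def content_words(definition: str) -> set[str]:
    """Lowercased alphabetic tokens of a definition, minus stopwords."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return {t for t in tokens if t not in STOPWORDS}


def mergeable_pairs(definitions: dict[str, str]) -> list[tuple[str, str]]:
    """Return the source pairs whose content-word sets are identical."""
    words = {source: content_words(text) for source, text in definitions.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(words), 2)
        if words[a] and words[a] == words[b]
    ]


definitions = {
    "CW": "s/he eats, s/he has a meal",
    "AE": "S/he eats.",
    "MD": "He eats",
}
pairs = mergeable_pairs(definitions)
```

In this example only the AE and MD definitions coincide. The CW definition adds "has a meal", so it would stay separate until a linguist has assessed it.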

aarppe commented 4 years ago

Some very initial scrutiny of the AECD content indicates that a good deal of orthographical standardization will be needed, e.g.

CW: wâpiw | [generally in negative phrases:] s/he is blind
vs. AE: wâpiw | ᐋᐧᐱᐤ | wâpiwak | (vai) | S/he sees.
vs. MD: Ø

CW: wâpahtam | s/he sees s.t., s/he witnesses s.t.
vs. AE: wâpatam | ᐋᐧᐸᑕᒼ |   | (vti) | S/he sees it.
vs. MD: wapahtam

CW: wâpamêw | s/he sees s.o., s/he witnesses s.o.
vs. AE: wâpamew | ᐋᐧᐸᒣᐤ |   | (vta) | S/he sees him/her.
vs. MD: wapimew

CW: mîcisow | s/he eats, s/he has a meal
vs. AE: micisiw | ᒥᒋᓯᐤ | micisiwak | (vai) | S/he eats.
vs. MD: mitsow | He eats

CW: mîciw |
vs. AE: miciw | ᒥᒋᐤ | miciwak | (vai) | S/he eats it or consumes it; s/he munches it.
vs. MD: michiw | He eats it

CW: mowêw
vs. AE: mowew | ᒧᐁᐧᐤ | mowewak | (vta) | S/he eats them.
vs. MD: mowew | He eats him.

As can be seen above, vowel-length marking is not entirely complete, and there is variation in how pre-stop aspiration is marked and in how /ou/ is spelled (<iw> or <ow>) and how /ts/ is spelled (<c>, <t(i)s> or <ch>).
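Until the orthography is standardized, one possible stopgap is a loose comparison key that ignores two of these sources of variation, so that candidate matches can be surfaced for review. This is only a sketch under stated assumptions: the key is lossy and is not itself a standardization, the rule of dropping <h> before <p>, <t>, <k>, <c> is a guess at how to neutralize pre-stop aspiration, and the <iw>/<ow> and <c>/<ts>/<ch> variation is not handled at all.

```python
import re
import unicodedata


def loose_key(form: str) -> str:
    """Build a lossy key for finding candidate matches across sources.

    Assumptions: vowel length is written with a combining-mark diacritic
    (e.g. the circumflex) and pre-stop aspiration as <h> before a stop.
    """
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", form.lower())
    unmarked = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop <h> when it directly precedes a stop or affricate letter.
    return re.sub(r"h(?=[ptkc])", "", unmarked)


# Forms quoted above for the "see" verbs in CW, AE and MD.
vti_keys = {loose_key(f) for f in ("wâpahtam", "wâpatam", "wapahtam")}
vta_keys = {loose_key(f) for f in ("wâpamêw", "wâpamew", "wapimew")}
```

The three VTI forms collapse to a single key, while MD wapimew still differs from the CW and AE forms in its second vowel, so a key like this can only narrow the list that a linguist has to check.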

dwhieb commented 3 years ago

Database Specification Notes