CW: store separate versions of lemma for FST and itwêwina

dwhieb commented 3 years ago

@dwhieb You do not really want to regularize <ý> to <y> for Plains Cree either, when importing the data from Arok's CW into the dictionary database.

Currently, we retain the <ý> when e.g. we create the stems in LEXC code - this is useful as that allows us to convert ý -> y for Plains Cree, and ý -> {th} for Woods Cree. I don't include <ý> in the lemmas for now, since that could complicate use of the FSTs: typing <ý> on most regular keyboards is not trivial.
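To illustrate why keeping <ý> in the stored stem is useful, here is a minimal Python sketch of deriving dialect-specific forms from a single stem. It is only an illustration of the idea, not the LEXC or rewrite-rule code itself; the dialect keys, the `realize` function, and the example string are all hypothetical.

```python
# Hypothetical sketch: one stored stem, several dialect realizations.
# The dialect keys and function name are illustrative assumptions,
# not taken from the actual FST sources.
DIALECT_REFLEX = {
    "plains": "y",   # ý -> y for Plains Cree
    "woods": "th",   # ý -> th for Woods Cree
}


def realize(stem: str, dialect: str) -> str:
    """Replace <ý> in a stored stem with the reflex used in the given dialect."""
    return stem.replace("ý", DIALECT_REFLEX[dialect])


# "xaýax" is a made-up placeholder string, not a real dictionary entry.
plains_form = realize("xaýax", "plains")
woods_form = realize("xaýax", "woods")
```

If <ý> were regularized to <y> at import time, the Woods Cree form could no longer be derived, since a plain <y> would not say which occurrences need the other reflex.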

I could imagine us retaining <ý> in the dictionary database as well, and then having an option allowing users to select whether they want to see that marking within itwêwina or not. Arok in fact has on a few occasions been inclined to make that the default behavior, but I've managed to successfully argue that it'd be better to offer it as an option - in terms of the simplicity of a regular writing system, we ought not to force users to make explicit distinctions that are primarily historical/linguistic.

For the few Swampy Cree forms with <ń>, I'd convert those to <ý>, but keep a note somewhere of their provenance. Alternatively, we could keep <ń>, but then I'd need to add that to the morphophonological rewrite rules.

__Originally posted by @aarppe in https://github.com/UAlbertaALTLab/dictionary-database/pull/30#discussion_r606746619__

dwhieb commented 3 years ago

@aarppe It sounds like we want to retain <ý> for the purposes of LEXC, but itwêwina will need to use the standardized <y>. If that's the case, I can store <ý> versions of headwords in the fstStem field, but store standardized <y> versions in the lemma field, which is intended to be used by itwêwina.

aarppe commented 3 years ago

Following up on our Tech Team discussion, we'd probably need to have all four of 1) regular lemma (no <ý>, also used as the FST lemma); 2) "linguistic" lemma (with <ý>); 3) stem (with <ý>, whichever Arok provides); and 4) fststem (with <ý>, if Arok's stem isn't sufficient for FST purposes).

Then within itwêwina, we can still match the lemma from the FST with the lemma in the dictionary entry, and allow for showing the linguistic lemma for anyone who wants to know the location of the dialectal <ý>.
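The four fields and the matching step could be sketched roughly as below. This is a hedged Python illustration only: the class name `Entry`, the snake_case field names, and the helper functions are assumptions made for the example, not the actual database schema or itwêwina code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Entry:
    """Hypothetical record holding the four forms discussed above."""
    lemma: str                      # 1) regular lemma, no <ý>; also the FST lemma
    linguistic_lemma: str           # 2) "linguistic" lemma, with <ý>
    stem: str                       # 3) stem as provided in the source, with <ý>
    fst_stem: Optional[str] = None  # 4) only set when the source stem is not enough for the FST

    def stem_for_fst(self) -> str:
        """Prefer the dedicated FST stem, falling back to the source stem."""
        return self.fst_stem if self.fst_stem is not None else self.stem


def find_entry(entries: Iterable[Entry], fst_lemma: str) -> Optional[Entry]:
    """Match the lemma returned by the FST against the regular lemma field."""
    for entry in entries:
        if entry.lemma == fst_lemma:
            return entry
    return None


def headword(entry: Entry, show_linguistic: bool = False) -> str:
    """Show the regular lemma by default; the <ý> form only on request."""
    return entry.linguistic_lemma if show_linguistic else entry.lemma
```

Because matching only ever uses the regular lemma, the <ý> forms can be added or corrected without affecting FST lookups.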

eddieantonio commented 3 years ago

Just to add my 2¢: we can store <ý> in itwêwina as the underlying representation, and display <y> by default. Then we can show <ý> as an orthography option. itwêwina does not need <ý> converted to <y> beforehand, and in fact, it's better if we agree that this conversion (as well as the corresponding one for <ń>) happens as late as possible: namely, when presenting forms to users.

aarppe commented 3 years ago

As I'm inclined not to require <ý> in the lemmas for the FST, if we keep <ý>, we need to implement in the code its regularization to <y> when invoking the FST. I'm fine with that, but then we'd want to have that <ý> to <y> conversion done in as language-neutral a fashion as possible.
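One way to keep that regularization language-neutral would be to pass the character mapping in as data rather than hard-coding <ý> in the lookup code. The following is a minimal Python sketch under that assumption; `make_fst_lookup` and its parameters are hypothetical names, and `analyze` stands in for whatever function actually calls the FST.

```python
import unicodedata
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def make_fst_lookup(
    analyze: Callable[[str], T],
    regularize: Mapping[str, str],
) -> Callable[[str], T]:
    """Wrap an FST call so that forms are regularized just before lookup.

    `regularize` maps single characters to their FST-side replacements,
    e.g. {"ý": "y"}; other languages can supply their own table.
    """
    # Compose the keys (NFC) so that each one is a single code point,
    # which is what str.maketrans requires for dictionary keys.
    table = str.maketrans(
        {unicodedata.normalize("NFC", src): dst for src, dst in regularize.items()}
    )

    def lookup(form: str) -> T:
        # NFC also turns "y" + combining acute (U+0301) into the single
        # code point "ý", so both encodings are regularized the same way.
        normalized = unicodedata.normalize("NFC", form)
        return analyze(normalized.translate(table))

    return lookup
```

The stored forms keep <ý>; only the string handed to the FST is regularized, and the mapping itself lives in configuration rather than in the lookup logic.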

As for showing <ý>, yes, we'd want that as an option, but I wouldn't prefer it as default due to the principle of keeping orthographical distinctions to the bare minimum necessary.

UAlbertaALTLab / crk-db

CW: store separate versions of lemma for FST and itwêwina #44