UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

Alternative spellings from CW #123

Open fbanados opened 2 months ago

fbanados commented 2 months ago

The \alt tags in the Toolbox file mark alternative spellings of the dictionary head. These should be included in the lexicographical information presented for each entry, and the generation of linguistInfo.analysis should include them as well.
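
As a rough illustration of the extraction step, here is a minimal Python sketch that collects the \alt values of each record. It assumes blank lines separate records (the same assumption the gawk one-liner in the next comment makes with RS="") and that each field sits on its own line as `\marker value`. The function names, the sample text, and the \sro and \ps markers in it are made up for this example and are not taken from the crk-db code.

```python
import re


def split_records(text: str) -> list[str]:
    # Assumption: Toolbox entries are separated by one or more blank lines.
    return [r for r in re.split(r"\n\s*\n", text) if r.strip()]


def alt_spellings(record: str) -> list[str]:
    # Collect the non-empty value of every \alt field, in file order.
    alts = []
    for line in record.splitlines():
        marker, _, value = line.partition(" ")
        if marker == "\\alt" and value.strip():
            alts.append(value.strip())
    return alts


# Hypothetical two-record sample; only the \alt lines mirror the issue text.
sample = "\n".join([
    r"\sro pâmwayês",
    r"\ps IPC",
    r"\alt maywês",
    r"\alt mwayês",
    "",
    r"\sro atim",
    r"\ps NA",
])

for record in split_records(sample):
    print(alt_spellings(record))
```

An entry without \alt fields yields an empty list, so downstream code can omit the alternatives table for it.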

aarppe commented 2 months ago

Here are counts of how often entries have alternative spellings, and how many such alternatives per entry:

less crk/Wolvengrey_altlab.toolbox | gawk 'BEGIN { FS="\n"; RS=""; } $0 ~ /\n\\alt/ { nalt++; alt=0; for(i=1; i<=NF; i++) if(match($i,"^.alt ")!=0) alt++; n[alt]++; } END { printf "n(alt)\talt\n"; for(i in n) printf "%i\t%i\n", n[i], i; }'
n(alt)  alt
20918   0
4570    1
777     2
185     3
53      4
24      5
5       6
3       7

aarppe commented 2 months ago

E.g., entry pâmwayês (IPC) has seven \alt fields:

\alt maywês
\alt maywêsk
\alt mwayê
\alt mwayês
\alt pâmayas
\alt pâmayês
\alt pâmoyês

These could be presented in a tabular format, like the following:

Alternatives
maywês
maywêsk
mwayê
mwayês
pâmayas
pâmayês
pâmoyês

I can't imagine what could be meaningful row labels; perhaps this could rather be a single column table.

In any event, the label "Alternatives" would then need relabelings, e.g. "Different spellings" in plain English, and something else in Cree.

fbanados commented 1 month ago

Implemented:
[Screenshot, 2024-07-24 at 5:09:59 PM]

This is still missing relabellings, especially for Cree.

aarppe commented 1 month ago

The relabelings have been added to crk.altlab.tsv.

aarppe commented 1 month ago

I realized that the first column could show the dialect a variant belongs to, since Arok codes that as wC for Woods Cree and sC for Swampy Cree. For instance, for the CW entry awêýiwa, one could present the following alternative forms in a tabular format:

Dial Alt
pC awîniwa
sC awêńiwa
wC awîthiwa

The default dialect would be pC for Plains Cree. I can add relabelings for the dialect codes, perhaps using the ISO codes for the linguistic relabelings, the full language names for the plain English ones, and then the exonyms for the nêhiyawêwin ones.
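
A minimal Python sketch of how such rows could be labelled, under some assumptions: the dialect code has already been separated from the spelling (how it is encoded in the Toolbox source is not shown here), a missing code falls back to pC, and the linguistic labels are the ISO 639-3 codes crk, cwd and csw. `DIALECT_LABELS` and `alternative_rows` are hypothetical names, and the nêhiyawêwin labels are left out because they are not given in this thread.

```python
DEFAULT_DIALECT = "pC"  # Plains Cree is the default when no code is given

# Hypothetical label table: ISO 639-3 codes for the linguistic mode,
# full language names for the plain English mode.
DIALECT_LABELS = {
    "pC": {"linguistic": "crk", "english": "Plains Cree"},
    "wC": {"linguistic": "cwd", "english": "Woods Cree"},
    "sC": {"linguistic": "csw", "english": "Swampy Cree"},
}


def alternative_rows(alts, mode="english"):
    # alts: iterable of (dialect_code_or_None, spelling) pairs.
    rows = []
    for dialect, form in alts:
        code = dialect or DEFAULT_DIALECT
        # Fall back to the raw code if no label is known for it.
        label = DIALECT_LABELS.get(code, {}).get(mode, code)
        rows.append((label, form))
    return rows


rows = alternative_rows([(None, "awîniwa"), ("sC", "awêńiwa"), ("wC", "awîthiwa")])
for label, form in rows:
    print(f"{label}\t{form}")
```

Keeping the labels in one table means the display mode can switch between relabelings without touching the stored dialect codes.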

aarppe commented 3 weeks ago

We want the following features:

In LEXC, this can be coded as follows:

LEXICON NOUNS_STEMS
...
maci-ayiwiwin:maci-ayiwiwin NI ;
maci-ayiwiwin:mac-âyiwiwin NI_sandhi ;
maci-ayiwiwin:macâyiwiwin NI_sandhi ;
...

LEXICON NI_sandhi
@P.Var.Sandhi@ NI ;

LEXICON NOUN_ENDLEX
...
@R.Var.Sandhi@+Var/Sandhi:@R.Var.Sandhi@ # ;
...

Then, the normative generator for dictionary purposes needs to filter out the variant cases. However, in spell-checking, we do want to recognize and generate those forms, so they will need to remain in the general normative analyzer and generator. The descriptive analyzers will always include the variants.
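
One way to picture the filtering for the dictionary generator is as a post-processing step over (analysis, wordform) pairs, sketched below in Python. This is only an assumption about where the filter could live; it could equally be done inside the FST build. The +Var/Sandhi tag comes from the LEXC above, while the other tags in the sample analyses (+N+I+Sg) and the function names are illustrative.

```python
VARIANT_TAG = "+Var/Sandhi"  # tag emitted by NOUN_ENDLEX for sandhi variants


def is_variant(analysis: str) -> bool:
    # An analysis counts as a variant if it carries the variant tag.
    return VARIANT_TAG in analysis


def dictionary_forms(pairs):
    # Keep only pairs whose analysis has no variant tag (dictionary use).
    return [(a, w) for a, w in pairs if not is_variant(a)]


# Illustrative generator output; tag strings other than +Var/Sandhi are assumed.
generated = [
    ("maci-ayiwiwin+N+I+Sg", "maci-ayiwiwin"),
    ("maci-ayiwiwin+N+I+Sg+Var/Sandhi", "mac-âyiwiwin"),
    ("maci-ayiwiwin+N+I+Sg+Var/Sandhi", "macâyiwiwin"),
]

print(dictionary_forms(generated))
```

For spell-checking, the unfiltered list would be used so that the variant forms are still recognized and generated.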

We will also need to consider the spellrelax rules, so that they do not duplicate analyses for the variants.