Open fbanados opened 2 months ago
Here are counts of how often entries have alternative spellings, and how many such alternatives per entry:
less crk/Wolvengrey_altlab.toolbox | gawk 'BEGIN { FS="\n"; RS=""; } $0 ~ /\n\\alt/ { nalt++; alt=0; for(i=1; i<=NF; i++) if(match($i,"^.alt ")!=0) alt++; n[alt]++; } END { printf "n(alt)\talt\n"; for(i in n) printf "%i\t%i\n", n[i], i; }'
n(alt) alt
20918 0
4570 1
777 2
185 3
53 4
24 5
5 6
3 7
E.g., the entry pâmwayês (IPC) has seven \alt fields:
\alt maywês
\alt maywêsk
\alt mwayê
\alt mwayês
\alt pâmayas
\alt pâmayês
\alt pâmoyês
These could be presented in a tabular format, like the following:
Alternatives | |
---|---|
maywês | |
maywêsk | |
mwayê | |
mwayês | |
pâmayas | |
pâmayês | |
pâmoyês | |
I can't imagine what meaningful row labels could be; perhaps this should rather be a single-column table.
In any event, "Alternatives" would then require some relabelings, e.g. "Different spellings" in plain English, and something else in Cree.
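A minimal sketch of the single-column layout suggested above, assuming an entry is available as raw Toolbox text with one field per line. The function names `alt_fields` and `single_column_table` are hypothetical, and the `\lx` head-word marker in the sample entry is an assumption about the file layout, not something confirmed in this thread.

```python
def alt_fields(entry_text):
    """Return the values of all non-empty \\alt fields in one Toolbox entry."""
    values = []
    for line in entry_text.splitlines():
        # Split "\alt maywês" into the marker and its value.
        marker, _, value = line.strip().partition(" ")
        if marker == "\\alt" and value.strip():
            values.append(value.strip())
    return values


def single_column_table(values, header="Alternatives"):
    """Render the values as a one-column Markdown table."""
    rows = [f"| {header} |", "|---|"]
    rows.extend(f"| {value} |" for value in values)
    return "\n".join(rows)


# Sample entry; the \lx marker is assumed, the \alt values come from the thread.
entry = "\n".join([
    "\\lx pâmwayês",
    "\\alt maywês",
    "\\alt maywêsk",
    "\\alt mwayê",
    "\\alt mwayês",
    "\\alt pâmayas",
    "\\alt pâmayês",
    "\\alt pâmoyês",
])

print(single_column_table(alt_fields(entry), header="Different spellings"))
```

The `header` argument is where a relabeling such as "Different spellings" would be substituted for the default label.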
Implemented:
This is still missing relabelings, especially for Cree.
The relabelings have been added to crk.altlab.tsv.
I realized that the first column could show the dialect that a variant pertains to, as Arok codes that as wC for Woods Cree and sC for Swampy Cree. For instance, for the CW entry awêýiwa, one could present the following alternative forms in a tabular format:
Dial | Alt |
---|---|
pC | awîniwa |
sC | awêńiwa |
wC | awîthiwa |
The default dialect would be pC for Plains Cree. I can add relabelings for the dialect codes, perhaps using the ISO codes for the linguistic relabelings, the full language names for the plain English ones, and then the exonyms for the nêhiyawêwin ones.
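As a rough sketch of the dialect column with pC as the default, assuming the alternatives have already been extracted as (dialect code, form) pairs, with `None` where no dialect is marked. The names `DIALECT_LABELS` and `dialect_table` are hypothetical; the ISO 639-3 codes shown are the standard ones for these three languages, and the nêhiyawêwin labels are left out because they are not given here.

```python
# Dialect codes as used in the thread; the label sets are illustrative.
DIALECT_LABELS = {
    "pC": {"english": "Plains Cree", "linguistic": "crk"},
    "wC": {"english": "Woods Cree", "linguistic": "cwd"},
    "sC": {"english": "Swampy Cree", "linguistic": "csw"},
}
DEFAULT_DIALECT = "pC"  # forms without a dialect code count as Plains Cree


def dialect_table(alternatives, label_set=None):
    """Render (dialect_code_or_None, form) pairs as a two-column Markdown table.

    With label_set=None the raw codes (pC, sC, wC) are shown; otherwise the
    code is replaced by its label from the chosen set, falling back to the code.
    """
    rows = ["| Dial | Alt |", "|---|---|"]
    for code, form in alternatives:
        code = code or DEFAULT_DIALECT
        label = DIALECT_LABELS.get(code, {}).get(label_set, code)
        rows.append(f"| {label} | {form} |")
    return "\n".join(rows)


# Treating awîniwa as unmarked is an assumption made to show the default.
forms = [(None, "awîniwa"), ("sC", "awêńiwa"), ("wC", "awîthiwa")]

print(dialect_table(forms))                       # raw codes
print(dialect_table(forms, label_set="english"))  # plain English labels
```

Keeping the labels in one mapping keyed by dialect code means a further label set can be added later without touching the table logic.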
We want the following features:
In LEXC, this can be coded as follows:
LEXICON NOUNS_STEMS
...
maci-ayiwiwin:maci-ayiwiwin NI ;
maci-ayiwiwin:mac-âyiwiwin NI_sandhi ;
maci-ayiwiwin:macâyiwiwin NI_sandhi ;
...
LEXICON NI_sandhi
@P.Var.Sandhi@ NI ;
LEXICON NOUN_ENDLEX
...
@R.Var.Sandhi@+Var/Sandhi:@R.Var.Sandhi@ # ;
...
Then, the normative generator for dictionary purposes needs to filter out the variant cases. However, in spell-checking, we do want to recognize and generate those forms, so they will need to remain in the general normative analyzer and generator. The descriptive analyzers will always include the variants.
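The dictionary-side filtering could be approximated as a post-processing step, sketched below under the assumption that generator output is available as (analysis, wordform) pairs and that variants carry the `+Var/Sandhi` tag from the LEXC snippet. This filters a plain Python list and is not how the FST itself would be restricted; `dictionary_forms` is a hypothetical helper, and the `+N+I+Sg` tags in the sample data are placeholders.

```python
VARIANT_TAG = "+Var/Sandhi"  # tag taken from the LEXC snippet in this thread


def is_variant(analysis):
    """True if the analysis string carries the sandhi-variant tag."""
    return VARIANT_TAG in analysis


def dictionary_forms(generated):
    """Keep only the (analysis, wordform) pairs that are not tagged as variants."""
    return [(analysis, form) for analysis, form in generated if not is_variant(analysis)]


# Hypothetical generator output; the +N+I+Sg tags are placeholders.
generated = [
    ("maci-ayiwiwin+N+I+Sg", "maci-ayiwiwin"),
    ("maci-ayiwiwin+N+I+Sg+Var/Sandhi", "mac-âyiwiwin"),
    ("maci-ayiwiwin+N+I+Sg+Var/Sandhi", "macâyiwiwin"),
]

for analysis, form in dictionary_forms(generated):
    print(analysis, form)
```

For spell-checking, the unfiltered list would be used instead, so that the variant forms remain recognizable.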
We will also need to consider the spellrelax rules, so that they do not duplicate analyses for the variants.
\alt tags in Toolbox refer to alternative spellings (of the dictionary head) that should be included as part of the lexicographical info presented for entries. Generation of linguistInfo.analysis should include this info as well.