UAlbertaALTLab / itwewina

Replaced by https://github.com/UAlbertaALTLab/cree-intelligent-dictionary
GNU General Public License v3.0

+Err/Orth tag should be ignored in linguistic analysis #75

Open aarppe opened 5 years ago

aarppe commented 5 years ago

This is a followup of the question concerning êha in #73, but concerns also other forms.

The FST gives the +Err/Orth analysis in two cases:

  1. Non-standard hyphenation in verbs (or nouns), as in ewapamat

image

  2. Expressly listed, frequent misspelled forms, mainly of particles, such as êhâ for êha; cf. src/morphology/stems/particles.lexc. This list is corpus-based:

    mâna+Ipc+Err/Orth:mân # ;
    êkosi+Ipc+Err/Orth:êkos # ;
    mîna+Ipc+Err/Orth:mîn # ;
    mâka+Ipc+Err/Orth:mâk # ;
    êwako+Ipc+Err/Orth:êwak # ;
    mitoni+Ipc+Err/Orth:miton # ;
    ...

image

While the spelling relaxation allows misspelled forms to be recognized in general, spell-relaxed analyses are not given the +Err/Orth tag. However, since the spelling relaxation tries out all vowel-length combinations, the search string can also explicitly match one of these listed misspelled particles. That match provides the correctly spelled form as well, but with the +Err/Orth tag, resulting in two or more analyses. Since the normative forms in the right-hand analysis section are always correctly spelled, we do not need the +Err/Orth tag in the analysis (or its relabeling as '(Non-standard orthography)').

The +Err/Orth tag is already excluded when the normative form is generated, so it should be ignored in the presentation of the linguistic analysis (the relabeled tag sequence) as well.

If dropping the +Err/Orth tag leaves two otherwise fully matching analyses, these exact duplicates should be collapsed into one. In case 1 there are two distinct analyses (when +Err/Orth is ignored), but for êha the two analyses are otherwise exact duplicates.
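The behaviour requested above can be sketched roughly as follows. This is only an illustration, not itwewina's actual code: the function names are hypothetical, and it assumes each analysis arrives as a plain string of `+`-separated tags, like the upper side of the lexc entries listed earlier.

```python
from typing import Iterable, List

ERR_ORTH = "Err/Orth"  # tag name as it appears after the "+" separator


def strip_err_orth(analysis: str) -> str:
    """Drop the Err/Orth tag from one '+'-separated analysis string."""
    parts = analysis.split("+")
    return "+".join(part for part in parts if part != ERR_ORTH)


def collapse_analyses(analyses: Iterable[str]) -> List[str]:
    """Strip Err/Orth from every analysis, then drop exact duplicates.

    The order of first occurrence is preserved.
    """
    seen = set()
    result = []
    for analysis in analyses:
        cleaned = strip_err_orth(analysis)
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# The two analyses for êha differ only in the Err/Orth tag, so they collapse.
print(collapse_analyses(["êha+Ipc", "êha+Ipc+Err/Orth"]))
```

Analyses that still differ once the tag is gone (as in case 1) are both kept; only exact duplicates are merged.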

aarppe commented 5 years ago

We still have +Err/Orth appearing in linguistic analysis as '(non-standard orthography)'. While we recognize words with the descriptive FST that have non-standard use of hyphens, when generating the normative form the +Err/Orth tag should be removed, and the form generated without that tag.

image

In the above example case, the normative form ê-wâpamât is correctly presented for the search string ewapamat, but the linguistic analysis should not contain the user-friendly relabeling of +Err/Orth as '(Non-standard orthography)'.

aarppe commented 5 years ago

As an interim measure, we can simply have an empty string for the +Err/Orth tag in the itwewina.relabling file. The tag has its uses on the FST side (but needs to be ignored when the normative form is generated), but there's no added value in presenting it as part of the linguistic analysis as '(Non-standard orthography)', as the generated form is by definition "standard".
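A minimal sketch of how the empty-string approach could work on the presentation side. The format of the relabeling file is not shown in this thread, so the dictionary below is only a stand-in for it; the label text for Ipc and the function name are hypothetical as well.

```python
from typing import Dict, Iterable

# Stand-in for the relabeling table (assumed shape: tag -> user-friendly label).
# Mapping Err/Orth to an empty string is the interim measure proposed above.
RELABELS: Dict[str, str] = {
    "Ipc": "Particle",  # hypothetical label text
    "Err/Orth": "",     # empty label: nothing is shown for this tag
}


def relabel(tags: Iterable[str]) -> str:
    """Turn a tag sequence into a display string, skipping empty labels."""
    labels = (RELABELS.get(tag, tag) for tag in tags)
    return " ".join(label for label in labels if label)


print(relabel(["Ipc", "Err/Orth"]))
```

Skipping empty labels when joining avoids stray separators in the displayed analysis. This only hides the tag from the reader, so collapsing the resulting duplicate analyses would still have to be handled separately.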
