UAlbertaALTLab / itwewina

Replaced by https://github.com/UAlbertaALTLab/cree-intelligent-dictionary
GNU General Public License v3.0

Autodetect language from query. #4

Open eddieantonio opened 6 years ago

eddieantonio commented 6 years ago


Given I have not explicitly set a search language, and given a search term in the language of my choosing, when I execute the search, then itwêwina should guess the source language and present the results in the appropriate from→to pair.

eddieantonio commented 6 years ago

Place ALL the English words into the descriptive analyzer and figure out how many things are actually analyzed.
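One way to do the counting is to run the English word list through `hfst-lookup` and tally the results. The sketch below is only an illustration: it assumes the output format shown in the transcripts in this thread (input, analysis, weight, with unrecognized inputs marked by an analysis ending in `+?`), and the function name `count_analyzed` is made up for this example.

```python
from typing import Tuple


def count_analyzed(lookup_output: str) -> Tuple[int, int]:
    """Return (analyzed, total) counts of distinct inputs in hfst-lookup output."""
    analyzed = set()
    seen = set()
    for line in lookup_output.splitlines():
        if not line.strip():
            continue  # blank lines separate the results for each input
        # Prefer tab-separated fields; fall back to any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 2:
            continue  # not a result line
        word, analysis = fields[0], fields[1]
        seen.add(word)
        # An analysis ending in "+?" means the analyzer did not recognize the input.
        if not analysis.endswith("+?"):
            analyzed.add(word)
    return len(analyzed), len(seen)


# Lines taken from the lookup transcript in this thread.
sample = (
    "nipaw\tnipâw+V+AI+Ind+Prs+3Sg\t0.000000\n"
    "\n"
    "waᐸnat\twaᐸnat+?\tinf\n"
    "\n"
    "waᐸmat\twâpamêw+V+TA+Cnj+Prs+2Sg+3SgO\t0.000000\n"
    "waᐸmat\twâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO\t0.000000\n"
)
print(count_analyzed(sample))  # (2, 3)
```

An input with several analyses is counted once, so the ratio reflects how many English words would be mistaken for Cree, not how many analyses they produce.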

eddieantonio commented 6 years ago

From #15:

Techniques:

  • multi-input crk-input FST + macron/syllabics count + circumflex default
  • OR language/orthography guesser module based on crk/eng training data

Preliminary tests indicate that we already have the FSTs minimally needed for implementing this (cf. my email 5.7.2018):

It occurred to me to quickly check whether we already have the components for an anything-goes input to a descriptive analyzer that creates SRO/circumflex analyses, and indeed this can be composed from two existing FSTs, since the descriptive analyzers incorporate spell-relaxation that allows for basically any variation of writing a vowel, correctly or wrongly, with any combination of macrons or circumflexes.

hfst-compose -F -1 src/orthography/Cans-to-Latn.compose.hfst -2 src/analyser-gt-desc.hfst -o Cans-and-Latn-to-Latn-gt-desc.hfst

This then analyses the following character combinations (the pê-nipâw analyses are incorrect; that is the result of an error in the AEW source CSV file and is not relevant to this issue):

hfst-lookup -q Cans-and-Latn-to-Latn-gt-desc.hfst
nipaw
nipaw    nipâw+V+AI+Ind+Prs+3Sg    0.000000

nipāw
nipāw    nipâw+V+AI+Ind+Prs+3Sg    0.000000

ᐘᐸᒪᐟ
ᐘᐸᒪᐟ    wâpamêw+V+TA+Cnj+Prs+2Sg+3SgO    0.000000
ᐘᐸᒪᐟ    wâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO    0.000000

waᐸᒪᐟ
waᐸᒪᐟ    wâpamêw+V+TA+Cnj+Prs+2Sg+3SgO    0.000000
waᐸᒪᐟ    wâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO    0.000000

waᐸnat
waᐸnat    waᐸnat+?    inf

waᐸmat
waᐸmat    wâpamêw+V+TA+Cnj+Prs+2Sg+3SgO    0.000000
waᐸmat    wâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO    0.000000

wâᐸmāt
wâᐸmāt    wâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO    0.000000

ewâᐸmāt
ewâᐸmāt    PV/e+wâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO+Err/Orth    0.000000

e-wâᐸmāt
e-wâᐸmāt    PV/e+wâpamêw+V+TA+Cnj+Prs+3Sg+4Sg/PlO    0.000000

I suppose this could be followed by a check: if the input contains any macron characters, then the input is selected as SRO/macron; if it contains any syllabics, then as Cans; and if neither applies, then as SRO/circumflex. Or one could choose the input/content option with the most representative characters in the input, with SRO/circumflex as the default.
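A minimal sketch of the second variant (pick the orthography with the most representative characters, defaulting to SRO/circumflex) might look like the following. The function names are hypothetical, the labels are the ones used in this thread, and the tie-breaking order (circumflex, then macron, then syllabics) is an assumption of this sketch, not something decided above.

```python
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"
COMBINING_MACRON = "\u0304"


def is_syllabic(ch: str) -> bool:
    """True for characters in the Unified Canadian Aboriginal Syllabics blocks."""
    return "\u1400" <= ch <= "\u167f" or "\u18b0" <= ch <= "\u18ff"


def count_marks(text: str) -> dict:
    """Count the characters that are representative of each orthography."""
    # Key order matters: on a tie (including all zeros) the first key wins,
    # which makes SRO/circumflex the default. This ordering is an assumption.
    counts = {"SRO/circumflex": 0, "SRO/macron": 0, "Cans": 0}
    # NFD splits precomposed vowels such as â or ā into base letter + mark.
    for ch in unicodedata.normalize("NFD", text):
        if ch == COMBINING_CIRCUMFLEX:
            counts["SRO/circumflex"] += 1
        elif ch == COMBINING_MACRON:
            counts["SRO/macron"] += 1
        elif is_syllabic(ch):
            counts["Cans"] += 1
    return counts


def guess_orthography(text: str) -> str:
    counts = count_marks(text)
    return max(counts, key=counts.get)


print(guess_orthography("nipaw"))   # SRO/circumflex (default, no marks)
print(guess_orthography("nipāw"))   # SRO/macron
print(guess_orthography("ᐘᐸᒪᐟ"))    # Cans
print(guess_orthography("waᐸᒪᐟ"))   # Cans
```

The first variant (any macron wins, then any syllabics) would replace the `max` call with two `if counts[...] > 0` checks in that order.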

The same composition could be applied to the front-end of the other descriptive input FSTs as well.

As for the spelling relaxation, beyond the vowel-length variation rules, we are seeing more and more possible analyses, so either we'd want to add contextual restrictions to when various orthographical variations are likely, or deal with this through some form of a weighted model (which would need misspelled training data).

The other option that we could try out is using either the Ahenakew-Wolfart corpus or the descriptive analyzer to generate a bunch of potential Cree inputs, and then contrasting that with a comparable amount of English data to train the sort-of language guesser you've already started with.
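To make the training idea concrete, here is a toy character-bigram guesser with add-one smoothing. It is not the guesser mentioned above, just a sketch of the general approach; the function names are hypothetical and the two three-word lists stand in for the corpus-derived Cree data and the English data, which would be far larger in practice.

```python
import math
import unicodedata
from collections import Counter


def fold(word: str) -> str:
    """Lowercase and drop combining marks so that â, ā and a compare equal."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bigrams(word: str) -> list:
    """Character bigrams of the folded word, with ^ and $ as boundary markers."""
    padded = "^" + fold(word) + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words_by_language: dict) -> dict:
    """Build one bigram frequency table per language."""
    return {
        lang: Counter(bg for word in words for bg in bigrams(word))
        for lang, words in words_by_language.items()
    }


def guess_language(word: str, model: dict) -> str:
    """Return the language whose bigram model gives the word the highest score."""
    vocabulary = set().union(*model.values())

    def score(counts: Counter) -> float:
        # Add-one smoothing; the extra +1 reserves mass for unseen bigrams.
        total = sum(counts.values()) + len(vocabulary) + 1
        return sum(math.log((counts[bg] + 1) / total) for bg in bigrams(word))

    return max(model, key=lambda lang: score(model[lang]))


# Toy training data, for illustration only.
model = train({
    "crk": ["nipâw", "wâpamêw", "itwêwina"],
    "eng": ["sleep", "see", "dictionary"],
})
print(guess_language("wâpamât", model))  # crk
print(guess_language("see", model))      # eng
```

Folding away the diacritics means the guesser does not rely on users typing vowel length, which matches the spell-relaxed behaviour of the descriptive analyzer.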

The current hand-verified morphological gold-standard, thanks to Katie's and Atticus' work, is available for ALTLab use at (we need to move this in SVN under giella/art to have proper version control for the future):

~/Google\ Drive/CreeFST/Wolfart/Wolfart_ucrk_160106.anl_freq.1-18646.170515.txt

For the weighting of the FST, I'd be using a disambiguated version, using the CG parser developed by Katie (which should make better guesses at the contextually appropriate analyses in contrast to some crude general heuristics going for the simplest analyses with least morphological features).
