UAlbertaALTLab / korp-config

diacritic-insensitive search #2

Open fbanados opened 1 month ago

fbanados commented 1 month ago

We'll be adding a feature to do diacritic-insensitive search.

This is currently implemented (to be deployed later) as a regexp-based substitution on the search string: each vowel, in any of its variants, is replaced with a regexp that CWB should understand and that matches all variants of that vowel:

/a|â|á|à|ā/ -> "[a|â|á|à|ā]"
/e|ê|é|è|ē/ -> "[e|ê|é|è|ē]"
/i|î|í|ì|ī/ -> "[i|î|í|ì|ī]"
/o|ô|ó|ò|ō/ -> "[o|ô|ó|ò|ō]"
/u|û|ú|ù|ū/ -> "[u|û|ú|ù|ū]"

Is this currently sufficient? I'm sure we will need a more elaborate process/regexp once we add Tsuut'ina corpora (tones, plus making ' optional, at the very least).
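As a rough illustration of the substitution described above, here is a minimal Python sketch. It is not the code running on the Korp frontend; the names `VOWEL_CLASSES`, `CHAR_TO_CLASS` and `relax_diacritics` are made up for this example. One deliberate difference from the rules as written: inside a bracket expression `|` is a literal character, so the sketch lists the variants without it (`[aâáàā]` rather than `[a|â|á|à|ā]`).

```python
import re
import unicodedata

# Vowel classes copied from the rules above; each string lists variants
# that should match one another. NFC keeps every variant a single character.
VOWEL_CLASSES = [
    unicodedata.normalize("NFC", cls)
    for cls in ("aâáàā", "eêéèē", "iîíìī", "oôóòō", "uûúùū")
]

# Map every variant to a bracket expression covering its whole class.
CHAR_TO_CLASS = {ch: f"[{cls}]" for cls in VOWEL_CLASSES for ch in cls}


def relax_diacritics(query: str) -> str:
    """Return a regexp matching `query` with any diacritic variant of each vowel."""
    # NFC so that "a" + combining circumflex becomes the single character "â".
    normalized = unicodedata.normalize("NFC", query)
    # Assumption: the query is a plain word, so all other characters are
    # escaped and matched literally.
    return "".join(CHAR_TO_CLASS.get(ch, re.escape(ch)) for ch in normalized)
```

With this sketch, `relax_diacritics("ekwa")` yields `[eêéèē]kw[aâáàā]`, which matches both `ekwa` and `êkwa`.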

fbanados commented 1 month ago

@aarppe should I just look for detailed substitutions on the search regexes in the spellrelax regex FST files (src/fst/orthography/spellrelax.regex)?

aarppe commented 1 month ago

For Cree with the first approach above, you would want to include the macron variants for the vowels, too, as well as accented vs. unaccented .

For a more general approach, if we could make use of the information in the spell-relax code, or actually the FST created with the spell-relax code, that could be worthwhile, in particular if you can plug it in straightforwardly. That approach could then apply to other languages as well; all you'd have to do is plug in the spell-relaxer.

With the spell-relaxer, you can create all the ways in which a word could be misspelled, cf. the 128 options for wapamew:

echo 'wapamew' | hfst-lookup -q src/fst/orthography/spellrelax.lookup.hfst        
wapamew wabamew 0.000000
wapamew wabamewh    0.000000
wapamew wabumew 0.000000
wapamew wabumewh    0.000000
wapamew wabámew 0.000000
wapamew wabámewh    0.000000
wapamew wabâmew 0.000000
wapamew wabâmewh    0.000000
wapamew wahbamew    0.000000
wapamew wahbamewh   0.000000
wapamew wahbumew    0.000000
wapamew wahbumewh   0.000000
wapamew wahbámew    0.000000
wapamew wahbámewh   0.000000
wapamew wahbâmew    0.000000
wapamew wahbâmewh   0.000000
wapamew wahpamew    0.000000
wapamew wahpamewh   0.000000
wapamew wahpumew    0.000000
wapamew wahpumewh   0.000000
wapamew wahpámew    0.000000
wapamew wahpámewh   0.000000
wapamew wahpâmew    0.000000
wapamew wahpâmewh   0.000000
wapamew wapamew 0.000000
wapamew wapamewh    0.000000
wapamew wapumew 0.000000
wapamew wapumewh    0.000000
wapamew wapámew 0.000000
wapamew wapámewh    0.000000
wapamew wapâmew 0.000000
wapamew wapâmewh    0.000000
wapamew wubamew 0.000000
wapamew wubamewh    0.000000
wapamew wubumew 0.000000
wapamew wubumewh    0.000000
wapamew wubámew 0.000000
wapamew wubámewh    0.000000
wapamew wubâmew 0.000000
wapamew wubâmewh    0.000000
wapamew wuhbamew    0.000000
wapamew wuhbamewh   0.000000
wapamew wuhbumew    0.000000
wapamew wuhbumewh   0.000000
wapamew wuhbámew    0.000000
wapamew wuhbámewh   0.000000
wapamew wuhbâmew    0.000000
wapamew wuhbâmewh   0.000000
wapamew wuhpamew    0.000000
wapamew wuhpamewh   0.000000
wapamew wuhpumew    0.000000
wapamew wuhpumewh   0.000000
wapamew wuhpámew    0.000000
wapamew wuhpámewh   0.000000
wapamew wuhpâmew    0.000000
wapamew wuhpâmewh   0.000000
wapamew wupamew 0.000000
wapamew wupamewh    0.000000
wapamew wupumew 0.000000
wapamew wupumewh    0.000000
wapamew wupámew 0.000000
wapamew wupámewh    0.000000
wapamew wupâmew 0.000000
wapamew wupâmewh    0.000000
wapamew wábamew 0.000000
wapamew wábamewh    0.000000
wapamew wábumew 0.000000
wapamew wábumewh    0.000000
wapamew wábámew 0.000000
wapamew wábámewh    0.000000
wapamew wábâmew 0.000000
wapamew wábâmewh    0.000000
wapamew wáhbamew    0.000000
wapamew wáhbamewh   0.000000
wapamew wáhbumew    0.000000
wapamew wáhbumewh   0.000000
wapamew wáhbámew    0.000000
wapamew wáhbámewh   0.000000
wapamew wáhbâmew    0.000000
wapamew wáhbâmewh   0.000000
wapamew wáhpamew    0.000000
wapamew wáhpamewh   0.000000
wapamew wáhpumew    0.000000
wapamew wáhpumewh   0.000000
wapamew wáhpámew    0.000000
wapamew wáhpámewh   0.000000
wapamew wáhpâmew    0.000000
wapamew wáhpâmewh   0.000000
wapamew wápamew 0.000000
wapamew wápamewh    0.000000
wapamew wápumew 0.000000
wapamew wápumewh    0.000000
wapamew wápámew 0.000000
wapamew wápámewh    0.000000
wapamew wápâmew 0.000000
wapamew wápâmewh    0.000000
wapamew wâbamew 0.000000
wapamew wâbamewh    0.000000
wapamew wâbumew 0.000000
wapamew wâbumewh    0.000000
wapamew wâbámew 0.000000
wapamew wâbámewh    0.000000
wapamew wâbâmew 0.000000
wapamew wâbâmewh    0.000000
wapamew wâhbamew    0.000000
wapamew wâhbamewh   0.000000
wapamew wâhbumew    0.000000
wapamew wâhbumewh   0.000000
wapamew wâhbámew    0.000000
wapamew wâhbámewh   0.000000
wapamew wâhbâmew    0.000000
wapamew wâhbâmewh   0.000000
wapamew wâhpamew    0.000000
wapamew wâhpamewh   0.000000
wapamew wâhpumew    0.000000
wapamew wâhpumewh   0.000000
wapamew wâhpámew    0.000000
wapamew wâhpámewh   0.000000
wapamew wâhpâmew    0.000000
wapamew wâhpâmewh   0.000000
wapamew wâpamew 0.000000
wapamew wâpamewh    0.000000
wapamew wâpumew 0.000000
wapamew wâpumewh    0.000000
wapamew wâpámew 0.000000
wapamew wâpámewh    0.000000
wapamew wâpâmew 0.000000
wapamew wâpâmewh    0.000000

fbanados commented 1 month ago

This would require major changes in the Korp infrastructure, but it can definitely be explored in the long term.

The current ~hack~ approach makes use of the fact that CWB understands regexps, and substitutes the appropriate regexp into the query before sending the request to the backend. Eventually this could be done on the backend side, which is just a thin wrapper around CWB. We could also plug the FST in there, although I suspect this would generate very long queries that may not work well with CWB; that remains to be explored.
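To get a feel for how long such queries could become, here is a small Python sketch that turns `hfst-lookup` output into a single regexp alternation. It only parses text that has already been produced; the helper names `variants_from_lookup` and `alternation` are invented for this example, and whether CWB copes well with patterns of this size is exactly the open question.

```python
import re


def variants_from_lookup(output: str) -> list[str]:
    """Collect the second column of hfst-lookup output, in first-seen order."""
    seen: dict[str, None] = {}
    for line in output.splitlines():
        fields = line.split()
        # Expected shape per line: input form, output form, weight.
        # hfst-lookup marks inputs without results with a trailing "+?".
        if len(fields) == 3 and not fields[1].endswith("+?"):
            seen.setdefault(fields[1], None)
    return list(seen)


def alternation(variants: list[str]) -> str:
    """Join the variants into one regexp alternation, escaping metacharacters."""
    return "(" + "|".join(re.escape(v) for v in variants) + ")"


# Three lines taken from the wapamew listing earlier in this thread.
sample = """\
wapamew wabamew 0.000000
wapamew wabamewh    0.000000
wapamew wapâmew 0.000000
"""
pattern = alternation(variants_from_lookup(sample))
print(pattern, len(pattern))
```

For the full wapamew listing, the same construction would give an alternation on the order of a thousand characters for a single token, which is the kind of query length that would need testing against CWB.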

aarppe commented 1 month ago

I think dealing with the diacritics of various sorts is the most important thing for now, and would improve usability a great deal, e.g. one wouldn't need to figure out how to get the hat on <e> in <ekwa>.

aarppe commented 1 month ago

Some of the other regexps are fancier, in that they make use of context, but those could be turned into simple regexps without context, e.g. 0 (->) h || [ a | â | ê | i | î | o | ô ] _ [ c | k | m | n | ... ] and h (->) 0 || [ a | â | ê | i | î | o | ô ] _ [ c | k | m | n | ... ] could be turned into âc -> [âhc|âc], ... and âhc -> [âhc|âc], and so forth.
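One way to read the pair of rules above is that an h between one of the listed vowels and one of the listed consonants is simply optional. A minimal Python sketch of that reading follows; `relax_h` and the constants are hypothetical names, and the consonant set contains only the four letters spelled out above, since the original list is elided ("...").

```python
import re
import unicodedata

# Left context: the vowels listed in the rules above.
VOWELS = unicodedata.normalize("NFC", "aâêiîoô")
# Right context: only the consonants spelled out above; the source list
# continues with "...", so this set is known to be incomplete.
CONSONANTS = "ckmn"

# A listed vowel, an optional h, then (as lookahead) a listed consonant.
H_BEFORE_CONSONANT = re.compile(f"([{VOWELS}])h?(?=[{CONSONANTS}])")


def relax_h(query: str) -> str:
    """Make h optional between a listed vowel and a listed consonant."""
    # Assumption: the query is a plain word without regexp metacharacters.
    normalized = unicodedata.normalize("NFC", query)
    return H_BEFORE_CONSONANT.sub(r"\1h?", normalized)
```

Both `âc` and `âhc` come out as `âh?c`, which is a compact equivalent of the `[âhc|âc]` alternatives above. If this were combined with a diacritic relaxation, it would probably have to run first, while the vowels are still plain characters.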

fbanados commented 1 month ago

I've updated the issue description with the current regexps that https://korp.altlab.dev uses in simple search. This way, when diacritic-insensitive is set, you obtain results:

[Four screenshots, taken 2024-09-20 at 3:41:34 PM, 3:41:42 PM, 3:41:48 PM and 3:40:02 PM]

aarppe commented 1 month ago

For Tsuut'ina, the relaxations would concern accents on vowels (as well as the lower half circle to denote the neutral tone), i.e. a|á|à|ă, the l with/without the bar, i.e. l|ƚ, and exponents of the glottal stop, i.e. '||ʔ|h|?.
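The Tsuut'ina relaxations listed above could be expressed as a small table of variant groups, sketched below in Python. Several things here are assumptions rather than anything stated in the thread: the names `VARIANT_GROUPS`, `OPTIONAL_GROUPS` and `relax_tsuutina` are invented, only the `a` vowel group is included because it is the only one spelled out, and the glottal stop is treated as optional (following the earlier remark about making ' optional).

```python
import re
import unicodedata

# Variant groups taken from the comment above. Other vowels would need
# analogous entries; they are not listed in the thread, so they are omitted.
VARIANT_GROUPS = [
    ["a", "á", "à", "ă"],      # tone marks on a
    ["l", "ƚ"],                # l with and without the bar
    ["'", "ʔ", "h", "?"],      # spellings of the glottal stop
]
# Assumption: a typed glottal stop may also correspond to nothing at all.
OPTIONAL_GROUPS = {2}


def build_table(groups, optional):
    """Map each variant to a non-capturing alternation over its group."""
    table = {}
    for index, group in enumerate(groups):
        members = [unicodedata.normalize("NFC", m) for m in group]
        # re.escape matters here because "?" is a regexp metacharacter.
        pattern = "(?:" + "|".join(re.escape(m) for m in members) + ")"
        if index in optional:
            pattern += "?"
        for member in members:
            table[member] = pattern
    return table


TABLE = build_table(VARIANT_GROUPS, OPTIONAL_GROUPS)


def relax_tsuutina(query: str) -> str:
    """Replace each listed variant in a plain-word query with its group pattern."""
    normalized = unicodedata.normalize("NFC", query)
    return "".join(TABLE.get(ch, re.escape(ch)) for ch in normalized)
```

A caveat with this reading: every typed `h` and `?` is treated as a possible glottal stop, which may be broader than intended, and a glottal stop the user left out of the query cannot be recovered by character substitution alone.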

aarppe commented 1 month ago

I'm seeing the tick-box, but not the label diacritic-insensitive.

fbanados commented 1 month ago

Can you reload the page? It might have been an issue while I was recreating the frontend. Let me know if you still cannot see the label.

aarppe commented 1 month ago

I tried the regular reload, and am getting the following:

image

but once I guessed that command-shift-R is 'force reload', I got the label as well:

image