divvun / divvun-gramcheck-web

Grammar checker for web word processors, targeted at minority and indigenous languages, but open for everyone.
GNU General Public License v3.0

Capital initial letter Á lost in correction #26

Closed Trondtr closed 2 years ago

Trondtr commented 3 years ago

In the picture below, the program corrects a spelling error correctly (I had failed to change final -u to -o in compounds). The problem is that the capital Á is not preserved in the suggestion, even though it was used correctly in the input (which is probably irrelevant).

image

Hmm, I now see hfst-ospell has the same behaviour (here with ex. from fao):

echo Spekkulera | hfst-ospell -S -n 3 tools/spellcheckers/fo.zhfst
"Spekkulera" is NOT in the lexicon:
Corrections for "Spekkulera":
spekulera 33.950195

So perhaps there is nothing to do, and this should be ignored. The expected behaviour would have been nice to have, though.

snomos commented 2 years ago

This has nothing to do with hfst-ospell, and it concerns all Sámi letters. There's another bug report about the same here: https://github.com/giellalt/lang-sme/issues/40.

snomos commented 2 years ago

It could also be related to https://github.com/divvun/libdivvun/issues/33. @unhammer - wdyt?

unhammer commented 2 years ago

I can't reproduce libdivvun#33 / lang-sme#40 in libdivvun. @Trondtr, is this still happening?

The back end looks right:


$ echo '. Álgukapihttalis lea' |divvun-checker -l se|jq .
{
  "errs": [
    [
      "Álgukapihttalis",
      2,
      17,
      "typo",
      "Ii leat sátnelisttus",
      [
        "Álgokapihttalis",
        "Olgokapihttalis",
        "Čogukapihttalis",
        "Gálgukapihttalis",
        "Áltukapihttalis",
        "Álukapihttalis",
        "Lágukapihttalis",
        "Golgukapihttalis",
        "Illukapihttalis",
        "Mulgukapihttalis"
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": ". Álgukapihttalis lea"
}

@snomos is there a curl command or something similar for testing the server that gdocs uses?

snomos commented 2 years ago

Yes, there is: https://divvun.github.io/divvun-api/index.html

For example:

curl -X POST -H 'Content-Type: application/json' -i 'https://api-giellalt.uit.no/grammar/se' \
--data '{"text": "Danne lea politijuristtaide eanemus praktihkkalaččat vuogas dan dahkat Čáhcesullos."}' \
2>/dev/null | grep '{' | jq .
{
  "text": "Danne lea politijuristtaide eanemus praktihkkalaččat vuogas dan dahkat Čáhcesullos.",
  "errs": [
    {
      "error_text": "politijuristtaide",
      "start_index": 10,
      "end_index": 27,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "politiijajuristtaide"
      ],
      "title": "Čállinmeattáhus"
    },
    {
      "error_text": "praktihkkalaččat",
      "start_index": 36,
      "end_index": 52,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "praktihkalaččat",
        "praktihkalat",
        "praktihkalet",
        "praktihkalit",
        "praktihkalut"
      ],
      "title": "Čállinmeattáhus"
    }
  ]
}

snomos commented 2 years ago

@Trondtr it would be very helpful if you could attach the original text as text, not just a picture of it. Preferably the whole paragraph. That makes it easy to reproduce the error 🙂

snomos commented 2 years ago

@unhammer see also https://github.com/divvun/divvun-api

unhammer commented 2 years ago

OK, so something differs between the API (gives lower-case á) and the Debian packages of libdivvun/giella-sme (give upper-case Á):

$ curl -Ss -X POST -H 'Content-Type: application/json' \
'https://api-giellalt.uit.no/grammar/se' --data '{"text": ". Álgukapihttalis lea"}' | jq 
{
  "text": ". Álgukapihttalis lea",
  "errs": [
    {
      "error_text": "Álgukapihttalis",
      "start_index": 2,
      "end_index": 17,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "álgokapihttalis",
        "olgokapihttalis",
        "čogukapihttalis",
        "gálgukapihttalis",
        "áltukapihttalis",
        "álukapihttalis",
        "lágukapihttalis",
        "golgukapihttalis",
        "illukapihttalis",
        "mulgukapihttalis"
      ],
      "title": "Čállinmeattáhus"
    }
  ]
}
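
A mismatch like this can be spotted mechanically. Below is a minimal sketch, assuming only the response shape shown above (`errs`, `error_text`, `suggestions`); the function name `case_mismatches` is hypothetical and not part of divvun-api. It lists every suggestion that starts in lower case although the flagged word starts in upper case.

```python
from __future__ import annotations

import json


def case_mismatches(response: dict) -> list[tuple[str, str]]:
    """Return (error_text, suggestion) pairs where the flagged word starts
    with an upper-case letter but the suggestion does not."""
    mismatches = []
    for err in response.get("errs", []):
        word = err["error_text"]
        if not word[:1].isupper():
            continue  # nothing to preserve for lower-case input
        for suggestion in err.get("suggestions", []):
            if not suggestion[:1].isupper():
                mismatches.append((word, suggestion))
    return mismatches


# Trimmed copy of the API response shown above (two suggestions kept).
sample = json.loads("""
{
  "text": ". Álgukapihttalis lea",
  "errs": [
    {
      "error_text": "Álgukapihttalis",
      "start_index": 2,
      "end_index": 17,
      "error_code": "typo",
      "suggestions": ["álgokapihttalis", "olgokapihttalis"]
    }
  ]
}
""")

for word, suggestion in case_mismatches(sample):
    print(f"{word!r} -> {suggestion!r}")
```

Feeding it the JSON from the Debian packages instead should give an empty list, since those suggestions keep the capital.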
snomos commented 2 years ago

Very strange. It should of course be the same. The API version uses the nightly packages from Tino, so I have a hard time seeing where the difference comes from. @bbqsrc ?

unhammer commented 2 years ago

What's LANG/LC_ALL set to on server @bbqsrc ?

Trondtr commented 2 years ago

I didn't notice this discussion, but yes, Á is still corrected to á, with the text

"Nie dat lea. Álgukapihttalis lea dehálaš poeaŋga."

with GrammarChecker in MS Word here on my Mac. But you can see this yourselves with the curl command. A separate matter is that I honestly don't know how this is supposed to be done, given that we have no mechanism for generating a word form with an initial capital.

hfst-ospell -S -n 5 tools/spellcheckers/nb.zhfst 
Ortograffi
"Ortograffi" is NOT in the lexicon:
Corrections for "Ortograffi":
ortografi    25.927221

Others do have one, though: the spellers built on our FSTs behave as expected in LibreOffice.

snomos commented 2 years ago

@Trondtr hfst-ospell has no built-in handling of upper and lower case, only what the FST itself can handle. So there is no point in testing with that command for this kind of thing. hfst-ospell-office and divvunspell both have this functionality built in.
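
To illustrate what such built-in case handling amounts to, here is a rough sketch. It is not the code of hfst-ospell-office or divvunspell; `suggest_with_case` and `toy_speller` are hypothetical names, and the toy speller stands in for a case-blind FST lookup. The idea is to lower-case the initial letter before lookup and re-capitalise the suggestions afterwards, using Unicode-aware case conversion.

```python
from typing import Callable, List


def suggest_with_case(word: str, suggest: Callable[[str], List[str]]) -> List[str]:
    """Wrap a case-blind speller so that an initial capital survives.

    If `word` starts with an upper-case letter, look up the form with a
    lower-cased initial and capitalise the first letter of each suggestion.
    """
    if not word[:1].isupper():
        return suggest(word)
    lowered = word[:1].lower() + word[1:]
    return [s[:1].upper() + s[1:] for s in suggest(lowered)]


def toy_speller(word: str) -> List[str]:
    """Hypothetical stand-in for an FST lookup that only knows lower case."""
    table = {
        "álgukapihttalis": ["álgokapihttalis", "olgokapihttalis", "čogukapihttalis"],
    }
    return table.get(word, [])


print(suggest_with_case("Álgukapihttalis", toy_speller))
```

Python's `str.upper()` and `str.lower()` work on Unicode code points, so Á and Č are handled the same way as ASCII letters here.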

snomos commented 2 years ago

Here is the macOS version, which gives upper-case Á in the suggestions:

echo ". Álgukapihttalis lea" | divvun-checker -a tools/grammarcheckers/se.zcheck | jq
{
  "errs": [
    [
      "Álgukapihttalis",
      2,
      17,
      "typo",
      "Ii leat sátnelisttus",
      [
        "Álgokapihttalis",
        "Olgokapihttalis",
        "Golgukapihttalis",
        "Illukapihttalis",
        "Álukapihttalis",
        "Gálgukapihttalis",
        "Lágukapihttalis",
        "Áltukapihttalis",
        "Čogukapihttalis",
        "Hilgukapihttalis"
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": ". Álgukapihttalis lea"
}

snomos commented 2 years ago

Another example:

curl -Ss -X POST -H 'Content-Type: application/json' 'https://api-giellalt.uit.no/grammar/se' --data \
'{"text": "Čielgaseamus mearka dasa lea ránggáštupmi mii čađahuvvui dan vuostá gii oaččui luovos máná."}' \
| jq 
{
  "text": "Čielgaseamus mearka dasa lea ránggáštupmi mii čađahuvvui dan vuostá gii oaččui luovos máná.",
  "errs": [
    {
      "error_text": "Čielgaseamus",
      "start_index": 0,
      "end_index": 12,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "čielgasamos",
        "čielgaseamos"
      ],
      "title": "Čállinmeattáhus"
    },
    {
      "error_text": "ránggáštupmi",
      "start_index": 29,
      "end_index": 41,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "ráŋggáštupmi",
        "ráŋggáštumi",
        "ráŋggáštupmái",
        "ráŋggáštupmin",
        "ráiggášsupmi"
      ],
      "title": "Čállinmeattáhus"
    }
  ]
}

unhammer commented 2 years ago

@Trondtr I guess the LibreOffice version doesn't use divvun-api, but is fully offline?

Perhaps even more telling: if the first letter of a suggestion is ASCII, it gets capitalised (here I changed the first letter of the misspelled input word from Á to A):

$ curl -Ss -X POST -H 'Content-Type: application/json' 'https://api-giellalt.uit.no/grammar/se' --data '{"text": ". Algukapihttalis lea"}' |jq 
{
  "text": ". Algukapihttalis lea",
  "errs": [
    {
      "error_text": "Algukapihttalis",
      "start_index": 2,
      "end_index": 17,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "álgokapihttalis",
        "Olgokapihttalis",
        "Illukapihttalis",
        "Lágukapihttalis",
        "čogukapihttalis",
        "Golgukapihttalis",
        "áltukapihttalis",
        "álukapihttalis",
        "Gálgukapihttalis",
        "Mulgukapihttalis"
      ],
      "title": "Čállinmeattáhus"
    }
  ]
}

I tried running the Dockerfile of divvun-api, and manually changing env to C.UTF-8 does seem to fix it (works with it, doesn't work without it). I made a PR.
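
The pattern in the output above (O, I, L, G, M capitalised, á and č left alone) is what ASCII-only case conversion would produce. The sketch below only mimics that symptom; it is not libdivvun's code, and the assumption that the server was doing byte-wise or non-UTF-8-locale case conversion is a guess based on the locale change fixing it. Both function names are hypothetical.

```python
def capitalise_ascii_only(word: str) -> str:
    """Upper-case only the first byte, and only if it is ASCII.

    bytes.upper() touches ASCII letters only, so the lead byte of a
    multi-byte UTF-8 character such as á or č is left unchanged.
    """
    raw = word.encode("utf-8")
    return (raw[:1].upper() + raw[1:]).decode("utf-8")


def capitalise_unicode(word: str) -> str:
    """Upper-case the first code point using Unicode case mapping."""
    return word[:1].upper() + word[1:]


for suggestion in ["álgokapihttalis", "olgokapihttalis", "čogukapihttalis"]:
    print(capitalise_ascii_only(suggestion), capitalise_unicode(suggestion))
```

The first column reproduces the mixed result seen from the API before the fix; the second column matches what the Debian and macOS builds return.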

unhammer commented 2 years ago

@bbqsrc divvun-api is not yet updated on server? The above curl still shows lower-case suggestions

bbqsrc commented 2 years ago

@Eijebong needs to update it.

Eijebong commented 2 years ago

Updated.

> curl -Ss -X POST -H 'Content-Type: application/json' 'https://api-giellalt.uit.no/grammar/se' --data '{"text": ". Algukapihttalis lea"}' |jq

{
  "text": ". Algukapihttalis lea",
  "errs": [
    {
      "error_text": "Algukapihttalis",
      "start_index": 2,
      "end_index": 17,
      "error_code": "typo",
      "description": "Ii leat sátnelisttus",
      "suggestions": [
        "Álgokapihttalis",
        "Olgokapihttalis",
        "Illukapihttalis",
        "Lágukapihttalis",
        "Čogukapihttalis",
        "Golgukapihttalis",
        "Áltukapihttalis",
        "Álukapihttalis",
        "Gálgukapihttalis",
        "Mulgukapihttalis"
      ],
      "title": "Čállinmeattáhus"
    }
  ]
}