divvun / divvun-gramcheck-web

Grammar checker for web word processors, targeted at minority and indigenous languages, but open for everyone.
GNU General Public License v3.0

SMN document causes check to die or time out #91

Open snomos opened 3 months ago

snomos commented 3 months ago

A 21-page document (sent off-line for privacy reasons) causes the Google Docs plugin to die with the message "ScriptError: Oversteg maksmal kjøretid" (Norwegian for "Exceeded maximum execution time", i.e. a time-out) after about 4-5 minutes. The Word plugin simply gives up and reports no errors, after a much shorter time.

Running the document (as plain text) through the command line checker locally takes about 1.5 minutes and returns several hundred error messages (some empty). That is, it works on the command line; it just takes some time.

snomos commented 3 months ago

I tested the API server for the smn end point as follows:

 curl -X POST -H 'Content-Type: application/json' -i 'https://api-giellalt.uit.no/grammar/smn' --data '{"text": "Danne lea."}' | grep text | jq .                       
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   451  100   429  100    22   1097     56 --:--:-- --:--:-- --:--:--  1156

which returned the following:

{
  "text": "Danne lea.",
  "errs": [
    {
      "error_text": "Danne",
      "start_index": 0,
      "end_index": 5,
      "error_code": "typo",
      "description": "Sääni \"Danne\" váilu tivvoomohjelm sänilistoost.",
      "suggestions": [
        "Janne",
        "Sanne",
        "Lanne"
      ],
      "title": "Časkemfeilâ"
    },
    {
      "error_text": "lea",
      "start_index": 6,
      "end_index": 9,
      "error_code": "typo",
      "description": "Sääni \"lea\" váilu tivvoomohjelm sänilistoost.",
      "suggestions": [
        "lii",
        "lâi"
      ],
      "title": "Časkemfeilâ"
    }
  ]
}

That is, the endpoint works.
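For reference, the response above can be read back against the submitted text using `start_index` and `end_index`. The following is a minimal Python sketch, not code from the plugins. The function name `error_spans` is made up for illustration, and it assumes the indices can be used directly as Python string offsets. That holds for this ASCII-only sample, but the offset unit for non-ASCII text is not confirmed here.

```python
def error_spans(response):
    """Return (start, end, flagged_text, suggestions) for each reported error.

    Assumption: start_index/end_index are usable as Python str offsets.
    """
    text = response["text"]
    spans = []
    for err in response["errs"]:
        start, end = err["start_index"], err["end_index"]
        spans.append((start, end, text[start:end], err["suggestions"]))
    return spans


# Trimmed copy of the response shown above (description and title omitted).
response = {
    "text": "Danne lea.",
    "errs": [
        {"error_text": "Danne", "start_index": 0, "end_index": 5,
         "error_code": "typo", "suggestions": ["Janne", "Sanne", "Lanne"]},
        {"error_text": "lea", "start_index": 6, "end_index": 9,
         "error_code": "typo", "suggestions": ["lii", "lâi"]},
    ],
}

for start, end, flagged, suggestions in error_spans(response):
    print(f"{start}-{end}: {flagged!r} -> {suggestions}")
```

In this sample the sliced text matches each entry's `error_text`, which is a quick sanity check that the offsets line up.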

I then tried the same with a JSON-ified version of the document as the request text, but that was rejected immediately with the following response:

HTTP/2 413 
content-type: text/plain; charset=utf-8
date: Wed, 22 May 2024 07:48:33 GMT
server: Caddy
content-length: 40

Json payload size is bigger than allowed

zoomix commented 3 months ago

I had a look, and basically we'd need to boost the heck out of the grammar checker's performance. It currently uses one thread, which... doesn't scale well.

Each document is split into chunks and each chunk is grammar-checked individually (*). A chunk is basically a paragraph. A 21-page document is gonna have a lot of paragraphs. Each non-empty paragraph is going to take some time to grammar check. (**)
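To make the chunking concrete, here is a rough Python sketch of the idea. It is not the plugin code. It assumes a chunk is one line-break-separated paragraph, and `split_into_chunks`, `check_document` and `fake_check` are hypothetical names. The real check would be an HTTP call to the grammar endpoint. Here it is a stand-in so the sketch runs on its own.

```python
import time


def split_into_chunks(document):
    """Split on line breaks and drop empty paragraphs (assumed chunking rule)."""
    return [p.strip() for p in document.splitlines() if p.strip()]


def check_document(document, check_chunk):
    """Check every chunk one after another.

    Returns (results, total_seconds). Because the calls run sequentially,
    the wait is the sum of the time spent on each chunk.
    """
    results = []
    started = time.perf_counter()
    for chunk in split_into_chunks(document):
        results.append(check_chunk(chunk))
    return results, time.perf_counter() - started


def fake_check(chunk):
    # Stand-in for the real request; always reports no errors.
    return {"text": chunk, "errs": []}


results, seconds = check_document("First paragraph.\n\nSecond paragraph.", fake_check)
print(len(results), "chunks checked")
```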

On the client side, from the perspective of both Google Docs and MS Word, you initiate one call, "doGrammarCheck()". That one call spawns all the other calls, so the time you wait is the grand total of all calls.

That sucks.

And even if it didn't, 21 pages worth of grammar results would be a nightmare to deal with.

I don't think it makes sense to rewrite the grammar-checker on the backend to be multithreaded, because the frontend is never going to be. If we could make it 10 times quicker on a single thread, that would be great.

We COULD rewrite the frontend to use an iterative approach. That is, you start a grammar checker and it checks one paragraph at a time and stops when it finds an error. You can then choose to "ignore" the error and continue grammar checking your next paragraph, or you can fix it yourself. That way you'd only wait for the time it takes to check one paragraph which is way way quicker. And you don't have to deal with an infinite scroll of grammar errors.
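A rough sketch of what that iterative loop could look like, in Python for brevity (the actual frontends are a Google Docs and a Word add-in, so this is illustrative only). The names `check_until_error` and `fake_check` are hypothetical. The fake checker simply flags any paragraph containing "lea" so that the example runs without a server.

```python
def check_until_error(paragraphs, check_chunk, start=0):
    """Check paragraphs from index `start` and stop at the first one with errors.

    Returns (index, errs) for that paragraph, or (None, []) if the rest is clean.
    """
    for index in range(start, len(paragraphs)):
        if not paragraphs[index].strip():
            continue  # nothing to check in an empty paragraph
        errs = check_chunk(paragraphs[index])["errs"]
        if errs:
            return index, errs
    return None, []


def fake_check(chunk):
    # Stand-in for the real request: flags paragraphs that contain "lea".
    errs = [{"error_text": "lea"}] if "lea" in chunk else []
    return {"text": chunk, "errs": errs}


paragraphs = ["Fine.", "", "Danne lea.", "Fine too.", "lea"]
index, errs = check_until_error(paragraphs, fake_check)
while index is not None:
    print("paragraph", index, "has", len(errs), "error(s)")
    # The user fixes or ignores the error, then checking resumes after it.
    index, errs = check_until_error(paragraphs, fake_check, start=index + 1)
```

With this shape, the user only ever waits for the paragraphs between one reported error and the next, not for the whole document.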

Improving grammar checker performance still leaves us with a long-ass list of grammar errors. Rewriting the frontend to be iterative is... the way to go, really.

But ... it isn't exactly what you would call a quick fix.

(*) We can't send too big chunks of text to the grammar checker because we can't deal with the response.

(**) See the request time (the last number, in seconds) from the access log:

[2024-05-22T12:43:05Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 2935 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 9.941739
[2024-05-22T12:43:05Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 21 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 0.000605
[2024-05-22T12:43:24Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 21 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 0.000567
[2024-05-22T12:43:29Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 501 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 4.933951
[2024-05-22T12:43:51Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 4526 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 21.499002
[2024-05-22T12:43:57Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 1588 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 5.429406
[2024-05-22T12:44:05Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 2523 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 8.414829
[2024-05-22T12:45:03Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 5628 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 57.218160
[2024-05-22T12:45:41Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 21 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 0.000810
[2024-05-22T12:45:41Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 21 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 0.000521
[2024-05-22T12:45:56Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 3370 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 14.740103
[2024-05-22T12:46:26Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 2800 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 15.969536
[2024-05-22T12:46:54Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 5682 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 18.090149
[2024-05-22T12:46:54Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 21 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 0.000577
[2024-05-22T12:47:38Z INFO  actix_web::middleware::logger] 94.254.70.114 "POST /grammar/smn HTTP/1.1" 200 9660 "-" "Mozilla/5.0 (compatible; Google-Apps-Script)" 43.505563
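
The request times in the log above can be totalled with a few lines of Python. This is only a sketch. It assumes the duration is always the last whitespace-separated field on the line, as in the excerpt, and the function names are made up for illustration.

```python
def request_seconds(log_line):
    """Return the last whitespace-separated field of a log line as seconds."""
    return float(log_line.rsplit(None, 1)[-1])


def total_request_seconds(log_lines):
    """Sum the request times of all non-empty log lines."""
    return sum(request_seconds(line) for line in log_lines if line.strip())


# First two lines of the excerpt above.
sample = [
    '[2024-05-22T12:43:05Z INFO  actix_web::middleware::logger] 94.254.70.114 '
    '"POST /grammar/smn HTTP/1.1" 200 2935 "-" '
    '"Mozilla/5.0 (compatible; Google-Apps-Script)" 9.941739',
    '[2024-05-22T12:43:05Z INFO  actix_web::middleware::logger] 94.254.70.114 '
    '"POST /grammar/smn HTTP/1.1" 200 21 "-" '
    '"Mozilla/5.0 (compatible; Google-Apps-Script)" 0.000605',
]
print(f"{total_request_seconds(sample):.6f} seconds")
```

Feeding it the full excerpt gives the total time the client spent waiting on these sequential requests.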

snomos commented 3 months ago

Ok. Thanks for the analysis. The simple solution for users right now is then to check smaller sections of a document at a time: copy a portion of the text to another document, check and correct it, and copy it back to the original document. I will pass this on to the person who reported the bug.

The grammar checker code is being rewritten in a separate project by Brendan and co. The goal of that project is to make a version that can run stand-alone on PCs and Macs (and possibly iPhones/iPads and Android systems). I expect the outcome to be a significantly faster grammar checker, at least because it uses a much faster speller engine (divvunspell instead of hfst-ospell; divvunspell is roughly 10x faster). We already know that the speller is the slowest part of the pipeline, mainly because it generates suggestions.

This is to say that: a) we have a stop-gap solution right now that we can inform users about (not ideal, but it works); and b) we won't make any changes to the grammar checker front end or back end until we have the new codebase running on the server. The release of the new grammar checker is planned for the last week of June.