Tokenisation issues in LibreOffice

divvun / divvunspell-libreoffice

LibreOffice extension for divvunspell

https://extensions.libreoffice.org/en/extensions/show/27383

Apache License 2.0


Tokenisation issues in LibreOffice #3

Open snomos opened 1 year ago

snomos commented 1 year ago

There seem to be tokenisation issues for languages with Cyrillic letters, cf. the following bug report:

https://github.com/giellalt/lang-myv/issues/3

rueter commented 1 year ago

There is an update to giellalt/lang-myv#3

I am using an M2 Mac running Ventura 13.0.1 with LibreOffice. The language is Erzya (myv):
Version: 7.3.4.2 / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 8; OS: Mac OS X 13.0.1; UI render: default; VCL: osx
Locale: myv-RU (myv_FI.UTF-8); UI: en-US
Calc: threaded

There are problems with the full stop ‹.› and ‹...› touching a previous word. The comma, question mark, exclamation mark, quotation marks, parentheses, semicolons and colons do NOT cause a problem.

rueter commented 1 year ago

Meadow Mari (mhr) also has a problem with a full stop touching words. They are recognized.

bbqsrc commented 1 year ago

Yup, thanks for confirming further. Working on a fix. 😄

rueter commented 1 year ago

sms has the same problem

rueter commented 1 year ago

THIS issue does not seem to be one affecting lut. I have drawn hair lines next to accepted words, next to which I have added full stops. The speller accepts them. (Lushootseed has other problems)

Trondtr commented 1 year ago

Just a reminder: This is actually a nasty bug (since almost all sentences end in a period), and it seems to happen for all languages. Here is my sme. Note the three individual periods after "Juo" compared to the horizontal ellipsis following "Na" (which works):

I should have dropped the easter egg... There is a red line under "buorre" followed by a dot there.

snomos commented 1 year ago

@bbqsrc has looked briefly into this issue, and it seems to be buried deep in the LO code. There was a similar issue with the MS Office speller, and that was fixed. The assumption for now is thus that divvunspell is clean in this regard, and that the issue is elsewhere, i.e. within the LO integration code or within LO itself. LO is a huge mess of code, mixing Python, Java and C++; one should not be surprised that there are bugs when it comes to not-so-standard Unicode text handling 🙂
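
The symptom reported above (a word is flagged when a full stop or three individual periods touch it, but not when a comma or the single-character ellipsis follows) would be consistent with the word reaching the speller with the trailing dots still attached. The actual cause has not been established in this thread, so the following is only a minimal sketch of that hypothesis and of a possible guard in an integration layer. All names here (`LEXICON`, `is_correct`, `strip_trailing_dots`, `check_token`) are hypothetical and are not part of divvunspell or LibreOffice.

```python
# Hypothetical illustration only: none of these names exist in
# divvunspell or LibreOffice.

# Toy word list standing in for a real speller lexicon.
LEXICON = {"buorre", "Juo", "Na"}


def is_correct(word: str) -> bool:
    """Toy speller lookup: exact match against the word list."""
    return word in LEXICON


def strip_trailing_dots(token: str) -> str:
    """Remove ASCII full stops from the end of a token.

    A token consisting only of dots is returned unchanged.
    """
    stripped = token.rstrip(".")
    return stripped if stripped else token


def check_token(token: str) -> bool:
    """Accept a token if it is correct as-is or without trailing dots."""
    return is_correct(token) or is_correct(strip_trailing_dots(token))
```

The token is tried as-is first so that entries that legitimately end in a full stop (abbreviations, for instance) would still match if the lexicon contains them; only then is the dot-stripped form tried. With the toy lexicon, `is_correct("buorre.")` fails, which mirrors the red line reported under "buorre", while `check_token("buorre.")` and `check_token("Juo...")` succeed.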