brown-uk / dict_uk

Project to generate POS tag dictionary for Ukrainian language
GNU General Public License v3.0

[hunspell dictionary] "iconv * 0" lines cause any Latin-characters-words to be treated as correctly spelled #306

Open er13 opened 2 years ago

er13 commented 2 years ago

iconv * 0 lines in distr/hunspell/header/affix_header.txt (from L12 to L73) cause any word consisting of Latin alphabet characters to be converted to a sequence of zeros, which in turn causes the original word to be treated as correctly spelled.

This behavior is incorrect - words consisting of Latin characters are not correct Ukrainian words and should thus be marked as misspelled. It is also very annoying in multi-language documents, as the Ukrainian dictionary effectively disables spell-checking for languages based on the Latin alphabet.

Could you (@arysin?) please

arysin commented 2 years ago

This was intentional. Ukrainian texts often contain lots of words in Latin (particularly English proper nouns, abbreviations and terms) and highlighting them creates a lot of noise without much value. The idea is for the Ukrainian spellchecker to concentrate on Ukrainian words that are misspelled. With this rule the spellchecker should still catch errors where Ukrainian words contain Latin letters (e.g. Latin «i» in «гранiтний»), which is a very common error.
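The effect described here can be illustrated with a small simulation. This is only a sketch, not hunspell itself: it assumes the affix header maps every ASCII Latin letter to "0" and that a string made up only of digits is accepted as a number. The word list and the names `ICONV_TABLE`, `TOY_DICTIONARY`, `apply_iconv` and `is_accepted` are hypothetical.

```python
import string

# Assumption: every ASCII Latin letter is converted to "0" on input,
# mirroring the "iconv * 0" lines discussed in this issue.
ICONV_TABLE = {letter: "0" for letter in string.ascii_letters}

# Hypothetical one-word stand-in for the Ukrainian word list.
# "\u0456" is the Cyrillic letter "і".
TOY_DICTIONARY = {"гран\u0456тний"}


def apply_iconv(word: str) -> str:
    """Apply the input conversion table one character at a time."""
    return "".join(ICONV_TABLE.get(ch, ch) for ch in word)


def is_accepted(word: str) -> bool:
    """Accept dictionary words and (by assumption) all-digit strings."""
    converted = apply_iconv(word)
    all_digits = converted != "" and all(ch in "0123456789" for ch in converted)
    return converted in TOY_DICTIONARY or all_digits
```

Under these assumptions a purely Latin word such as "speling" turns into a run of zeros and is accepted, while «гранiтний» with a Latin «i» turns into «гран0тний», which is neither a number nor a dictionary word and is therefore still flagged.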

er13 commented 2 years ago

Hmm, do all Ukrainians always write foreign-language words (used untranslated in Ukrainian) properly, without errors or typos? Wouldn't it be better to let the corresponding language dictionary spellcheck these words instead of marking them as properly spelled and effectively ignoring the errors in them?

As mentioned before, my actual use case is multi-language texts containing English, German, Russian, and Ukrainian words at the same time. This behavior of the Ukrainian dictionary causes English and German words to be marked as properly spelled (even if they are not), which is not very helpful.

Could you please think about your decision again and maybe change it? Thanks!

p.s. I would say Latin «i» instead of Ukrainian «і» in words like «гранiтний» would be perfectly caught also without these iconv-rules.

arysin commented 2 years ago

I am open to discussion. But this was a matter of practicality: a long time ago, when we first created this hunspell dictionary, we didn't have this option, and with mixed-language texts (e.g. on Wikipedia) it was extremely annoying to see everything in Latin marked red.

In general, there are two use cases for hunspell. The first is powerful text processors like LibreOffice/OpenOffice: there you can mark words with the appropriate language, and they will be checked with the appropriate hunspell dictionary, so the current behavior does not get in the way. The second is simple text editors: text fields in Firefox, plain text editors, and other open-source software. They usually don't operate on multiple languages, so you just want to concentrate on your main language without extra noise coming from words spelled in Latin.

When I work with big multilingual texts I usually use LibreOffice and LanguageTool. The first gives me a way to mark text chunks with the appropriate language, and the second provides grammar checking, which is much more powerful than a simple dictionary-based check.

If you can describe your case where this logic does not work, maybe we can come up with a solution. Alternatively, we could create a separate hunspell dictionary with this option off.

Sometimes I need to check dictionaries with Russian words in them, and the standard Russian hunspell does not work because these texts have accented characters. So I modify their hunspell dictionary to include the IGNORE option (of course I also have to convert it from koi8-r to utf-8, but it's worth the effort).
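The Russian-dictionary tweak mentioned above could look roughly like the following sketch. It rests on assumptions that the comment does not confirm: that the affix file declares its encoding with a SET line, and that the accent to skip is the combining acute accent U+0301. The function name `convert_aff_to_utf8` and its parameters are hypothetical, and the matching .dic file would need the same re-encoding.

```python
def convert_aff_to_utf8(raw: bytes, ignore_chars: str = "\u0301") -> str:
    """Re-encode a KOI8-R affix file as UTF-8 text and add an IGNORE line.

    Assumption: the file contains a "SET <encoding>" line, which is
    replaced by "SET UTF-8" followed by "IGNORE <ignore_chars>".
    """
    text = raw.decode("koi8_r")
    out = []
    for line in text.splitlines():
        if line.split()[:1] == ["SET"]:
            out.append("SET UTF-8")
            # Characters listed after IGNORE are skipped during checking.
            out.append(f"IGNORE {ignore_chars}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The returned string would then be written back with `encoding="utf-8"`.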

er13 commented 2 years ago

My main use case is Notepad++ with DSpellCheck plugin, i.e. a text editor without a special file format for text documents like .doc or .odt and thus no way to save the annotated language. The logic applied in the DSpellCheck plugin is the following:

I like this logic much more than the behavior of MS-Office/LibreOffice/OpenOffice forcing me to explicitly annotate each word with the language it is written in and effectively increasing my efforts needed to get the text spellchecked.

But even in LibreOffice/OpenOffice (your main use case) the proper way to do spellchecking, as you write yourself, is to annotate the words coming from languages other than Ukrainian with their original language, instead of marking the whole text as Ukrainian and expecting the Ukrainian dictionary to ignore the foreign words for you. This is kinda absurd: a dictionary intended to support spellchecking actively ignores spelling errors. The main reason for this behavior is, as you said, that it would otherwise produce a lot of noise, but from my point of view it was more your laziness: for you, the effort needed to annotate every foreign-language word was not worth the added value of having every foreign-language word properly spellchecked.

As for spellchecking in Firefox: people use multi-language spellchecking there too, see e.g. this bug report

By the way, neither English nor German dictionaries have the options to ignore the words written in Cyrillic or Greek or any other non-Latin alphabet.

arysin commented 2 years ago

Multilingual spellchecking was added there very recently, and this comment confirms the practicality of our original approach. Multilingual checking is still broken in Firefox and will only be fixed in 103. I'll consider adjusting this for the next release of Ukrainian hunspell dictionaries. I think the solution for your case is simple: open C:\Users\anrysi\AppData\Roaming\Notepad++\plugins\Config\Hunspell\, open uk_UA.aff, remove offending ICONV lines and adjust ICONV count.
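The manual edit suggested above could also be scripted. The following is a minimal sketch: it relies on the hunspell affix convention of an "ICONV <count>" line followed by "ICONV <pattern> <replacement>" entries, and it assumes the offending entries are exactly those that map a single ASCII letter to "0". The function name `strip_latin_iconv` is hypothetical, and the real uk_UA.aff may need different matching rules.

```python
def strip_latin_iconv(aff_text: str) -> str:
    """Drop 'ICONV <ascii letter> 0' entries and fix the ICONV count line."""
    kept = []
    header_index = None  # position of the "ICONV <count>" line in `kept`
    remaining = 0        # number of ICONV entries that survive

    for line in aff_text.splitlines():
        fields = line.split()
        if fields[:1] == ["ICONV"]:
            if header_index is None and len(fields) == 2 and fields[1].isdigit():
                header_index = len(kept)
                kept.append(line)  # rewritten once the new count is known
                continue
            if len(fields) == 3:
                pattern, replacement = fields[1], fields[2]
                is_latin_letter = (
                    len(pattern) == 1 and pattern.isascii() and pattern.isalpha()
                )
                if is_latin_letter and replacement == "0":
                    continue  # assumed "offending" entry: skip it
                remaining += 1
        kept.append(line)

    if header_index is not None:
        if remaining:
            kept[header_index] = f"ICONV {remaining}"
        else:
            del kept[header_index]  # no entries left, so drop the count line
    return "\n".join(kept) + "\n"
```

Keeping a backup of the original uk_UA.aff before overwriting it would be prudent, since a wrong count line can change how the remaining entries are read.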

er13 commented 2 years ago

I think the solution for your case is simple: open C:\Users\${user}\AppData\Roaming\Notepad++\plugins\Config\Hunspell, open uk_UA.aff, remove offending ICONV lines and adjust ICONV count.

Yeah, works perfectly.

I'll consider adjusting this for the next release of Ukrainian hunspell dictionaries.

Thanks a lot, looking forward to having it in the official release.

arysin commented 2 years ago

BTW the plugin seems to pull a pretty old version of hunspell_uk; the newest is here: https://github.com/brown-uk/dict_uk/releases/tag/v5.8.0. We may want to update their location with the latest version.

er13 commented 2 years ago

As far as I understand the DSpellCheck plugin has only one source for all dictionaries and that is LibreOffice. It simply doesn't support a separate source per dictionary, which is to be honest quite understandable.

Do you have a process for pushing updates of the Ukrainian dictionary to LibreOffice (see https://github.com/LibreOffice/dictionaries/commit/06a28cf2efe2e3fa887912989650dcaaf05f0958, https://github.com/LibreOffice/dictionaries/commit/cbda6f487f9b760acb906b8e1280fe009b0a3461), or do you expect the LibreOffice developers to pull the updates themselves from time to time?

In either case, thanks for pointing out the location of the most up-to-date Ukrainian dictionary. I can of course (and will) update it manually.

arysin commented 2 years ago

I upload a LibreOffice extension with each release here: https://extensions.libreoffice.org/en/extensions/show/ukrainian-spelling-dictionary-and-thesaurus. I suspect the developers of that page pull the updates from time to time (not sure where their primary source is though). So we may just need to ping them so they update it now.

er13 commented 2 years ago

Hmm, based on the commit log I would say LibreOffice (even 7.4.0.1, the most recent version as of now) unfortunately still contains 5.3.1 and not the 5.8.0 you mentioned.

Would be great if you could ping the LibreOffice developers and clarify the dictionary update process with them.

arysin commented 2 years ago

I've created https://github.com/LibreOffice/dictionaries/issues/41 ... and https://bugs.documentfoundation.org/show_bug.cgi?id=149980