atom / spell-check

Spell check Atom package
MIT License
encoding problem with german umlauts in spell-check-test #161

Open stesachse opened 7 years ago

stesachse commented 7 years ago

there is something wrong with the encoding. the misspelled word is marked correctly. for me this is more than the original spell-check package has ever done. so thanks for the work :) but in the correction view all "umlauts" are wrongly encoded and shown as a question mark. i can select any entry to replace the misspelled word, but the replacement also contains the wrongly encoded "umlauts".

i have tried to decode this as utf8 or iso-8859-15 but the result is always garbage. here is what perl says. but maybe copy&paste doesn't work at all for this broken string.

% perl -MData::Dump -E 'dd(q{der K�nig hat});'
"der K\xEF\xBF\xBDnig hat"

but perhaps this is correct, because the string stays the same while doing the following: copying the string out of the editor into the console, running the perl one-liner and copying it back into the editor.
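
For readers following along: the bytes `\xEF\xBF\xBD` in the Perl output are the UTF-8 encoding of U+FFFD, the Unicode replacement character. That suggests the original umlaut byte was already lost before the string was copied, which would explain why decoding it as UTF-8 or ISO-8859-15 only gives garbage. A minimal shell sketch to check this, assuming `iconv` and `od` are installed (`utf8_to_codepoints` is a hypothetical helper name, not part of any tool mentioned here):

```shell
#!/bin/sh
# utf8_to_codepoints (hypothetical helper): print the UTF-16BE code units
# of a UTF-8 byte sequence as hex, to see which character the bytes encode.
# $1 is a printf format string with octal escapes, e.g. '\357\277\275'.
utf8_to_codepoints() {
    printf "$1" | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n'
}

# \357\277\275 is octal for the bytes EF BF BD reported by Perl.
utf8_to_codepoints '\357\277\275'   # prints fffd (U+FFFD)
echo
# For comparison, a correctly encoded "ö" (bytes C3 B6):
utf8_to_codepoints '\303\266'       # prints 00f6 (U+00F6)
echo
```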

the dev-tools shows no error messages

before correction: [screenshot: atom-spell-check-test-encoding-problem-before-correction-20161009t051402 727z]

after correction: [screenshot: atom-spell-check-test-encoding-problem-after-correction-20161009t051601 095z]

% uname -r
4.7.5-200.fc24.x86_64

% lsb_release -s -d
"Fedora release 24 (Twenty Four)"

% localectl
   System Locale: LANG=de_DE.UTF-8
       VC Keymap: de-nodeadkeys
      X11 Layout: de
     X11 Variant: nodeadkeys

% rpm -q hunspell{,-{en{,-US,-GB},de}}
hunspell-1.3.3-10.fc24.x86_64
hunspell-en-0.20140811.1-5.fc24.noarch
hunspell-en-US-0.20140811.1-5.fc24.noarch
hunspell-en-GB-0.20140811.1-5.fc24.noarch
hunspell-de-0.20151222-4.fc24.noarch

% atom -v
Atom    : 1.10.2
Electron: 0.37.8
Chrome  : 49.0.2623.75
Node    : 5.10.0

% apm list -p -b | grep -- ^spell-check
spell-check@0.67.1
spell-check-test@0.77.5

here is the spell-check config

"spell-check-test":
  localePaths: [
    "/usr/share/myspell"
  ]
  locales: [
    "de-DE"
    "en-US"
  ]

laniley commented 7 years ago

I have a similar problem. The german Umlaute are encoded correctly, but they cannot be matched to my text when I add them to the list of known words. They still show up as misspelled.

[screenshot: text]

[screenshot: knownwords]

krzysieqq commented 7 years ago

I have the same problem with the Polish language on Linux Mint 18.3. [screenshot: atom_spell]

Qrizzz commented 7 years ago

I had the same problem and solved it by saving the .aff and .dic files with utf-8 encoding.

matrixik commented 6 years ago

I can confirm what @Qrizzz found. Converting both files to UTF-8 with enca fixed this problem.
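
Before converting anything, it can help to check whether a given dictionary file is already valid UTF-8. A minimal sketch, assuming `iconv` is available; `is_valid_utf8` and `report_encoding` are hypothetical helper names:

```shell
#!/bin/sh
# is_valid_utf8 (hypothetical helper): succeeds (exit status 0) only if
# the whole file decodes as UTF-8. Pure ASCII files also pass.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# report_encoding (hypothetical helper): print one line per file argument.
report_encoding() {
    for f in "$@"; do
        if is_valid_utf8 "$f"; then
            echo "$f: valid UTF-8"
        else
            echo "$f: not UTF-8 (needs converting)"
        fi
    done
}
```

It could be called as `report_encoding /usr/share/hunspell/de_DE.aff /usr/share/hunspell/de_DE.dic`; the paths are only examples and depend on the distribution.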

salim-b commented 6 years ago

@Qrizzz wrote: "I had the same problem and solved it by saving the .aff and .dic files with utf-8 encoding."

That works indeed!

But it might not be immediately clear what one has to do exactly to achieve this. Therefore the following step-by-step workaround for (Swiss) German Linux/Ubuntu users (type all the commands into a standard terminal):

  1. List all the system-wide installed hunspell dictionaries:
    hunspell -D
    You might have to install the package hunspell beforehand.

  2. Create a new directory for the custom dictionary files with UTF-8 encoding; I'd recommend:
    mkdir ~/.atom/custom_spellchecker_dictionaries/
    Of course you could also directly convert the original dictionary files. But since I don't know what potential side effects in conjunction with other programs that could have, I wouldn't recommend it.

  3. Create UTF-8 versions of all the relevant dictionary files in the new directory. Example for (Swiss) German and English dictionaries:

    iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/de_CH.aff | sed 's/^SET ISO8859-1$/SET UTF-8/g' > ~/.atom/custom_spellchecker_dictionaries/de_CH.aff
    iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/de_CH.dic > ~/.atom/custom_spellchecker_dictionaries/de_CH.dic
    iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/de_DE.aff | sed 's/^SET ISO8859-1$/SET UTF-8/g' > ~/.atom/custom_spellchecker_dictionaries/de_DE.aff
    iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/de_DE.dic > ~/.atom/custom_spellchecker_dictionaries/de_DE.dic
    iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/en_US.aff | sed 's/^SET ISO8859-1$/SET UTF-8/g' > ~/.atom/custom_spellchecker_dictionaries/en_US.aff
    iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/en_US.dic > ~/.atom/custom_spellchecker_dictionaries/en_US.dic

    You might have to adjust the paths to the output of hunspell -D; the above are valid in Ubuntu 16.04 LTS.

  4. Set the path to the new folder in the option Locale Paths of the Atom spell-check package (note that you can only use absolute paths, so no ~ shortcut). If you followed the recommendation under 2), the path would be: /home/USERNAME/.atom/custom_spellchecker_dictionaries/.

  5. Restart Atom.
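
Steps 2 and 3 above can also be wrapped in a small function instead of one command per file. This is only a sketch under assumptions: `convert_dictionaries` is a hypothetical helper name, the source encoding is passed in by the caller, and unlike the `sed` commands above it rewrites any `SET` line rather than only `SET ISO8859-1`.

```shell
#!/bin/sh
# convert_dictionaries (hypothetical helper wrapping steps 2 and 3).
# Usage: convert_dictionaries SRC_DIR DEST_DIR SRC_ENCODING LOCALE...
convert_dictionaries() {
    src_dir=$1; dest_dir=$2; src_enc=$3
    shift 3
    mkdir -p "$dest_dir" || return 1
    for loc in "$@"; do
        # Re-encode the affix file into a temporary file first, so that an
        # iconv failure is noticed before the SET line is rewritten.
        iconv -f "$src_enc" -t UTF-8 "$src_dir/$loc.aff" \
            > "$dest_dir/$loc.aff.tmp" || return 1
        # Point the SET line at UTF-8; LC_ALL=C keeps sed byte-oriented.
        LC_ALL=C sed 's/^SET .*$/SET UTF-8/' "$dest_dir/$loc.aff.tmp" \
            > "$dest_dir/$loc.aff" || return 1
        rm -f "$dest_dir/$loc.aff.tmp"
        # The word list only needs re-encoding.
        iconv -f "$src_enc" -t UTF-8 "$src_dir/$loc.dic" \
            > "$dest_dir/$loc.dic" || return 1
    done
}
```

With the paths from the steps above, the call would look like `convert_dictionaries /usr/share/hunspell "$HOME/.atom/custom_spellchecker_dictionaries" ISO-8859-1 de_CH de_DE en_US`. The original system dictionaries are left untouched.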

BTW: This issue was opened over a year ago. Why hasn't it been fixed yet? I guess spell-check should just read the .dic and .aff files in their correct encoding and everything would be fine, right? As the Chromium documentation suggests, Atom could

search in the .aff file for the line that begins with "SET" to see which character set it uses.
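
To illustrate the idea of reading the `SET` line, here is a minimal shell sketch, not how spell-check or Chromium actually do it. `aff_encoding` is a hypothetical helper name, and the fallback to ISO8859-1 when no `SET` line is present is an assumption about Hunspell's default.

```shell
#!/bin/sh
# aff_encoding (hypothetical helper): print the character set declared by
# the first SET line of a Hunspell .aff file.
aff_encoding() {
    # LC_ALL=C makes awk treat the file as raw bytes, so non-UTF-8 content
    # elsewhere in the file does not get in the way. A trailing carriage
    # return (CRLF line endings) is stripped from the value.
    enc=$(LC_ALL=C awk '$1 == "SET" { sub(/\r$/, "", $2); print $2; exit }' "$1")
    # Assumption: fall back to ISO8859-1 when the file has no SET line.
    echo "${enc:-ISO8859-1}"
}
```

The result could then be passed to `iconv -f`, for example `iconv -f "$(aff_encoding de_DE.aff)" -t UTF-8 de_DE.dic`. Not every charset name that appears in .aff files is necessarily a name that `iconv` accepts, so a real implementation would probably need a small mapping table.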

gesinn-it-gea commented 6 years ago

I have the same issue as @laniley: words with umlauts that have been added to the list of known words are still not recognized.

kalsan commented 6 years ago

Same thing under Arch Linux (64 bit), Atom 1.23.2 x64, Spell Check Package 0.73.3

schneiderfelipe commented 6 years ago

I also had this problem under Ubuntu 17.10 (Atom 1.23.3) and what @salim-b recommended above worked perfectly. Thanks!

mhoff commented 6 years ago

Same for Ubuntu 17.10 (Atom 1.24.1); the solution proposed by @salim-b works.

ghost commented 6 years ago

@salim-b's workaround doesn't work for me. Some suggestions are now completely ignored.

The original aff file was also encoded as ISO8859-1, but when converted to UTF-8 the spell checker interprets c and ç as the same letter.

Atom    : 1.26.0
Electron: 1.7.11
Chrome  : 58.0.3029.110
Node    : 7.9.0

kaefert commented 6 years ago

I'm on Linux Mint 18.3 (based on Ubuntu 16.04) and salim-b's workaround worked fine for me.

Although I have to say it would make for a much better experience if it "just worked" without having to apply a workaround yourself.

s-m-e commented 6 years ago

I can also confirm this behavior on openSUSE Leap 42.3. I am reluctant to just convert my dictionaries. Plenty of other software is relying on them. Besides, it will break after the next update ...

s-m-e commented 6 years ago

Tracing the issue back to its roots, it's likely originating in a dependency of spell-check: node-spellchecker. See issue 77 in this project for details.

aswolf commented 6 years ago

What is the status on this? I am running up against the problem where umlauts are rendered properly, but spell-check always thinks words with them are misspelled, despite being a part of the dictionary. For me it is the word Grüneisen, which unfortunately appears everywhere in my work.

Is there any chance of fixing this? I am running Atom on macOS 10.12.6. Thanks!

s-m-e commented 6 years ago

@aswolf The root cause of this issue has not been solved yet, see latest comment there. Looks like someone there could use some help from an experienced C++ coder with access to a Mac.

DorKeinath commented 5 years ago

Same problem. Solved with salim-b's workaround: https://github.com/atom/spell-check/issues/161#issuecomment-336653098