Words with accents show as misspelled

ventolinmx commented 6 years ago

Prerequisites

[X] Put an X between the brackets on this line if you have done all of the following:
- Reproduced the problem in Safe Mode: http://flight-manual.atom.io/hacking-atom/sections/debugging/#using-safe-mode
- Followed all applicable steps in the debugging guide: http://flight-manual.atom.io/hacking-atom/sections/debugging/
- Checked the FAQs on the message board for common solutions: https://discuss.atom.io/c/faq
- Checked that your issue isn't already filed: https://github.com/issues?utf8=✓&q=is%3Aissue+user%3Aatom
- Checked that there is not already an Atom package that provides the described functionality: https://atom.io/packages

Description

In .md and .txt files, Spanish words with accents are shown as misspelled even though they are correct. Using aspell with the es-ES locale.

Steps to Reproduce

Type Spanish words with accents.
Save file as .md or .txt.
Activate Spanish locale.
Restart.
Open same file.

Expected behavior: Spell-check should recognize correct words with accents.

Actual behavior: Atom underlines all words with accents, although they are correct.

Reproduces how often: Always.

Versions

Atom: 1.23.3
Electron: 1.6.15
Chrome: 56.0.2924.87
Node: 7.4.0

apm 1.18.12
npm 3.10.10
node 6.9.5 x64
atom 1.23.3
python 2.7.13
git 2.11.0

Debian 9.

Additional Information

Tried checking the same file with aspell on the command line and it works fine. It recognizes words with accents as correct. Also tried different encodings.

lierdakil commented 6 years ago

More likely than not, it's a problem with your dictionary. Atom doesn't deal well with dictionaries that are not UTF-8 encoded.

dvictori commented 6 years ago

Probably related to #212?

@lierdakil: How can I obtain UTF-8 encoded dictionaries?

dmoonfire commented 6 years ago

@dvictori: I think #212 is definitely going to cause you problems even with a UTF-8 dictionary. There is a defect filed against node-spellchecker to fix that. Until that is resolved, I don't know if we can do much more.

lierdakil commented 6 years ago

@dvictori Just find some? e.g. https://github.com/wooorm/dictionaries

lierdakil commented 6 years ago

@dmoonfire, I don't have any issues described in #212. Gentoo Linux, Atom 1.28.0, LANG=ru_RU.UTF-8

I probably would if my de-DE dictionary was, say, cp1252-encoded.

dmoonfire commented 6 years ago

@lierdakil: I stand corrected. Does it show it spelled correctly if you have Löwen?

lierdakil commented 6 years ago

Yes, it does:

Additionally, I've tried converting my dictionary from UTF-8 to ISO-8859-1 (as is common with extended-Latin Hunspell dictionaries), and here's what I've got: it looks suspiciously similar to #212, I believe.

dmoonfire commented 6 years ago

Oh, I know why yours is behaving. I found that the .UTF-8 suffix fixes the problem. However, most people don't have that in their language settings, so it didn't pick it up correctly. So my LANG=en_US couldn't handle a UTF-8 dictionary either, because node-spellchecker didn't switch the locale() to UTF-8.

I suspect if you just had LANG=ru_RU it may misbehave.
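
To see which codeset the environment actually declares, the locale string can be split apart. The following is only a minimal Python sketch: the helper names `split_locale` and `effective_locale` are made up for illustration, and it says nothing about how node-spellchecker itself reads the locale. It follows the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG).

```python
import os


def split_locale(value):
    """Split a locale string such as 'ru_RU.UTF-8@mod' into (language, codeset).

    The codeset is None when the string carries no '.CODESET' part.
    """
    value = value.split("@", 1)[0]          # drop an optional @modifier
    language, _, codeset = value.partition(".")
    return language, (codeset or None)


def effective_locale(environ=None):
    """Return (language, codeset) from the first non-empty locale variable."""
    environ = os.environ if environ is None else environ
    # POSIX precedence for character handling: LC_ALL, LC_CTYPE, LANG.
    for key in ("LC_ALL", "LC_CTYPE", "LANG"):
        if environ.get(key):
            return split_locale(environ[key])
    return None, None


if __name__ == "__main__":
    print(effective_locale())
```

With LANG=ru_RU.UTF-8 the codeset comes back as UTF-8; with a bare LANG=ru_RU it comes back as None, which is the case suspected above.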

lierdakil commented 6 years ago

If I just had LANG=ru_RU, IIRC my default system encoding would be KOI8-R, which is a chthonic abomination from the dawn of the computer era that must be killed with fire :) So thanks but no thanks, I quite like my UTF-8 terminals that can handle more than two languages.

I was under the impression that modern Linux distributions prefer UTF-8 locales. Pretty sure at least Arch and Gentoo do.

dvictori commented 6 years ago

@lierdakil Bingo! I used the dictionaries from wooorm and now Atom spell check is working. Just hope it won't break any other program. So far, LibreOffice and Firefox spell check look fine.

It would be nice, though, for less technically inclined users to be able to use the native dictionary that comes with the operating system without having to change the file.

lierdakil commented 6 years ago

@dmoonfire, FWIW, running Atom with env LANG=en_US atom doesn't seem to change the behaviour any. That is, UTF-8 dictionaries are still working. EDIT: LANG=en_US.ISO-8859-1 doesn't seem to have any effect either.

ventolinmx commented 6 years ago

So I installed wooorm's Spanish UTF-8 dictionary with npm install dictionary-es and it behaves the same. Do I need to configure this in Atom somewhere to activate the UTF-8 dictionary? I have a special locale mix in Debian, using en_US for LANG, but changing this to Spanish has the same problem.

lierdakil commented 6 years ago

@ventolinmono, you can point Atom to the directory where you installed the dictionary. Check spell-check settings.

dvictori commented 6 years ago

I just copied the files from wooorm repository to /usr/share/hunspell and renamed to the correct locale. So dictionaries/pt-BR/index.dic from wooorm became /usr/share/hunspell/pt_BR.dic. A very ugly hack, I might say.
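
A less invasive variant of the same idea can be sketched in Python: copy the files into a user-writable directory, rename them to the locale, and point the spell-check settings at that directory instead of overwriting /usr/share/hunspell. The function name `install_dictionary` is hypothetical, and the sketch assumes that each `dictionaries/<tag>/` folder holds an `index.aff` next to the `index.dic` mentioned above.

```python
import shutil
from pathlib import Path


def install_dictionary(repo_dir, tag, target_dir):
    """Copy dictionaries/<tag>/index.{dic,aff} to <target_dir>/<locale>.{dic,aff}.

    'pt-BR' becomes 'pt_BR', matching the renaming described above.
    Returns the list of files that were written.
    """
    source = Path(repo_dir) / "dictionaries" / tag
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    locale_name = tag.replace("-", "_")
    copied = []
    for suffix in (".dic", ".aff"):
        destination = target / (locale_name + suffix)
        # copyfile raises FileNotFoundError if the assumed source file is missing.
        shutil.copyfile(source / ("index" + suffix), destination)
        copied.append(destination)
    return copied
```

For example, `install_dictionary("dictionaries-checkout", "pt-BR", Path.home() / "hunspell")` would leave the system dictionaries untouched; both paths here are placeholders.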

dmoonfire commented 6 years ago

I never knew about wooorm's dictionaries. They have an MIT license, so that is reasonable. If UTF-8 is the only thing needed, I'll try creating a couple of Atom packages to install specific language dictionaries and see if that behaves; the plugin system for spell-check is designed for that.

wooorm commented 6 years ago

@dmoonfire They do not have an MIT license. Every dictionary comes with a different license!

edusantana commented 6 years ago

Here's a problem that I have with this. $LANG = pt_BR.UTF-8, Ubuntu 16.04.

elissonmichael commented 6 years ago

I just copied the files from wooorm repository to /usr/share/hunspell and renamed to the correct locale. So dictionaries/pt-BR/index.dic from wooorm became /usr/share/hunspell/pt_BR.dic. A very ugly hack, I might say.

@edusantana this worked for me!

ghost commented 6 years ago

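On Arch Linux, I solved it by converting the dictionary to UTF-8: iconv -f ISO-8859-1 -t UTF-8 /usr/share/hunspell/YOURDIC.dic > /tmp/YOURDIC.dic, then moving /tmp/YOURDIC.dic back over the original. (Redirecting the output straight onto the input file truncates it before iconv can read it.) It's simply an issue of encoding.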
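
The same conversion can be sketched in Python so that the originals are never modified. This is a hedged sketch, not a confirmed fix: the function name `convert_dictionary` and its parameters are invented for illustration, the source encoding must be supplied by the caller, and other comments in this thread report that conversion alone did not always help.

```python
import re
from pathlib import Path


def convert_dictionary(src_dir, name, out_dir, source_encoding="iso8859_1"):
    """Write UTF-8 copies of <name>.dic and <name>.aff into out_dir.

    The .aff copy gets its first SET line rewritten to 'SET UTF-8' so the
    declared encoding matches the new bytes. The source files are only read.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for suffix in (".dic", ".aff"):
        text = (src_dir / (name + suffix)).read_bytes().decode(source_encoding)
        if suffix == ".aff":
            # Replace only the first 'SET <encoding>' found at the start of a line.
            text = re.sub(r"^SET[ \t]+\S+", "SET UTF-8", text,
                          count=1, flags=re.MULTILINE)
        # Write bytes directly so line endings are preserved exactly.
        (out_dir / (name + suffix)).write_bytes(text.encode("utf-8"))
```

Writing to a separate directory sidesteps the truncation risk of converting a file onto itself and keeps the system dictionaries usable by other programs.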

ferenczy commented 6 years ago

I would really like to avoid converting my dictionaries into UTF-8 encoding. I'm using the original dictionaries from LibreOffice and sharing them between multiple applications, and I'm not sure they'll still work after the conversion. Sure, I can try it, but I would like to avoid repeating the conversion every time I update the dictionaries anyway.

The .aff file states the encoding the dictionary uses on its very first line (in my case it's SET ISO8859-2), so it should be easy to read it and use it without any user intervention.
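
Reading that declaration takes only a few lines. The following Python sketch is illustrative only: `declared_encoding` and `python_codec` are hypothetical helpers, not part of Atom or node-spellchecker, and the ISO8859-1 fallback for files without a SET line is an assumption about Hunspell's default.

```python
import codecs


def declared_encoding(aff_path, default="ISO8859-1"):
    """Return the encoding named on the SET line of a Hunspell .aff file.

    The file is read as bytes because its encoding is unknown until the
    SET line has been found; the line itself is plain ASCII.
    """
    with open(aff_path, "rb") as handle:
        for raw in handle:
            parts = raw.decode("ascii", errors="ignore").split()
            if len(parts) >= 2 and parts[0] == "SET":
                return parts[1]
    return default  # assumption: fallback used when no SET line exists


def python_codec(hunspell_name):
    """Map a Hunspell encoding name to a Python codec name.

    Raises LookupError for names Python has no codec for.
    """
    name = hunspell_name
    if name.lower().startswith("microsoft-"):
        name = name[len("microsoft-"):]  # e.g. microsoft-cp1251 -> cp1251
    return codecs.lookup(name).name
```

A caller could then decode dictionary words with `python_codec(declared_encoding(path))` rather than assuming UTF-8.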

ghost commented 6 years ago

@ferenczy Definitely. I found these issues: https://github.com/LibreOffice/dictionaries/issues/7 in the LibreOffice repo, and https://github.com/atom/node-spellchecker/issues/89 in Atom itself.

dmoonfire commented 6 years ago

Ideally, a conversion shouldn't be needed because most dictionary files tell you their encoding. I'm trying to get back to this and look at it. I think the underlying problem is at the C++ layer, which is no longer my strength, and I have a few obligations that are getting in the way. I want to fix this, mainly because it is driving me nuts too. :)

edusantana commented 5 years ago

@dmoonfire any luck with that? Any workaround? I have converted those files to UTF-8, replaced the SET line with SET UTF-8, and added FLAG UTF-8, but I still have this problem.

dmoonfire commented 5 years ago

@edusantana: Over the last week, I worked on a PR for node-spellchecker which should fix the encoding errors that were happening between Hunspell and JavaScript. If all goes well, I can get that verified and rolled into Atom. It should handle most of the accented word problems. It doesn't require dictionaries to be in UTF-8 format either, so dropping them in should hopefully Just Work™.

https://github.com/atom/node-spellchecker/pull/95

It just took me a while to figure out text encoding in C++ on four different platforms.

rbertoche commented 5 years ago

Converting Latin-1 files to UTF-8 and changing the format tag did not work for me: it somehow picks up only a subset of the dictionary, so it still shows correct words as misspelled.

Is there any way for me to configure a path for the dictionary so that this extension will pick it up? I don't want to risk breaking other spell-check tools, as they are working properly.

dmoonfire commented 5 years ago

Atom 1.37 has a fix for passing accented characters for spell-checking. It handles dictionary files that aren't UTF-8 encoded. Could you please check with the beta and see if it solves the problem? Thank you.

edusantana commented 5 years ago

@dmoonfire I will try it... Thanks!!! It works now!!! Look!

dmoonfire commented 4 years ago

It sounds like this is resolved, so I'm going to close this issue. Feel free to open a new one.
