Closed: dmoonfire closed this pull request 5 years ago.
@50Wliu: Would you look at the results and see what you think? All the tests are running green now.
While the first commit gets encoding conversion to and from the various formats working, the second addresses some potential buffer overruns I noticed while working with the code. I felt the two changes were different enough to justify separate commits. The overrun fixes mostly focus on the Hunspell side of things, since that is where I noticed the issues.
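For the overrun side, the general shape of the guard looks something like the sketch below. This is a generic illustration only, not code from the commit; the function name, parameters, and buffer handling are hypothetical:

```cpp
// Hypothetical illustration of a bounds-checked copy. A word converted to a
// different encoding (e.g. into UTF-8) can be longer in bytes than the
// original, so the destination size must be checked rather than assumed.
#include <cstring>
#include <string>

bool copy_converted_word(const std::string &converted, char *out, size_t out_size) {
  if (converted.size() + 1 > out_size) {
    return false;  // caller must handle the oversized word instead of overrunning
  }
  std::memcpy(out, converted.c_str(), converted.size() + 1);
  return true;
}
```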
@nathansobo: Thank you for looking at this.
@dmoonfire Absolutely. This is a huge amount of painstaking work you've done here to make these other dictionaries work. I really appreciate it. :zap:
Just going to try to test this out locally.
Because of how it is structured, an ideal test would cover Windows, Linux, and Mac. Sadly, there are three separate code paths through this thing, and you also need non-ASCII characters to test with.
Ah yes, of course. I'm not going to be able to test this very effectively right now because I don't have a Linux setup. Presumably you have tested this as part of the spell-check package?
Checking files uses the encoding of the Hunspell dictionary file. There is also a `setlocale(LC_CTYPE, "en_US.utf8")` call so word-breaking works properly when looking for word splits during checking.
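For context, here is a minimal, self-contained sketch of why setting `LC_CTYPE` matters for word-breaking. This is not code from the PR; the sample text and the locale check are made up for illustration:

```cpp
// A minimal sketch (not the package's actual code) of why the LC_CTYPE
// locale matters: with a UTF-8 locale, mbrtowc() decodes each accented
// letter to a single wide character and iswalpha() treats it as part of a
// word, so word-breaking does not split a multibyte character mid-sequence.
#include <clocale>
#include <cstdio>
#include <cstring>
#include <cwchar>
#include <cwctype>

int main() {
  // Assumption: an en_US.utf8 locale is installed on the system.
  if (!setlocale(LC_CTYPE, "en_US.utf8")) {
    fprintf(stderr, "en_US.utf8 locale not available\n");
    return 1;
  }

  const char *text = "naïve café";  // UTF-8 bytes
  mbstate_t state;
  memset(&state, 0, sizeof state);

  size_t len = strlen(text);
  for (size_t i = 0; i < len;) {
    wchar_t wc;
    size_t n = mbrtowc(&wc, text + i, len - i, &state);
    if (n == (size_t)-1 || n == (size_t)-2) break;  // bad or truncated sequence
    printf("U+%04X %s\n", (unsigned)wc, iswalpha((wint_t)wc) ? "letter" : "word break");
    i += n ? n : 1;
  }
  return 0;
}
```

In the default "C" locale, the individual bytes of an accented character are generally not classified as letters, so words containing them get split in the middle; the UTF-8 locale avoids that.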