bartosz-antosik / vscode-spellright

Multilingual, Offline and Lightweight Spellchecker for Visual Studio Code

New N-API bindings (stable ABI) to the Windows and macOS native spell-checker APIs #595

Open rotemdan opened 3 weeks ago

rotemdan commented 3 weeks ago

Earlier today I looked at the Windows code in the deprecated atom/node-spellchecker package that's used in the extension.

Given that I'm a Windows user, have a working example in atom/node-spellchecker, and have some recent experience with N-API development, I thought I could write a new binding that is simpler and more maintainable, since the current one uses outdated and unstable APIs.

It's now published as the windows-spellchecker npm package (MIT license). Repository is here.

The core C++ addon is this single .cpp file.

Main differences to the approach in node-spellchecker:

I'll integrate this one as well, and test it along with the WebAssembly one (which I'm currently continuously testing).

Once it's integrated and working, it'll reduce the need to rebuild the binding over and over again.

The macOS binding can also be rewritten this way, but that's for the future. I don't have a macOS machine, so I can only test in a VM, repeatedly syncing source files over SSH; trying to use a development UI in the VM is a completely unusable experience.

rotemdan commented 3 weeks ago

Some updates

rotemdan commented 3 weeks ago

More updates

Next steps: a decoupled, reusable backend module

Now that we've got the addons redone, I think the best way to use them is not to integrate them directly into the VSCode extension, but to write a "backend" library that acts as a kind of "spellcheck provider" or "spellcheck server" (though it wouldn't necessarily run as an actual server).

My general idea is that this provider would be a reusable npm package. It would create a background worker, like a Node.js worker thread (or a WebWorker), and load the addons (including the WASM one) in that worker. It could potentially also be multithreaded and create multiple workers, but that's for the future. The backend would be fully independent of any UI, and could also be tested and used as a plain library, or from a CLI app.
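To make the worker idea concrete, here is a minimal sketch of the message protocol such a worker could handle. All names here (`CheckRequest`, `handleRequest`, `WordChecker`) are my assumptions for illustration, not the API of any published package:

```typescript
// Sketch of the request/response protocol a spellcheck worker could handle.

type CheckRequest = { kind: "check"; id: number; language: string; words: string[] };
type CheckResponse = { kind: "checkResult"; id: number; misspelled: string[] };

// A pluggable checker: the real backend would delegate to the native
// addon (Windows/macOS) or to the WebAssembly Hunspell build.
type WordChecker = (language: string, word: string) => boolean;

function handleRequest(req: CheckRequest, isCorrect: WordChecker): CheckResponse {
  const misspelled = req.words.filter((word) => !isCorrect(req.language, word));
  return { kind: "checkResult", id: req.id, misspelled };
}

// In the actual backend, this dispatcher would run inside a worker created
// with `new Worker(...)` from `node:worker_threads`, receiving requests via
// `parentPort.on("message", ...)` and posting responses back, keeping the
// spellcheck work off the host application's main thread.
```

Because the protocol is just plain messages, the same dispatcher can be exercised directly in tests, without spinning up a worker at all.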

The backend would provide higher-level tools. I'm imagining a kind of "diff-based" approach (inspired by React), where:

In the backend I can also perform more accurate word segmentation, e.g. with the cldr-segmentation library, which I use heavily in Echogarden (a speech toolset I'm developing, which also includes several natural language processing features). cldr-segmentation has language-dependent datasets of abbreviations that can more accurately distinguish `.` characters used as part of a word from ones that signify sentence endings or other punctuation, as well as various numeric patterns that may include `.`.

Better word segmentation can also help reduce the number of false positives caused by special characters, separators, etc., as seen in the "Alice in Wonderland" example with Hunspell. (It likely happens because Hunspell doesn't properly trim the non-word characters sent to it; more accurate word segmentation would naturally "trim" the unwanted characters from each word.)

The word segmentation itself can also be cached in the backend. We'll see.

I don't see the amount of work to implement this being very large. I think redoing the native addons was actually more difficult.

Edit: forgot to mention: very soon I'm publishing a new language detection library called echo-ld (Echogarden Language Detection library). It's very accurate; for example, it can differentiate between different Italian dialects and even, to some limited degree, between Norwegian Nynorsk and Bokmål. I'll likely integrate it into the backend as well. Technically, it could be used to detect spelling errors even in a document containing multiple languages, by detecting the language of each segment independently and applying the correct dictionary for that language.
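A sketch of that multilingual idea: detect a language per segment, then check each segment against the matching dictionary. Since echo-ld's API isn't shown in this thread, `detectLanguage` is a pluggable stand-in here, and the word splitting is deliberately naive; both are my assumptions for illustration:

```typescript
// Per-segment language detection: each paragraph is checked against
// the dictionary for its own detected language.

type LanguageDetector = (text: string) => string; // returns e.g. "en", "fr"

function checkMultilingual(
  paragraphs: string[],
  detectLanguage: LanguageDetector,
  dictionaries: Map<string, Set<string>>,
): { paragraph: string; language: string; misspelled: string[] }[] {
  return paragraphs.map((paragraph) => {
    const language = detectLanguage(paragraph);
    const dictionary = dictionaries.get(language) ?? new Set<string>();
    const misspelled = paragraph
      .split(/\W+/) // naive split; the segmentation discussed above would be used instead
      .filter((word) => word.length > 0 && !dictionary.has(word.toLowerCase()));
    return { paragraph, language, misspelled };
  });
}
```

The detector and dictionaries being injected means the backend stays testable with stubs, and the real echo-ld and native/WASM spellcheckers can be plugged in later.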