bakwc / JamSpell

Modern spell checking library - accurate, fast, multi-language
https://jamspell.com/
MIT License

JamSpell


JamSpell is a spell checking library with the following features:

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features

Content

Benchmarks

Corrector   Errors   Top 7 Errors   Fix Rate   Top 7 Fix Rate   Broken   Speed (words/second)
JamSpell    3.25%    1.27%          79.53%     84.10%           0.64%    4854
Norvig      7.62%    5.00%          46.58%     66.51%           0.69%    395
Hunspell    13.10%   10.33%         47.52%     68.56%           7.14%    163
Dummy       13.14%   13.14%         0.00%      0.00%            0.00%    -

The model was trained on 300K Wikipedia sentences + 300K news sentences (English); 95% was used for training and 5% for evaluation. An error model was used to generate errored text from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector that makes no corrections.
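
The evaluation setup described above (a 95%/5% split, then synthetic errors injected into the held-out text) can be sketched as follows. This is only an illustration under stated assumptions: the error model JamSpell actually uses is not described here, so the single-edit typo generator below (`add_typo`) and the helper names `split_sentences` and `corrupt` are hypothetical.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed lowercase English alphabet


def split_sentences(sentences, train_fraction=0.95, seed=0):
    """Shuffle sentences and split them into (train, evaluation) lists."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def add_typo(word, rng):
    """Apply one random edit: replace, insert, delete or transpose.

    The word may come back unchanged, e.g. when a letter is replaced
    by itself or when the chosen edit does not fit the word.
    """
    if not word:
        return word
    pos = rng.randrange(len(word))
    kind = rng.choice(["replace", "insert", "delete", "transpose"])
    if kind == "replace":
        return word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]
    if kind == "insert":
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    if kind == "delete" and len(word) > 1:
        return word[:pos] + word[pos + 1:]
    if kind == "transpose" and pos + 1 < len(word):
        return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    return word


def corrupt(sentence, error_rate=0.1, seed=0):
    """Return the sentence with a typo attempted on roughly error_rate of its words."""
    rng = random.Random(seed)
    words = sentence.split()
    return " ".join(
        add_typo(w, rng) if rng.random() < error_rate else w for w in words
    )
```

A corrector is then scored by comparing its output on the corrupted evaluation sentences against the untouched originals.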

We used the following metrics:

To check that the model is not overfitted to Wikipedia + news, we also evaluated it on the text of "The Adventures of Sherlock Holmes":

Corrector   Errors   Top 7 Errors   Fix Rate   Top 7 Fix Rate   Broken   Speed (words/second)
JamSpell    3.56%    1.27%          72.03%     79.73%           0.50%    5524
Norvig      7.60%    5.30%          35.43%     56.06%           0.45%    647
Hunspell    9.36%    6.44%          39.61%     65.77%           2.95%    284
Dummy       11.16%   11.16%         0.00%      0.00%            0.00%    -

More details on reproducing these results are available in the "Train" section.
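
The columns of the first benchmark table can be related to one another with a short calculation. The sketch below assumes (the definitions are not spelled out here) that Fix Rate is the share of errored words restored to the original, Broken is the share of correct words that the corrector damaged, and the Dummy row's Errors value (13.14%) is the error rate before any correction. The function name `expected_error_rate` is illustrative.

```python
def expected_error_rate(initial_errors, fix_rate, broken):
    """Estimate the post-correction error rate; all arguments are fractions in [0, 1]."""
    remaining = initial_errors * (1.0 - fix_rate)   # errored words left unfixed
    introduced = (1.0 - initial_errors) * broken    # correct words that got broken
    return remaining + introduced


INITIAL_ERRORS = 0.1314  # Errors of the Dummy corrector (no corrections)

# (Fix Rate, Broken) taken from the Wikipedia + news table
rows = {
    "JamSpell": (0.7953, 0.0064),
    "Norvig": (0.4658, 0.0069),
    "Hunspell": (0.4752, 0.0714),
}

for name, (fix_rate, broken) in rows.items():
    rate = expected_error_rate(INITIAL_ERRORS, fix_rate, broken)
    print(f"{name}: {rate:.2%}")
# JamSpell: 3.25%
# Norvig: 7.62%
# Hunspell: 13.10%
```

Under these assumptions the computed values match the Errors column, which shows why Hunspell ends up close to the Dummy baseline despite fixing almost half of the errors: its Broken rate introduces nearly as many new errors as it removes.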

Usage

Python

  1. Install swig3 (usually available from your distribution's package manager)

  2. Install jamspell:

    pip install jamspell
  3. Download or train a language model

  4. Use it:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
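
In the calls above, the second argument to GetCandidates appears to be the zero-based position of the word to correct (3 selects 'begt', 5 selects 'cherken'); this is inferred from the example output. A small helper, sketched here under that assumption, looks up the position by word so the index does not have to be counted by hand. The name `candidates_for` and its `limit` parameter are hypothetical and not part of JamSpell.

```python
def candidates_for(corrector, words, target, limit=None):
    """Return correction candidates for the first occurrence of `target`.

    `corrector` is any object exposing GetCandidates(words, position),
    for example a jamspell.TSpellCorrector with a loaded language model.
    """
    position = words.index(target)  # raises ValueError if target is absent
    candidates = list(corrector.GetCandidates(words, position))
    return candidates[:limit]  # limit=None keeps every candidate


# Usage with JamSpell (needs the jamspell package and a model file):
#   import jamspell
#   corrector = jamspell.TSpellCorrector()
#   corrector.LoadLangModel('en.bin')
#   words = ['i', 'am', 'the', 'begt', 'spell', 'cherken']
#   candidates_for(corrector, words, 'cherken', limit=3)
```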

C++

  1. Add the jamspell and contrib directories to your project

  2. Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at position 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at position 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages with SWIG (see the SWIG tutorial). The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

Train

To train a custom model:

  1. Install cmake

  2. Clone and build jamspell:

    git clone https://github.com/bakwc/JamSpell.git
    cd JamSpell
    mkdir build
    cd build
    cmake ..
    make
  3. Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)

  4. Train the model:

    ./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
  5. To evaluate the spell checker, use the evaluate/evaluate.py script:

    python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
  6. You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.
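
The train and evaluate commands from the steps above can also be driven from a script. The sketch below only assembles the same command lines and hands them to subprocess; the helper names (`train_command`, `evaluate_command`, `run`) are hypothetical, and the meaning of the -mx value is not stated here, so it is passed through unchanged.

```python
import subprocess


def train_command(alphabet, corpus, model_out, binary="./main/jamspell"):
    """Build the `jamspell train` command shown in step 4."""
    return [binary, "train", alphabet, corpus, model_out]


def evaluate_command(alphabet, model, test_data, mx=50000,
                     script="evaluate/evaluate.py"):
    """Build the evaluate.py command shown in step 5.

    `mx` is forwarded as the -mx option exactly as in the example.
    """
    return ["python", script, "-a", alphabet, "-jsp", model,
            "-mx", str(mx), test_data]


def run(command, cwd=None):
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(command, cwd=cwd, check=True)


# Example (assumes the build directory layout from step 2):
#   run(train_command("../test_data/alphabet_en.txt",
#                     "../test_data/sherlockholmes.txt",
#                     "model_sherlock.bin"), cwd="build")
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file paths contain spaces.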

Download models

Here are a few simple models, trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.