Closed by josephbirkner 6 years ago
Results
Performance baseline metrics with corruption at stddev=0.5, mean=0.4 (evaluated only on city names, but model data was generated for the whole corpus). For symspell, the max_distance=x/y indicator means maximum_deletes=x, maximum_insertions=y:
symspell max_distance=2/2 (model_data=2.1GB)
* CITY: corr=0.9980556183585801
symspell max_distance=1/1 (model_data=611MB)
* CITY: corr=0.9518878589192855
symspell max_distance=1/2 (model_data=611MB)
* CITY: corr=0.9524530861406285
symspell max_distance=0/2 (model_data=32MB)
* CITY: corr=0.8888311101062627
deepspell deepsp_spell-v2_na-lower_lr003_dec70_bat2048_emb8_fw128_bw128_co256-256_dein256-256_drop75.json:
* CITY: corr=0.9130002260908885
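For context, the core idea behind the symspell algorithm linked below is that candidate corrections are found purely through character deletions: dictionary tokens are indexed by their delete variants, and a corrupted query is matched by generating its own delete variants. The following is a minimal illustration of delete-variant generation, not the actual symspell implementation; the function name and interface are invented for this sketch.

```python
from itertools import combinations

def delete_variants(token, max_deletes):
    """Return all strings obtainable from `token` by deleting
    up to `max_deletes` characters (including zero deletions)."""
    variants = set()
    n = len(token)
    for d in range(max_deletes + 1):
        # Choose which d character positions to drop.
        for dropped in combinations(range(n), d):
            kept = (c for i, c in enumerate(token) if i not in dropped)
            variants.add("".join(kept))
    return variants
```

Under this scheme, maximum_deletes bounds the variants precomputed from dictionary tokens, while maximum_insertions bounds the deletions applied to the query at lookup time; this asymmetry explains why max_distance=0/2 needs far less model data (32MB) than max_distance=2/2 (2.1GB).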
The corr value indicates the share of tests in which the correct version of the corrupted token was among the first 3 correction results.
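That metric (top-3 hit rate) can be sketched as follows; this is a hypothetical reimplementation for clarity, not the evaluation code used for the numbers above.

```python
def top3_accuracy(test_cases, corrector):
    """Share of (corrupted, truth) pairs where `truth` appears among
    the first 3 candidates returned by `corrector(corrupted)`."""
    hits = sum(
        1 for corrupted, truth in test_cases
        if truth in corrector(corrupted)[:3]
    )
    return hits / len(test_cases)
```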
Note that during the construction of the corruption-DAWG, corrupted strings must be collected in a TRIE before they can be collectively processed into a DAWG. This step is very memory-intensive and sensitive to the implementation of the TRIE. The following 4 Python trie packages were tested:
* pygtrie: Extremely bad insertion performance; does not release memory fast enough.
* marisa-trie: Does not support dynamic insertion.
* datrie: Releases memory almost fast enough, but also has very slow dynamic insertion performance, and no unicode support. Pro: supports selective ASCII character subsets.
* hat-trie: Very good memory management AND dynamic insertion behavior.
The Python DAWG package was selected over pydawg because it supports attaching values (correct token references) to nodes.
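To make the collect-then-freeze pipeline concrete, here is a minimal nested-dict trie that attaches correct-token references to terminal nodes, mirroring the value-attachment feature that motivated choosing the DAWG package. This is a toy sketch with invented helper names, not the hat-trie or DAWG code actually used; a real DAWG would additionally merge equivalent suffix subtrees to save memory.

```python
def build_trie(pairs):
    """Collect (corrupted, correct) pairs in a nested-dict trie.
    Each terminal node stores the set of correct-token references
    under the sentinel key "$"."""
    root = {}
    for corrupted, correct in pairs:
        node = root
        for ch in corrupted:
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(correct)
    return root

def lookup(root, token):
    """Return the correct-token references stored for `token`,
    or an empty set if the token is not in the trie."""
    node = root
    for ch in token:
        if ch not in node:
            return set()
        node = node[ch]
    return node.get("$", set())
```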
9c1cbd6
Add a baseline model for the spellchecker. Based on some aggressive Google searching, https://github.com/wolfgarbe/symspell seems to be the algorithm of choice.
Add an evaluation algorithm. For a random selection of randomly corrupted tokens, the algorithm should measure how often the correct token appears among the first 3 corrections.