Closed by josephbirkner 6 years ago
Results
Performance baseline metrics with corruption at stddev=0.5, mean=0.4 (evaluated only on city names, but model data was generated for the whole corpus). For symspell, the max_distance=x/y indicator means maximum_deletes=x, maximum_insertions=y:
symspell max_distance=2/2 (model_data=2.1GB)
* CITY: corr=0.9980556183585801
symspell max_distance=1/1 (model_data=611MB)
* CITY: corr=0.9518878589192855
symspell max_distance=1/2 (model_data=611MB)
* CITY: corr=0.9524530861406285
symspell max_distance=0/2 (model_data=32MB)
* CITY: corr=0.8888311101062627
deepspell deepsp_spell-v2_na-lower_lr003_dec70_bat2048_emb8_fw128_bw128_co256-256_dein256-256_drop75.json:
* CITY: corr=0.9130002260908885
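For context, the core idea behind the symspell algorithm linked below is that candidate corrections are found purely through character deletions: dictionary tokens are indexed by their delete variants, and a corrupted query is matched by generating its own delete variants. The following is a minimal illustration of delete-variant generation, not the actual symspell implementation; the function name and interface are invented for this sketch.

```python
from itertools import combinations

def delete_variants(token, max_deletes):
    """Return all strings obtainable from `token` by deleting
    up to `max_deletes` characters (including zero deletions)."""
    variants = set()
    n = len(token)
    for d in range(max_deletes + 1):
        # Choose which d character positions to drop.
        for dropped in combinations(range(n), d):
            kept = (c for i, c in enumerate(token) if i not in dropped)
            variants.add("".join(kept))
    return variants
```

Under this scheme, maximum_deletes bounds the variants precomputed from dictionary tokens, while maximum_insertions bounds the deletions applied to the query at lookup time; this asymmetry explains why max_distance=0/2 needs far less model data (32MB) than max_distance=2/2 (2.1GB).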
The corr value indicates the share of tests in which the correct version of the corrupted token was among the first 3 correction results.
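That metric (top-3 hit rate) can be sketched as follows; this is a hypothetical reimplementation for clarity, not the evaluation code used for the numbers above.

```python
def top3_accuracy(test_cases, corrector):
    """Share of (corrupted, truth) pairs where `truth` appears among
    the first 3 candidates returned by `corrector(corrupted)`."""
    hits = sum(
        1 for corrupted, truth in test_cases
        if truth in corrector(corrupted)[:3]
    )
    return hits / len(test_cases)
```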
Note that during the construction of the corruption-DAWG, corrupted strings must be collected in a TRIE before they can be collectively processed into a DAWG. This step is very memory-intensive and sensitive to the implementation of the TRIE. The following 4 Python trie packages were tested:
* pygtrie: Extremely bad insertion performance; does not release memory fast enough.
* marisa-trie: Does not support dynamic insertion.
* datrie: Releases memory almost fast enough, but also has very slow dynamic insertion performance, and no unicode support. Pro: supports selective ASCII character subsets.
* hat-trie: Very good memory management AND dynamic insertion behavior.
The Python DAWG package was selected over pydawg because it supports attaching values (correct token references) to nodes.
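To make the collect-then-freeze pipeline concrete, here is a minimal nested-dict trie that attaches correct-token references to terminal nodes, mirroring the value-attachment feature that motivated choosing the DAWG package. This is a toy sketch with invented helper names, not the hat-trie or DAWG code actually used; a real DAWG would additionally merge equivalent suffix subtrees to save memory.

```python
def build_trie(pairs):
    """Collect (corrupted, correct) pairs in a nested-dict trie.
    Each terminal node stores the set of correct-token references
    under the sentinel key "$"."""
    root = {}
    for corrupted, correct in pairs:
        node = root
        for ch in corrupted:
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(correct)
    return root

def lookup(root, token):
    """Return the correct-token references stored for `token`,
    or an empty set if the token is not in the trie."""
    node = root
    for ch in token:
        if ch not in node:
            return set()
        node = node[ch]
    return node.get("$", set())
```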
9c1cbd6
Add a baseline model for the spellchecker. Based on some aggressive Google searching, https://github.com/wolfgarbe/symspell seems to be the algorithm of choice.
Add an evaluation algorithm. For a random selection of randomly corrupted tokens, the algorithm should measure how often the correct token appears among the first 3 corrections.