Open DimitriPapadopoulos opened 3 years ago
Here is how to extract the aspell English dictionary:
$ aspell -d en dump master | aspell -l en expand | awk -F \' '{ print $1; }' | sort | uniq
a
A
AA
AAA
Aachen
aah
[...]
zymurgy
Zyrtec
Zyuganov
Zzz
$
On the other hand, I don't think we are really interested in the Damerau–Levenshtein distance as such; it's too theoretical. The above mechanism might suggest possible typos, but the actual probability of some of these typos might be close to zero.
We are interested in actual typos that can be found in the wild, and the frequency of these typos depends on:
We should not care about typos that could exist but never actually occur. Perhaps the above mechanism could be used as a first step, but all possible typos should be vetted by checking whether they actually exist in the "wild", the "wild" being the corpus of open source software, or the GitHub subset.
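As a rough illustration of that vetting step, here is a minimal Python sketch that keeps only the candidate typos that actually occur in a set of documents. The names `count_candidates`, `vet` and `min_count` are made up for this example and are not part of codespell, and the in-memory list of strings is an assumption standing in for whatever corpus (open source software, a GitHub subset) would really be scanned.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters. Identifiers such as
# recieveData are NOT split here, so this undercounts typos in code.
WORD_RE = re.compile(r"[A-Za-z]+")


def count_candidates(candidates, documents):
    """Count how often each candidate typo occurs as a whole word."""
    wanted = {c.lower() for c in candidates}
    counts = Counter()
    for text in documents:
        for token in WORD_RE.findall(text):
            token = token.lower()
            if token in wanted:
                counts[token] += 1
    return counts


def vet(candidates, documents, min_count=1):
    """Keep only the candidates seen at least min_count times."""
    counts = count_candidates(candidates, documents)
    return {
        c: counts[c.lower()]
        for c in candidates
        if counts[c.lower()] >= min_count
    }


if __name__ == "__main__":
    docs = ["Teh quick fix for teh parser", "recieve the data"]
    print(vet(["teh", "recieve", "hte"], docs))
```

A higher `min_count` would filter out one-off accidents, at the cost of missing rare but real typos.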
This manually updated database looks like a losing battle against lists of misspellings generated by AI.
See for example https://www.spellchecker.net/misspellings by Grammarly. With that said, such a list is not necessarily rocket science:
See https://github.com/codespell-project/codespell/pull/2092#issuecomment-933547257.
Choose a dictionary (the aspell dictionary?) and calculate the Damerau–Levenshtein distance between the typo and all words in that dictionary. Suggest all words at a given distance from the typo, and let the user decide (?) which suggestion to retain in the codespell dictionaries.
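The suggestion step described above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: it uses the optimal string alignment variant (the restricted form of Damerau–Levenshtein, where each substring is edited at most once), it compares the typo against every word by brute force, and `osa_distance`, `suggest` and `max_distance` are hypothetical names, not codespell functions. The word list is assumed to come from something like the aspell dump shown earlier in this thread.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(typo, words, max_distance=1):
    """Return (distance, word) pairs within max_distance, closest first."""
    scored = ((osa_distance(typo, w), w) for w in words)
    return sorted((dist, w) for dist, w in scored if dist <= max_distance)


if __name__ == "__main__":
    words = ["the", "tea", "ten", "them", "zymurgy"]
    for dist, word in suggest("teh", words):
        print(dist, word)
```

Comparing against a full dictionary this way costs one distance computation per word, which is fine for one-off vetting of a typo but would need an index or a length-based prefilter to run in bulk. The final choice among the suggestions is still left to the user, as proposed above.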