codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

Automate suggestions for typos #2093

Open DimitriPapadopoulos opened 3 years ago

DimitriPapadopoulos commented 3 years ago

See https://github.com/codespell-project/codespell/pull/2092#issuecomment-933547257.

Choose a dictionary (the aspell dictionary?) and calculate the Damerau–Levenshtein distance between the typo and all words in that dictionary. Suggest all words at a given distance from the typo, and let the user decide (?) which suggestion to retain in the codespell dictionaries.
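A minimal sketch of the idea, not a proposed implementation. It uses the optimal string alignment variant (the restricted form of Damerau–Levenshtein, in which no substring is edited more than once) because it is the shortest to write down. The names `osa_distance`, `suggest` and `max_distance` are made up for illustration, and the sketch returns words within a maximum distance rather than at one exact distance.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters ("teh" -> "the").
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def suggest(typo, words, max_distance=1):
    """Return dictionary words within max_distance of typo, closest first."""
    typo_lower = typo.lower()
    scored = []
    for word in words:
        # Cheap filter: the length difference is a lower bound on the distance.
        if abs(len(word) - len(typo)) > max_distance:
            continue
        dist = osa_distance(typo_lower, word.lower())
        if dist <= max_distance:
            scored.append((dist, word))
    return [word for _, word in sorted(scored)]


if __name__ == "__main__":
    print(suggest("teh", ["the", "ten", "tech", "zymurgy"]))
```

Comparing a typo against every word of a full dictionary this way is quadratic per pair and linear in the dictionary size, so it is only reasonable as an offline step whose output a human then reviews.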

DimitriPapadopoulos commented 3 years ago

Here is how to extract the aspell English dictionary:

$ aspell -d en dump master | aspell -l en expand | awk -F \' '{ print $1; }' | sort | uniq
a
A
AA
AAA
Aachen
aah
[...]
zymurgy
Zyrtec
Zyuganov
Zzz
$ 
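
For use from Python, the same pipeline could be wrapped roughly as follows. This is a sketch that assumes `aspell` and its English dictionary are installed; `strip_possessives` and `dump_aspell_english` are hypothetical helper names. The helper mirrors the `awk`, `sort` and `uniq` stages, except that Python sorts by code point, so the order differs from a locale-aware `sort`.

```python
import subprocess


def strip_possessives(lines):
    """Keep the text before the first apostrophe on each line, deduplicated."""
    words = set()
    for line in lines:
        word = line.strip().split("'", 1)[0]
        if word:
            words.add(word)
    # Code-point order, unlike the locale-dependent order of sort(1).
    return sorted(words)


def dump_aspell_english():
    """Run the two aspell commands from the shell pipeline above."""
    dump = subprocess.run(
        ["aspell", "-d", "en", "dump", "master"],
        check=True, capture_output=True, text=True,
    )
    expanded = subprocess.run(
        ["aspell", "-l", "en", "expand"],
        input=dump.stdout, check=True, capture_output=True, text=True,
    )
    return strip_possessives(expanded.stdout.splitlines())
```
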
DimitriPapadopoulos commented 3 years ago

On the other hand, I don't think we are really interested in the Damerau–Levenshtein distance as such; it's too theoretical. The above mechanism might suggest possible typos, but the actual probability of some of these typos might be close to zero.

We are interested in actual typos that can be found in the wild, and the frequency of these typos depends on:

We should not care about typos that are possible in theory but do not actually occur. Perhaps the above mechanism could be used as a first step, but all possible typos should be vetted by checking if they actually exist in the "wild", the "wild" being the corpus of open source software, or the GitHub subset.
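
The vetting step could look something like the sketch below. It assumes the corpus is available as an iterable of text strings and that candidates come as a typo-to-fix mapping; `count_words`, `vet_candidates` and `min_occurrences` are illustrative names, and the tokenizer is deliberately naive (ASCII letters only).

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z]+")


def count_words(texts):
    """Count lowercase word occurrences across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def vet_candidates(candidates, corpus_counts, min_occurrences=1):
    """Keep only the candidate typos that actually occur in the corpus."""
    return {
        typo: fix
        for typo, fix in candidates.items()
        if corpus_counts[typo.lower()] >= min_occurrences
    }


if __name__ == "__main__":
    corpus = ["Teh quick fix", "recieve teh data"]
    candidates = {"teh": "the", "recieve": "receive", "hte": "the"}
    print(vet_candidates(candidates, count_words(corpus)))
```

A real corpus of source code would need identifier splitting (camelCase, snake_case) before counting, which this sketch leaves out.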

DimitriPapadopoulos commented 1 year ago

This manually updated database looks like a lost battle against lists of misspellings generated by AI.

See for example https://www.spellchecker.net/misspellings by Grammarly. With that said, such a list is not necessarily rocket science:

  1. Get hold of a large corpus of English texts, perhaps from English-Corpora.org. Having access to all GitHub sources would be nice, but I doubt GitHub are willing to open their database to anything but their own GitHub Copilot.
  2. Run aspell on these texts, and create a list of misspellings and fixes.
  3. Only keep statistically significant misspellings.
  4. Curate the list. That's the manual part where AI is actually very helpful.
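
Steps 2 and 3 could be sketched as below, assuming the spell-checker run has already produced a stream of (misspelling, fix) pairs. The thresholds `min_count` and `min_share` are arbitrary placeholders and a crude stand-in for a real significance test. The output is assumed to follow the `typo->fix` line format of the codespell dictionaries, with a trailing comma when several fixes are listed.

```python
from collections import Counter, defaultdict


def significant_entries(observations, min_count=5, min_share=0.2):
    """Aggregate (misspelling, fix) pairs and keep the frequent ones."""
    by_typo = defaultdict(Counter)
    for typo, fix in observations:
        by_typo[typo.lower()][fix.lower()] += 1

    lines = []
    for typo in sorted(by_typo):
        fixes = by_typo[typo]
        total = sum(fixes.values())
        if total < min_count:
            continue  # too rare to be worth a dictionary entry
        # Keep fixes that account for a meaningful share of this misspelling.
        kept = [fix for fix, n in fixes.most_common() if n / total >= min_share]
        if len(kept) == 1:
            lines.append(f"{typo}->{kept[0]}")
        elif kept:
            lines.append(f"{typo}->" + ", ".join(kept) + ",")
    return lines


if __name__ == "__main__":
    observed = [("teh", "the")] * 6 + [("acn", "can")] * 3 + [("acn", "and")] * 3
    for line in significant_entries(observed):
        print(line)
```

The resulting lines are only candidates for step 4; nothing here replaces the curation.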