barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License
713 stars, 164 forks

dictionary inconsistency #115

Closed mhillendahl closed 11 months ago

mhillendahl commented 2 years ago

Python 3.9.5 Windows 10 x64

Expected Behavior: Each language's dictionary contains only words from that language. Words written in a different language, deliberately or by mistake, are reported as unknown.

Observed Behavior: Each language appears to contain itself plus one or more additional languages.
en contains words from English as expected, but also from Spanish, French, and German.
es contains words from Spanish as expected, but also from English and French.
fr contains words from French as expected, but also from English.
pt contains words from Portuguese as expected, but also from English.
de contains words from German as expected, but also from English.

Impact: Typos in the selected language go undetected if they happen to match a word from one of the extra languages (see console output below).

Steps to Reproduce

spellCheckerTest.py

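from spellchecker import SpellChecker

langs = ['en', 'es', 'fr', 'pt', 'de']
# The word for "word" in each language above, plus a non-word as a control.
words = ['word', 'palabra', 'mot', 'palavra', 'wort', 'notaword']

for code in langs:
    spell = SpellChecker(language=code)
    known = spell.known(words)      # words found in this language's dictionary
    unknown = spell.unknown(words)  # words missing from it
    print(f'{code} : {", ".join(sorted(known))} ({", ".join(sorted(unknown))})')

input()  # keep the console window open until Enter is pressed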

console output

en : mot, palabra, word, wort (notaword, palavra)
es : mot, palabra, word (notaword, palavra, wort)
fr : mot, word (notaword, palabra, palavra, wort)
pt : palavra, word (mot, notaword, palabra, wort)
de : word, wort (mot, notaword, palabra, palavra)

barrust commented 2 years ago

The data used to build the dictionaries are pulled from the opensubtitles project. The build process is automated using script/build_dictionaries.py. I don't know of a good method to automatically find and flag each of these "cross-overs" but any help in making the build_dictionaries.py script more robust would be appreciated.
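One possible heuristic for flagging such cross-overs, offered only as a rough sketch: compare each word's relative frequency in the target language with its relative frequency in the other languages, and flag it when it is far more common elsewhere. The function names, the `ratio` threshold, and the toy counts below are all hypothetical and are not taken from build_dictionaries.py. Words that legitimately belong to several languages would need an allow-list or manual review.

```python
from typing import Dict


def relative_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw counts into each word's share of the total count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def flag_crossovers(
    target: Dict[str, int],
    others: Dict[str, Dict[str, int]],
    ratio: float = 10.0,  # hypothetical threshold, would need tuning
) -> Dict[str, str]:
    """Return {word: language} for words in `target` whose relative
    frequency in some other language is at least `ratio` times higher."""
    target_rel = relative_frequencies(target)
    other_rel = {lang: relative_frequencies(c) for lang, c in others.items()}

    flagged = {}
    for word, share in target_rel.items():
        # Find the other language in which this word is most common.
        best_lang, best_share = None, 0.0
        for lang, rel in other_rel.items():
            candidate = rel.get(word, 0.0)
            if candidate > best_share:
                best_lang, best_share = lang, candidate
        if best_lang is not None and best_share >= ratio * share:
            flagged[word] = best_lang
    return flagged


# Toy counts invented for illustration; real data would come from the
# per-language frequency lists.
en_counts = {"word": 900, "the": 9000, "palabra": 2, "mot": 3}
es_counts = {"palabra": 800, "el": 9000, "word": 1}
fr_counts = {"mot": 700, "le": 9000, "word": 1}

flagged = flag_crossovers(en_counts, {"es": es_counts, "fr": fr_counts})
print(flagged)
```

With these toy counts, "palabra" and "mot" are flagged as likely Spanish and French cross-overs in the English list, while "word" is kept because it is far more frequent in English than anywhere else.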

stephencawood commented 1 year ago

There are almost 1.5 million entries in the English dictionary. It's clear that there are far too many. But it's not just French, Spanish, and German. For example:

"開かれた": 1,
"闇を切り裂いてさ": 2,
"阎东生": 1,
"阿昭": 1,
"降り出した雪": 10,
"限りがあるってのを知っていてムダにしちゃうんだろう": 2,
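
Entries like these could, in principle, be dropped with a script check before the dictionary is written. The following is a minimal sketch, assuming the dictionary is a plain word-to-count mapping; `is_latin_script` and `keep_latin_entries` are hypothetical helpers, and the counts for the English sample words are made up. It only catches words written in a non-Latin script, so it would not help with the Spanish, French, or German cross-overs reported above.

```python
import unicodedata
from typing import Dict


def is_latin_script(word: str) -> bool:
    """True if the word has at least one letter and every letter is Latin.

    Non-letters such as apostrophes and hyphens are ignored.
    """
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def keep_latin_entries(freqs: Dict[str, int]) -> Dict[str, int]:
    """Drop entries whose key contains letters from a non-Latin script."""
    return {word: count for word, count in freqs.items() if is_latin_script(word)}


# The non-Latin entries are from the example above; the other counts
# are invented for illustration.
sample = {
    "word": 5000,
    "naïve": 12,
    "don't": 300,
    "開かれた": 1,
    "阿昭": 1,
    "降り出した雪": 10,
}

cleaned = keep_latin_entries(sample)
print(cleaned)
```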