GNUAspell / aspell

http://aspell.net
GNU Lesser General Public License v2.1

Aspell limitations for English words #617

Open johnbumgarner opened 3 years ago

johnbumgarner commented 3 years ago

I'm exploring using the Python package pyenchant in my open source project. Since I'm developing on a Mac, the backend of pyenchant is aspell. During testing I noticed that some English words are not found, so I'm trying to understand the limitations of aspell.

The code below checks 6 English words. It seems that 3 of them don't exist in the aspell dictionaries.

import enchant

words = ["bad", "omen", "smile", "pneumonoultramicroscopicsilicovolcanoconiosis",
         "supercalifragilisticexpialidocious", "incomprehensibilities"]
dictionary = enchant.Dict("en_US")  # create the dictionary once, outside the loop
for word in words:
    print(dictionary.check(word))

Output:

True
True
True
False
False
False

aspell version info:

aspell --version
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
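To rule out pyenchant itself, the same words can be piped through aspell's Ispell-compatible pipe mode (`aspell -a`). A minimal sketch, assuming `aspell` and an `en_US` dictionary are installed and on the PATH; the result codes (`*`, `+`, `-`, `&`, `#`) come from the Ispell pipe protocol:

```python
import subprocess

def parse_pipe_line(line):
    """Interpret one line of Ispell/aspell pipe-mode output.

    '*' (found), '+ root' (found via affix) and '-' (compound) mean the
    word was accepted; '&' (miss with suggestions) and '#' (miss without
    suggestions) mean it was not. Other lines (e.g. the version banner
    starting with '@', or blank separators) return None.
    """
    if not line:
        return None
    if line[0] in "*+-":
        return True
    if line[0] in "&#":
        return False
    return None

def check_with_aspell(words, lang="en_US"):
    """Pipe words through 'aspell -a' and map each word to True/False."""
    proc = subprocess.run(
        ["aspell", "-a", "-d", lang],
        input="\n".join(words) + "\n",
        capture_output=True, text=True, check=True,
    )
    results = [r for r in (parse_pipe_line(l) for l in proc.stdout.splitlines())
               if r is not None]
    return dict(zip(words, results))
```

If this disagrees with pyenchant for the same words, the problem is in the enchant layer rather than in the aspell dictionaries.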

Thanks in advance for any assistance.

DimitriPapadopoulos commented 1 year ago

Excellent question.

I think the mailing-list thread "Re: Updating dictionaries" gives a few hints. I have started looking into https://github.com/GNUAspell/aspell-lang, which explains how to generate dictionaries that can eventually be uploaded to ftp.gnu.org:

**********************************************************************
         Requirements in order to be uploaded to ftp.gnu.org
**********************************************************************

The number one requirement is that the dictionary package MUST be made
using "make dist" with the "proc" script as previously described.
This will check for a large number of things.

However, this technical documentation does not explain who or which team is currently in charge of running these tools to maintain the dictionaries for each language. You need to search the aspell mailing lists to find these well-hidden teams or individuals.

For English, these might be the web sites you're after:

The first one claims that “This word list is considered both complete and accurate” and points to SCOWL (and friends). The git repository for SCOWL (and friends) is:

DimitriPapadopoulos commented 1 year ago

The strange thing is that all of these words can actually be found in SCOWL (and friends). Make sure you have the most recent dictionaries installed, just in case. I would be interested in your findings, as I have similar issues myself, for example with donut:

>>> import enchant
>>> 
>>> words = ["donut", "donuts"]
>>> dictionary = enchant.Dict("en_US")
>>> 
>>> for word in words:
...     dictionary.check(word)
... 
False
True
>>> 
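One way a donut/donuts asymmetry can arise is affix compression: the word list stores base entries plus affix flags, and the checker accepts only the forms those flags expand to. A toy sketch of the idea (the entries and the "S" flag below are invented for illustration, not the actual en_US data):

```python
def expand(entries):
    """Expand a toy affix-compressed word list.

    Each entry maps a base word to a set of flags; an 'S' flag licenses
    a plural in -s. The accepted set is the bases plus expanded forms.
    """
    accepted = set(entries)
    for word, flags in entries.items():
        if "S" in flags:
            accepted.add(word + "s")
    return accepted

# Hypothetical data: 'donuts' stored as a plain entry, 'dog' with an 'S' flag.
toy_list = {"dog": {"S"}, "donuts": set()}
accepted = expand(toy_list)
# Expansion only adds derived forms; it never infers a missing base,
# so 'donuts' can be accepted while 'donut' is not.
```

If the word list ships "donuts" as a standalone entry instead of "donut" with a plural flag, exactly the asymmetry above is what the spell checker reports.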

And some trivia:

DimitriPapadopoulos commented 1 year ago

You may be using the default dictionary size, which is 60 on a scale from 10 to 90. From the aspell man page:

size

(string) The preferred size of the word list. This consists of a two char digit code describing the size of the list, with typical values of: 10=tiny, 20=really small, 30=small, 40=med-small, 50=med, 60=med-large, 70=large, 80=huge, 90=insane.

Have you tried a different size, 80 or even 90, for this kind of uncommon word? Chances are you need to choose the proper aspell options rather than fix an actual bug.
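For reference, the size can be changed without rebuilding anything, via aspell's personal configuration file. A minimal sketch, assuming the corresponding en_US word lists are actually installed (aspell can only use sizes that exist on disk), and assuming the enchant aspell provider picks up the personal configuration:

```
# ~/.aspell.conf — prefer a larger en_US word list
size 80
```

The same option can also be passed on the command line (`aspell --size=80 ...`) or set for a single run through the `ASPELL_CONF` environment variable.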

DimitriPapadopoulos commented 1 year ago

It's not the size of the dictionary after all. The case of donut is interesting: issue https://github.com/en-wl/wordlist/issues/310 gives a glimpse of how words are handled: