dwyl / english-words

:memo: A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion
The Unlicense

words.txt lacks words that are in words_alpha.txt #93

Open carlosaguilarmelchor opened 3 years ago

carlosaguilarmelchor commented 3 years ago

Example:

# cat words_alpha.txt|grep ^ned                                        
ned
nedder
neddy
neddies
nederlands
# cat words.txt|grep ^ned
nedder
neddies
#

The documentation states that words_alpha.txt is a subset of words.txt, which apparently is not the case as of now.

adsteel commented 2 years ago

I think this is just a case sensitivity issue.

$ cat words.txt|grep -i ^ned
NED
Neda
NEDC
Nedda
nedder
Neddy
Neddie
neddies
Neddra
Nederland
Nederlands
Nedi
Nedra
Nedrah
Nedry
Nedrow
Nedrud

While it would be nice for these files to be perfectly formatted, this is a good reminder to clean your data before doing calculations.
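As a minimal sketch of that cleaning step, the snippet below compares two word lists with and without case folding. The samples are copied from the grep output in this thread; the helper name `missing_from` is hypothetical and not part of the repository.

```python
def missing_from(subset_words, superset_words, ignore_case=False):
    """Return the words in subset_words that do not appear in superset_words."""
    if ignore_case:
        # casefold() is a slightly more aggressive normalization than lower()
        subset = {w.casefold() for w in subset_words}
        superset = {w.casefold() for w in superset_words}
    else:
        subset, superset = set(subset_words), set(superset_words)
    return subset - superset


# Small samples copied from the grep output in this thread.
words_alpha_sample = ["ned", "nedder", "neddy", "neddies", "nederlands"]
words_sample = [
    "NED", "Neda", "NEDC", "Nedda", "nedder", "Neddy", "Neddie", "neddies",
    "Neddra", "Nederland", "Nederlands", "Nedi", "Nedra", "Nedrah", "Nedry",
    "Nedrow", "Nedrud",
]

# Case-sensitive: 'ned', 'neddy' and 'nederlands' look missing.
print(sorted(missing_from(words_alpha_sample, words_sample)))
# Case-insensitive: nothing from the sample is missing.
print(sorted(missing_from(words_alpha_sample, words_sample, ignore_case=True)))
```

On these samples the apparent gap disappears once case is ignored, though that says nothing about the rest of the files.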

JaviSorribes commented 2 years ago

This problem does exist, however. I found 25 missing words with these Python 3 commands (pasted here for reference):

> import requests
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/master/words.txt')
> r.status_code
200
> w = set(r.text.lower().split())
> len(w)
466546
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt')
> r.status_code
200
> wa = set(r.text.lower().split())
> len(wa)
370103
> missing = wa - w
> len(missing)
25
> missing
{'preinferredpreinferring', 'stegnosisstegnotic', 'tangantangan', 'false', 'sturdiersturdies', 'peroxidicperoxiding', 'gynecicgynecidal', 'coevolvedcoevolves', 'preobtrudingpreobtrusion', 'kestrelkestrels', 'aliyahaliyahs', 'coracoprocoracoid', 'cylindrocylindric', 'killeekillee', 'antinganting', 'epigonousepigons', 'snailfishessnailflower', 'outwardsoutwarred', 'regeneratoryregeneratress', 'cryptocurrency', 'quadriquadric', 'subsultorysubsultus', 'brigantinebrigantines', 'caducecaducean', 'hypophypophysism'}

Note that there is a separate problem here: several entries appear to be two words merged together (e.g. "kestrelkestrels"). Even setting those aside, not all words in words_alpha.txt are in words.txt (e.g. "false").
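One way to separate the two problems is to check whether a missing entry splits into two known words. The following is a rough sketch only: `split_merged` is a hypothetical helper, and the `known` set is a tiny stand-in for the full word list, so whether each component really appears in the files is an assumption.

```python
def split_merged(entry, known_words):
    """Return (left, right) if entry is two known words glued together, else None."""
    for i in range(1, len(entry)):
        left, right = entry[:i], entry[i:]
        if left in known_words and right in known_words:
            return left, right
    return None


# Hypothetical stand-in for the full word set (assumed, not read from the files).
known = {"kestrel", "kestrels", "aliyah", "aliyahs", "brigantine", "brigantines"}

# A few entries taken from the `missing` set above.
missing_sample = [
    "kestrelkestrels", "aliyahaliyahs", "brigantinebrigantines",
    "false", "cryptocurrency",
]

merged = {e: split_merged(e, known) for e in missing_sample}
truly_missing = [e for e, parts in merged.items() if parts is None]

for entry, parts in merged.items():
    print(entry, "->", parts)
print("not explained by merging:", truly_missing)
```

With the full word set a short entry could split into two valid words by coincidence, so the result is a candidate list to review by hand rather than a definitive classification.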