Open carlosaguilarmelchor opened 3 years ago
I think this is just a case sensitivity issue.
$ cat words.txt|grep -i ^ned
NED
Neda
NEDC
Nedda
nedder
Neddy
Neddie
neddies
Neddra
Nederland
Nederlands
Nedi
Nedra
Nedrah
Nedry
Nedrow
Nedrud
While it would be nice for these files to be perfectly formatted, this is a good reminder to clean your data before doing calculations.
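For reference, here is a minimal sketch of that cleanup step. It assumes the word list is a local text file with one word per line. The names `normalize`, `load_words` and `contains` are illustrative and are not part of the repository.

```python
from pathlib import Path


def normalize(lines):
    """Return a set of stripped, lowercased, non-empty words."""
    return {line.strip().lower() for line in lines if line.strip()}


def load_words(path):
    """Read a word list (assumed: one word per line) and normalize it."""
    text = Path(path).read_text(encoding="utf-8")
    return normalize(text.splitlines())


def contains(words, candidate):
    """Case-insensitive membership test against a normalized set."""
    return candidate.strip().lower() in words
```

With this, something like `contains(load_words("words.txt"), "ned")` should match an entry such as NED regardless of its capitalization in the file.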
This problem does exist, however. I found 25 missing words with these Python 3 commands (pasted here for reference):
> import requests
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/words.txt')
> r.status_code
200
> w = set(r.text.lower().split())
> len(w)
466546
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/words_alpha.txt')
> r.status_code
200
> wa = set(r.text.lower().split())
> len(wa)
370103
> missing = wa - w
> len(missing)
25
> missing
{'preinferredpreinferring', 'stegnosisstegnotic', 'tangantangan', 'false', 'sturdiersturdies', 'peroxidicperoxiding', 'gynecicgynecidal', 'coevolvedcoevolves', 'preobtrudingpreobtrusion', 'kestrelkestrels', 'aliyahaliyahs', 'coracoprocoracoid', 'cylindrocylindric', 'killeekillee', 'antinganting', 'epigonousepigons', 'snailfishessnailflower', 'outwardsoutwarred', 'regeneratoryregeneratress', 'cryptocurrency', 'quadriquadric', 'subsultorysubsultus', 'brigantinebrigantines', 'caducecaducean', 'hypophypophysism'}
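The comparison in the session above could be wrapped in a small helper so it is easy to rerun. This is only a sketch: `missing_words` is a made-up name, and it takes the two file contents as plain strings (for example the `r.text` values from the session) so that it does not depend on how the files were fetched.

```python
def missing_words(subset_text, superset_text):
    """Return words found in subset_text but not in superset_text.

    Both inputs are lowercased and split on whitespace, mirroring the
    interactive session, so the comparison ignores case.
    """
    subset = set(subset_text.lower().split())
    superset = set(superset_text.lower().split())
    return subset - superset
```

If words_alpha.txt really were a subset of words.txt, calling this with the alpha text first and the full text second would return an empty set.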
Note that there is a separate problem here: several entries appear to be two words that were somehow merged together (e.g. "kestrelkestrels"). Even so, not every word in words_alpha.txt is in words.txt (e.g. "false").
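One rough way to tell the merged entries apart from the genuinely missing words is to check whether an entry splits into two words that are both already in the list. The sketch below assumes `vocab` is a set of lowercased words. `split_merged` is a hypothetical helper, and it can give false positives if the list contains very short entries.

```python
def split_merged(word, vocab):
    """Return every (left, right) split of word where both parts are in vocab."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]
```

An entry that yields at least one split is probably a concatenation artifact, while one that yields none (like "false") is more likely a word that is truly absent from words.txt.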
Example :
The documentation states that words_alpha.txt is a subset of words.txt, which apparently is not the case as of now.