Open Orivoir opened 3 years ago
Nice work but from 350000+ lines only around 2500 survived? Seems like the parameters used have been a little too strict...
I used a less strict filter on the frequency data of the same API and got ~30 000 words, but I think again that some of those words are not real English words. See 4971374b
~30 000 would be closer to reality, but it appears to have duplicated a bunch of words as well, which were not duplicated in the original words_alpha.txt. See bedrock, bedroll, bedroom, bedspread, bedstead as examples...
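Duplicates like the ones above are easy to spot programmatically. A minimal sketch (the sample list here is illustrative, not the real file contents):

```python
from collections import Counter

def find_duplicates(words):
    """Return, sorted, the words that appear more than once in the list."""
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n > 1)

# Illustrative sample only; for the real check, read words_alpha.txt line by line
sample = ["bedrock", "bedroll", "bedroom", "bedroll", "bedrock", "bedspread"]
print(find_duplicates(sample))  # -> ['bedrock', 'bedroll']
```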
The API is free for 2500 words per day. That is probably why....
@Orivoir did get ~30 000 words just by using different parameters, so that was probably not it.
Maybe it removes too many words. For example: blacklist is in it, but whitelist isn't; sale is not in, but sales is.
white lives matter, too [:joke:]
Hi all, I have run words_alpha.txt through the "nltk" Python library. The total word count is 210693. This seems to be a bit better, but I have noticed there are still a few oddities in there (maybe things like common abbreviations remain, which aren't actual words). But overall I think this has cleaned out any non-English words.
@SDidge appreciate the share!
@SDidge At first glance I can't seem to find any non-English words in the file, so I'd say this one is the cleanest file so far, nice work!
@SDidge, what exactly did you use from the NLTK library to check the list of words?
@Timokasse, I just checked if the word existed in the "words" corpus
E.g.:

from nltk.corpus import words

english = set(words.words())
clean = [word for word in words_alpha if word in english]

Something like this
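A self-contained version of that membership filter (using a tiny stand-in vocabulary here, since the real run requires `nltk.download('words')` first; with NLTK you would pass `words.words()` as the vocabulary):

```python
def filter_words(candidates, vocabulary):
    """Keep only the candidate words present in the reference vocabulary."""
    vocab = set(vocabulary)  # set membership is O(1) per word
    return [w for w in candidates if w in vocab]

# Stand-in vocabulary for illustration; not the actual NLTK corpus
vocabulary = ["bedrock", "bedroom", "whitelist", "sale"]
candidates = ["bedrock", "aaa", "whitelist", "zzz", "sale"]
print(filter_words(candidates, vocabulary))  # -> ['bedrock', 'whitelist', 'sale']
```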
Add file words_alpha_clean.txt, a copy of words_alpha.txt but with the words that do not exist in English removed. The filtering was done with the wordsapi API, which allows searching for English words: from a script I called the API for each word, and whenever a word did not exist I removed it from the file. You can find the API doc here. The exact filtering of a word is based on the frequency data of the API. The documentation gives the following text for frequency data:
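The per-word lookup described above could be sketched roughly as follows. The endpoint and header names assume WordsAPI is accessed through RapidAPI, and `word_exists` / `filter_word_list` are hypothetical names, not the author's actual script; the lookup is injected into the filter so the filtering logic can be run without network access.

```python
import urllib.error
import urllib.request

# Assumed WordsAPI endpoint via RapidAPI; check the official docs before use
API_URL = "https://wordsapiv1.p.rapidapi.com/words/{word}"

def word_exists(word, api_key):
    """Ask the API whether a word exists; an HTTP error (e.g. 404) means it does not."""
    req = urllib.request.Request(
        API_URL.format(word=word),
        headers={"X-RapidAPI-Key": api_key},
    )
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError:
        return False

def filter_word_list(words, exists):
    """Keep only the words for which the lookup predicate says they exist."""
    return [w for w in words if exists(w)]

# Offline demo with a stub predicate instead of the real API call
print(filter_word_list(["bedrock", "zzqx"], lambda w: w != "zzqx"))  # -> ['bedrock']
```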