dwyl / english-words

A text file containing 479k English words for all your dictionary/word-based projects, e.g. auto-completion / autosuggestion
The Unlicense
10.54k stars 1.84k forks

Strange words - bug? #59

Open ndvbd opened 5 years ago

ndvbd commented 5 years ago

There are words like "isn", "aren", and "wouldn". Smells like a bug?

nelsonic commented 5 years ago

@nadavb I suspect the apostrophes may have been removed by mistake. Feel free to re-add them in. PR gladly accepted.

ndvbd commented 5 years ago

How?

dbrakman commented 5 years ago

@nadavb As far as I can tell, the process for this project would be ad hoc. Open words.txt in a text editor, use a regex like (.*nt$).*n't$ to find pairs of lines, and delete the lines that look bad. Then remove copies of the deleted lines from words_alpha, update the corresponding zip files (why?), and submit a pull request.
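
A scripted version of that manual search might look like the sketch below. It is only an illustration, not anything from the repository: the function name find_contraction_fragments is made up, and it assumes the full contraction (e.g. "aren't") is still present in the list, so that a fragment can be spotted by checking whether adding "'t" produces another entry.

```python
def find_contraction_fragments(words):
    """Return entries that look like the left half of an "n't" contraction.

    A word is flagged when appending "'t" to it yields another entry in
    the list, e.g. "aren" is flagged because "aren't" is also present.
    (Hypothetical helper; assumes the full contractions are in the list.)
    """
    # Normalise: strip whitespace, lowercase, drop empty lines.
    vocabulary = {w.strip().lower() for w in words if w.strip()}
    return sorted(
        w for w in vocabulary
        if w.endswith("n") and w + "'t" in vocabulary
    )


if __name__ == "__main__":
    sample = ["are", "aren", "aren't", "can", "can't", "isn", "isn't", "wouldn"]
    for candidate in find_contraction_fragments(sample):
        print(candidate)
```

The result is a list of candidates, not a delete list: real words such as "can" get flagged too because "can't" exists, so each hit still needs a human look. A fragment like "wouldn" is missed whenever "wouldn't" itself is absent from the file.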

ndvbd commented 5 years ago

But where is the current code that generated words_alpha.txt from words.txt so we can modify it?

dbrakman commented 5 years ago

I don't think it was ever committed. What I see in the history is that someone just added a words_alpha file, and other people modified it directly.
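
If someone wanted to recreate such a script, a minimal sketch could look like the following. This is a guess at the intent, not the original code: it assumes words_alpha is meant to contain only the entries of words.txt that consist entirely of letters, and the names alpha_only and regenerate are hypothetical.

```python
def alpha_only(lines):
    """Yield entries made up solely of ASCII letters, whitespace stripped."""
    for line in lines:
        word = line.strip()
        # Skip blanks and anything with digits, hyphens, apostrophes, etc.
        if word and word.isascii() and word.isalpha():
            yield word


def regenerate(source_path="words.txt", target_path="words_alpha.txt"):
    """Write the letters-only subset of source_path to target_path.

    Both default paths are assumptions based on the file names in this thread.
    """
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, "w", encoding="utf-8") as dst:
        for word in alpha_only(src):
            dst.write(word + "\n")
```

Note that a filter like this drops "aren't" but keeps "aren", so it would not fix the fragments by itself; they would have to be removed from the source list first.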

PeskyPotato commented 5 years ago

Also is "giggish" actually a word?

dbrakman commented 5 years ago

@LameLemon I couldn't find a definition for "giggish," and it looks like it came from the original infochimps dataset. You can probably remove it.

To address the original issue of "are strange words a bug," I think we should say no and close the thread. The underlying reason for the presence of nonwords is the choice of data sources. More carefully curated corpora either cost more or have fewer words.

ndvbd commented 5 years ago

@dbrakman so can you commit it please? Otherwise people can't contribute to it...

dbrakman commented 5 years ago

@nadavb I understand why it should be committed, but I don't have that script. I didn't make these lists.

ndvbd commented 5 years ago

Ahh, I understand. So if one of the authors sees this thread, please commit it, thanks...

ndvbd commented 5 years ago

@dbrakman It won't help. The word "aren" is found in words.txt as well. So unless someone shows how the file words.txt was extracted from the corpus, I don't think this whole repository is usable at all.

campbellgoe commented 5 years ago

'aaa' isn't a word either

ShahoodulHassan commented 1 year ago

> Also is "giggish" actually a word?

Yes it is: https://www.wordnik.com/words/giggish