Open ndvbd opened 5 years ago
@nadavb I suspect the apostrophes may have been removed by mistake. Feel free to re-add them. PR gladly accepted.
How?
@nadavb As far as I can tell, it looks like the process would be ad hoc for this project. Open words.txt in a text editor, use a regex to find pairs of lines like (.*nt$).*n't$ and delete the lines that look bad. Then, remove copies of the deleted lines from words_alpha, update the corresponding zip files (why?), and submit a pull request.
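If the editor-and-regex route is too fiddly, the same manual process could be roughed out in Python. This is only a sketch, not the script that built the lists (none appears to have been committed). The function names `find_contraction_artifacts` and `drop_words` are made up for illustration, and it assumes the word list has one word per line and still contains the n't forms. The output is a list of candidates for human review, since real words like "can" and "wont" will show up next to fragments like "isn".

```python
def find_contraction_artifacts(words):
    """Return entries that might be leftovers of an n't contraction.

    For every word ending in "n't", two derived forms are checked:
    the fragment before the apostrophe ("isn't" -> "isn") and the
    word with the apostrophe dropped ("isn't" -> "isnt").
    Anything returned is only a suspect and needs manual review.
    """
    word_set = {w.strip().lower() for w in words if w.strip()}
    suspects = set()
    for w in word_set:
        if w.endswith("n't"):
            fragment = w[:-2]              # strip the trailing "'t"
            squashed = w.replace("'", "")  # drop the apostrophe only
            for candidate in (fragment, squashed):
                if candidate in word_set:
                    suspects.add(candidate)
    return sorted(suspects)


def drop_words(words, rejected):
    """Return words with every entry in rejected removed (case-insensitive)."""
    rejected = {r.lower() for r in rejected}
    return [w for w in words if w.strip().lower() not in rejected]


if __name__ == "__main__":
    sample = ["isn't", "isn", "is", "can't", "can", "aren't", "arent", "hello"]
    print(find_contraction_artifacts(sample))  # ['arent', 'can', 'isn']
    # After reviewing the suspects by hand, remove only the bad ones:
    print(drop_words(sample, {"isn", "arent"}))
```

To run it against the real files, read the lines of words.txt into the first function, decide by hand which suspects are bad, then pass that reviewed set to `drop_words` for both words.txt and words_alpha.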
But where is the current code that generated words_alpha.txt from words.txt so we can modify it?
I don't think it was ever committed. What I see in the history is that someone just added a words_alpha file, and other people modified it directly.
Also is "giggish" actually a word?
@LameLemon I couldn't find a definition for "giggish," and it looks like it came from the original infochimps dataset. You can probably remove it.
To address the original issue of "are strange words a bug," I think we should say no and close the thread. The underlying reason for the presence of nonwords is the choice of data sources. More carefully curated corpora either cost more or have fewer words.
@dbrakman so can you commit it please? Otherwise people can't contribute to it...
@nadavb I understand why it should be committed, but I don't have that script. I didn't make these lists.
Ahh, I understand. So if one of the authors sees this thread, please commit it, thanks...
@dbrakman It won't help. The word "aren" is found in words.txt as well. So unless someone shows how words.txt was extracted from the corpus, I don't think this whole repository is usable at all.
'aaa' isn't a word either
Also is "giggish" actually a word?
Yes it is: https://www.wordnik.com/words/giggish
There are words like "isn", "aren", and "wouldn". Smells like a bug?