Suggestion: take a few hundred online e-books and get all the words from them too

nelsonic commented 6 years ago

@MyTinyBrain that would be interesting but is kinda "beyond the scope" of this repository.

JonathanRys-ATC commented 6 years ago

Yeah, there are some important words missing here, like "tree" and "trie", which was slightly embarrassing when I was demoing my trie algorithm using this data and asked people to pick a word, with "tree"/"trie" being the obvious choice. Reading in text from e-books, splitting it on spaces, and adding each token to a Set would yield a decent result, but you might pick up non-English words too. "A Clockwork Orange", "The Jabberwocky", or Dr. Seuss, for instance, would yield nonsense words that would be hard to filter out. Scraping dictionary.com might be a better approach.
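
For anyone who wants to try the split-on-spaces idea, here is a minimal Python sketch. The function name `words_from_text` and the sample sentence are made up for illustration, and the punctuation stripping and lowercasing are my own assumptions, not anything this repository does.

```python
import string


def words_from_text(text):
    """Split on whitespace, trim surrounding punctuation, lowercase,
    and collect the purely alphabetic tokens into a set."""
    words = set()
    for token in text.split():
        cleaned = token.strip(string.punctuation).lower()
        if cleaned.isalpha():  # drops numbers, empty strings, mixed tokens
            words.add(cleaned)
    return words


# Hypothetical sample; a real run would read e-book text files instead.
sample = "The tree and the trie. 'Twas brillig, and the slithy toves!"
vocab = words_from_text(sample)
```

Note that "brillig", "slithy", and "toves" end up in `vocab` right next to "tree" and "trie", which is exactly the filtering problem with nonsense words.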

scottchiefbaker commented 5 years ago

I wrote a Perl script to do exactly what you're talking about for another project. It extracts anything that LOOKS like a word, so there is no validation that it isn't a proper name or an onomatopoeia.
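
This is not the Perl script itself, just a rough Python sketch of the same "anything that looks like a word" idea. The regex, the name `extract_wordlike`, and the frequency counting are all assumptions on my part, added only for illustration.

```python
import re
from collections import Counter

# Assumed definition of "looks like a word": ASCII letters,
# optionally joined by internal apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def extract_wordlike(text):
    """Count every lowercase word-like token; no dictionary validation."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))


# Hypothetical sample containing a proper name and an onomatopoeia.
counts = extract_wordlike("Bang! Alice said don't. Alice left.")
```

Both "alice" and "bang" are kept, since nothing here checks whether a token is a proper name or an onomatopoeia.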

dwyl / english-words

Suggestion: take a few hundred online e-books and get all the words from them too #44