Open MyTinyBrain opened 6 years ago
Yeah, there are some important words missing here, like "tree" and "trie". That was slightly embarrassing when I was demoing my trie algorithm using this data and asked people to pick a word, with "tree" or "trie" being the obvious choice. Sucking in text from ebooks, splitting it on spaces, and adding the tokens to a Set would yield a decent result, but you might pick up non-English words too. "A Clockwork Orange", "The Jabberwocky", or Dr. Seuss, for instance, would yield nonsense words that would be hard to filter out. Scraping dictionary.com might be a better approach.
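For illustration, here is a rough Python sketch of the ebook idea: split on whitespace, strip punctuation, and collect the results in a set. The function name `words_from_text` and the sample line are made up for this example and are not part of the repository. Note that the nonsense words survive, which is exactly the filtering problem described above.

```python
import string

def words_from_text(text):
    """Split on whitespace, strip surrounding punctuation, lowercase, deduplicate."""
    words = set()
    for token in text.split():
        # Remove leading/trailing punctuation such as quotes, commas, periods.
        word = token.strip(string.punctuation).lower()
        # Keep purely alphabetic tokens only (drops numbers and empty strings).
        if word.isalpha():
            words.add(word)
    return words

# Hypothetical sample input: a line of nonsense verse.
sample = "'Twas brillig, and the slithy toves did gyre and gimble in the wabe."
vocab = words_from_text(sample)
# "slithy" and "wabe" end up in the set alongside real words like "and".
```

Nothing in this approach can tell "slithy" from "the", so any real cleanup would need a second source, such as a dictionary, to validate against.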
I wrote a Perl script to do exactly what you're talking about for another project. It extracts anything that LOOKS like a word, so there is no validation that it isn't a proper name or an onomatopoeia.
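The Perl script itself isn't shown here, so the following is only a guess at the general idea, written in Python: a regular expression that grabs anything shaped like a word, with no check on whether it is a proper name or an onomatopoeia. The pattern and the name `extract_wordlike` are assumptions made for this sketch.

```python
import re

# Assumed pattern: runs of ASCII letters, optionally joined by apostrophes
# (so "didn't" stays one token). This is a guess, not the original script.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def extract_wordlike(text):
    """Return the set of lowercase tokens that merely look like words."""
    return {match.lower() for match in WORD_RE.findall(text)}

# Proper names ("Alice") and onomatopoeia ("Bang") pass straight through.
found = extract_wordlike("Bang! Alice didn't see the trie.")
```

As with the script described above, there is no validation step, so names and sound words are kept along with everything else.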
@MyTinyBrain that would be interesting but is kinda "beyond the scope" of this repository.