add raw word score and frequency datasets, and a python script to process them
add processed word datasets annotated with score and frequency in json format
Details
add build_datasets.py and helpers.py at src/data/scripts to process raw datasets, combine them, filter them down to desired sizes, and export to json format
add raw datasets
Broda List 03.2020 trimmed by Diehl.txt: scored words (source)
CLUED list to share ranked.txt: scored words (source)
spreadthewordlist.txt: scored words (source)
xwordlist.txt: scored words (source)
crossfire_default.txt: scored words (source)
unigram_freq.csv: word frequencies (source)
bad_words.txt: bad words to filter out (source)
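The parsing step for the scored raw lists could look roughly like the sketch below. The `WORD;50` line format, the `load_scored_list` name, and the uppercasing are assumptions for illustration, not the actual helpers.py code; each raw source may need its own parser.

```python
from pathlib import Path

def load_scored_list(path: Path, delimiter: str = ";") -> dict[str, int]:
    """Parse a raw word list of `WORD<delimiter>score` lines into a dict.

    The `WORD;50` line format is an assumption; the real files may
    differ per source and need dedicated parsers.
    """
    scores: dict[str, int] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or delimiter not in line:
            continue
        word, _, raw_score = line.partition(delimiter)
        word = word.strip().upper()
        try:
            score = int(raw_score.strip())
        except ValueError:
            continue  # skip malformed rows rather than abort the build
        # Resolve duplicate entries within one file by keeping the max.
        scores[word] = max(score, scores.get(word, 0))
    return scores
```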
add processed datasets containing the top-scored words, with frequencies where provided, for a given target dataset size (actual sizes deviate slightly due to rounding)
data_small.json: target 1,000 words, actual 977
data_medium.json: target 5,000 words, actual 4,978
data_large.json: target 10,000 words, actual 9,971
data_xlarge.json: target 50,000 words, actual 49,973
data_giant.json: target 100,000 words, actual 99,972
data_all.json: target all words, actual 637,262
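The size-trimming and JSON export could be sketched as follows. The `export_top_words` name, the entry shape, and the tie-handling rule (dropping every entry at the cutoff score, which would explain actual sizes landing just under their targets, e.g. 977 vs 1,000) are assumptions for illustration, not the actual build_datasets.py logic.

```python
import json

def export_top_words(entries: list[dict], target: int, out_path: str) -> int:
    """Keep roughly the `target` highest-scored entries and write JSON.

    Entries tied at the cutoff score are dropped together, so the
    actual size can fall slightly short of the target. `entries` is
    assumed to be a list of dicts shaped like
    {"word": ..., "score": ..., "frequency": ...}.
    """
    ranked = sorted(entries, key=lambda e: e["score"], reverse=True)
    if target >= len(ranked):
        kept = ranked
    else:
        cutoff = ranked[target - 1]["score"]
        # Drop everything at the cutoff score so the boundary is clean.
        kept = [e for e in ranked if e["score"] > cutoff]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(kept, f, indent=2)
    return len(kept)
```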
known dataset issues
nonuniform scoring of words shared between multiple raw datasets is resolved by choosing the higher score, which puts more weight on datasets that assign higher scores
overly simple and aggressive filtering of bad words: candidate words are filtered out if they contain any bad word as a substring, to help remove plural forms, past tenses, etc.
processed datasets may still contain bad words or phrases due to the aforementioned rudimentary filtering
potential future bias towards single-word answers over phrases due to the use of word frequency
nonuniform distribution shape between dataset sizes due to limited words available for some word lengths
processed datasets disproportionately exclude people and other proper nouns from marginalized backgrounds because the raw datasets do so as well; more intentionally diverse datasets must be added in the future
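The two behaviors behind the first two known issues (max-score merging across sources, and substring-based bad-word filtering) can be sketched as below. The function names and signatures are hypothetical, not the actual helpers.py API; the test word choices also show the false-positive cost of substring matching mentioned above.

```python
def merge_scores(datasets: list[dict[str, int]]) -> dict[str, int]:
    """Combine per-source score dicts, keeping the highest score for
    words that appear in more than one source. This biases results
    toward sources that score generously."""
    merged: dict[str, int] = {}
    for scores in datasets:
        for word, score in scores.items():
            merged[word] = max(score, merged.get(word, 0))
    return merged

def filter_bad_words(words: list[str], bad_words: set[str]) -> list[str]:
    """Reject any candidate containing a bad word as a substring.
    Deliberately aggressive: this also catches plurals and past
    tenses, at the cost of rejecting innocent words that merely
    contain a flagged substring."""
    return [w for w in words if not any(b in w for b in bad_words)]
```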
moved existing datasets to unscored subdirectory
new processed datasets are not actually used by cw_gen yet; a future PR is needed to correct this and #6
Testing
the existing test suite passes in full
the generated command line tool still successfully retrieves the datasets moved to src/data/unscored/