add raw word score and frequency datasets, and a python script to process them
add processed word datasets annotated with score and frequency in json format
Details
add build_datasets.py and helpers.py at src/data/scripts to process raw datasets, combine them, filter them down to desired sizes, and export to json format
add raw datasets
Broda List 03.2020 trimmed by Diehl.txt: scored words (source)
CLUED list to share ranked.txt: scored words (source)
spreadthewordlist.txt: scored words (source)
xwordlist.txt: scored words (source)
crossfire_default.txt: scored words (source)
unigram_freq.csv: word frequencies (source)
bad_words.txt: bad words to filter out (source)
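The parsing step for the scored raw lists could look roughly like the sketch below. The `WORD;50` line format, the `load_scored_list` name, and the uppercasing are assumptions for illustration, not the actual helpers.py code; each raw source may need its own parser.

```python
from pathlib import Path

def load_scored_list(path: Path, delimiter: str = ";") -> dict[str, int]:
    """Parse a raw word list of `WORD<delimiter>score` lines into a dict.

    The `WORD;50` line format is an assumption; the real files may
    differ per source and need dedicated parsers.
    """
    scores: dict[str, int] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or delimiter not in line:
            continue
        word, _, raw_score = line.partition(delimiter)
        word = word.strip().upper()
        try:
            score = int(raw_score.strip())
        except ValueError:
            continue  # skip malformed rows rather than abort the build
        # Resolve duplicate entries within one file by keeping the max.
        scores[word] = max(score, scores.get(word, 0))
    return scores
```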
add processed datasets containing the top-scored words, with frequencies where provided, for a given target dataset size (actual sizes deviate slightly due to rounding)
data_small.json: target 1,000 words, actual 977
data_medium.json: target 5,000 words, actual 4,978
data_large.json: target 10,000 words, actual 9,971
data_xlarge.json: target 50,000 words, actual 49,973
data_giant.json: target 100,000 words, actual 99,972
data_all.json: target all words, actual 637,262
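The size-trimming and JSON export could be sketched as follows. The `export_top_words` name, the entry shape, and the tie-handling rule (dropping every entry at the cutoff score, which would explain actual sizes landing just under their targets, e.g. 977 vs 1,000) are assumptions for illustration, not the actual build_datasets.py logic.

```python
import json

def export_top_words(entries: list[dict], target: int, out_path: str) -> int:
    """Keep roughly the `target` highest-scored entries and write JSON.

    Entries tied at the cutoff score are dropped together, so the
    actual size can fall slightly short of the target. `entries` is
    assumed to be a list of dicts shaped like
    {"word": ..., "score": ..., "frequency": ...}.
    """
    ranked = sorted(entries, key=lambda e: e["score"], reverse=True)
    if target >= len(ranked):
        kept = ranked
    else:
        cutoff = ranked[target - 1]["score"]
        # Drop everything at the cutoff score so the boundary is clean.
        kept = [e for e in ranked if e["score"] > cutoff]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(kept, f, indent=2)
    return len(kept)
```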
known dataset issues
nonuniform scoring of words shared between multiple raw datasets is resolved by choosing the higher score, which puts more weight on datasets that assign higher scores
overly simple and aggressive filtering of bad words: candidate words are filtered out if they contain any bad word as a substring, to help remove plural forms, past tenses, etc.
processed datasets may still contain bad words or phrases due to the aforementioned rudimentary filtering
potential future bias towards single-word answers over phrases due to the use of word frequency
nonuniform distribution shape between dataset sizes due to limited words available for some word lengths
processed datasets disproportionately exclude people and other proper nouns from marginalized backgrounds because the raw datasets do so as well; more intentionally diverse datasets must be added in the future
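The two behaviors behind the first two known issues (max-score merging across sources, and substring-based bad-word filtering) can be sketched as below. The function names and signatures are hypothetical, not the actual helpers.py API; the test word choices also show the false-positive cost of substring matching mentioned above.

```python
def merge_scores(datasets: list[dict[str, int]]) -> dict[str, int]:
    """Combine per-source score dicts, keeping the highest score for
    words that appear in more than one source. This biases results
    toward sources that score generously."""
    merged: dict[str, int] = {}
    for scores in datasets:
        for word, score in scores.items():
            merged[word] = max(score, merged.get(word, 0))
    return merged

def filter_bad_words(words: list[str], bad_words: set[str]) -> list[str]:
    """Reject any candidate containing a bad word as a substring.
    Deliberately aggressive: this also catches plurals and past
    tenses, at the cost of rejecting innocent words that merely
    contain a flagged substring."""
    return [w for w in words if not any(b in w for b in bad_words)]
```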
moved existing datasets to unscored subdirectory
new processed datasets are not actually used by cw_gen yet; a future PR is needed to correct this and #6
Testing
the existing test suite passes in full
the generated command line tool still successfully retrieves the datasets moved to src/data/unscored/