Closed JRJurman closed 3 months ago
Nice work! For the "statistics" issue, it looks like dict.json is missing a ST*BGS stroke for "statistics", which is in the Plover dictionaries. Given that Plover says STEUBGS is for "sticks", I think you could potentially submit a PR that does the following:
"STEUBGS": "sticks" entry in dict.json
"STEUBGS": "statistics" entry in dict.json to "ST*BGS": "statistics"
"STEUBGS": "statistics" entry in top-10000-project-gutenberg-words.json to "ST*BGS": "statistics"
I can make a PR with those changes later today 👍
It didn't even occur to me to look at the original Plover dictionary for a resolution ✨. Do you want me to investigate the other conflicts? I can at least give a cursory look to see if there are other easy resolutions... although I feel like numbers and modifiers will be harder to tackle.
I can make those separate PRs too, since I wouldn't want to hold up this change.
Do you want me to investigate the other conflicts? I can at least give a cursory look to see if there are other easy resolutions
I'd say go for it! Anything that makes the dictionaries better for all of us steno learners is a win!
Thanks for putting this together @JRJurman 👏
Yes, while JSON permits duplicate keys, it's not ideal in practical use. I believe Plover will see every entry, but overwrite previous entries when it finds an outline that already exists, so it's ok to have globally duplicate keys across dictionaries, so long as you know what order to keep your dictionaries in. It's more of an issue to have duplicates within dictionaries as we have here.
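To make the overwrite behaviour concrete, here is a minimal sketch using Python's standard json module as a stand-in. It is not Plover's actual loader, and the entries and the two-dictionary load order are hypothetical, chosen only to mirror the "sticks"/"statistics" case.

```python
import json

# Hypothetical file contents: the same outline appears twice in one dictionary.
raw = '{"STEUBGS": "sticks", "STEUBGS": "statistics"}'

# Python's json module silently keeps the last value seen for a repeated key,
# so "sticks" disappears without any warning.
within_file = json.loads(raw)

# Hypothetical stand-in for loading two dictionaries in a known order:
# entries from the dictionary applied later overwrite earlier ones.
base = {"STEUBGS": "statistics", "ST*BGS": "statistics"}
override = {"STEUBGS": "sticks"}
merged = {**base, **override}

print(within_file)
print(merged)
```

Across files the result is predictable as long as the order is known, whereas within a single file the losing entry is simply dropped at parse time.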
For the other dictionaries with duplicates, it will be handy to have separate issues and PRs for those to discuss how to resolve some of the duplicates and ship them one after another.
bad-habits.json could possibly continue to have duplicates. It's an accumulation of bad entries from different places, so one key with 2 values might both be wrong and worth marking as bad habits. Doesn't deal with the ambiguity of the keys, but it's not super important to fix.
condensed-strokes.json would be great to fix soon. I've just pushed a branch for fixing the duplicates in numbers.json.
modifiers.json would be worth regenerating from the original script that built it the first time. I'd have to dig up where that came from so it can be updated and re-run.
Thanks @paulfioravanti for outlining the resolution to "sticks" and "statistics". These look good!
On futureproofing, it might be nice to set up Travis CI to highlight introduced duplicates on PRs to prevent regressions.
As a reminder to myself, here's the link to a convenient online tool for validating JSON that also highlights duplicates: Miscue-js -- JSON validation.
If there are scripts to generate some of these .json files, it might be worthwhile keeping them in this repository (under a scripts folder or something?). This would be useful if we wanted to include a duplicates checker or other dictionary generators in the project.
That being said, I do love that for the most part this is just a bunch of json files and it doesn't require reading or installing anything to get the dictionaries.
I would tend to agree with the latter part of your comment. Just keep this repo as it says on the tin: steno-dictionaries.
Unless @didoesdigital wants to continue to maintain and evolve the scripts under her account somewhere, perhaps in a separate repo, I'd say they'd have value as one of your personal projects, @JRJurman.
Between https://github.com/didoesdigital/typey-type/issues/6 and https://github.com/didoesdigital/typey-type-data/issues/1, I hope to make the private static lesson generator redundant and therefore have fewer scripts using the json files. Most of the 'generated' .json files are stored in https://github.com/didoesdigital/typey-type-data/ at the moment. I'm still on the fence about where the bulk of the logic should live for checking the quality of dictionaries and lessons, and what should be generated and stored vs figured out on the fly (e.g. specific dictionaries could be built in app from lessons). For now, I think I want to keep this repo fairly basic.
Hey @didoesdigital, would you mind a PR that removes duplicated entries from modifiers.json?
I just ran jq over the file, because it was hard to figure out what's going on when looking at some modifier strokes with grep.
Sure! Let’s have a look at a PR and see what’s going on there.
The PRs to remove duplicates were merged years ago, so I think we can close this. The Typey Type CLI is also open source now showing all the scripts that use these dictionaries to build Typey Type lessons and support the app. Thanks!
Summary
When looking at the project files in VS Code, I realized that it had highlighted a problem in top-10000-project-gutenberg-words.json. The problem was a duplicate key in JSON, which, while technically valid, is probably not super useful in Plover's case. Most JSON parsers (and I'm guessing Plover included) will ignore the first entry and just show the second (and indeed, if I filter for sticks in the Plover dictionary editor, it does not show up).
Potential Solution
I'm not entirely sure how the chords are made for these words, but this feels like something that just needs to be updated, potentially by changing sticks to STEUBG/-S.
Futureproofing
There is a node module, find-duplicated-property-keys, that takes in a dictionary and prints out if there are any duplicated keys. I ran this using the following script:
I install duplicated-property-keys globally (requires node on the machine... technically I could use npx, but the command is already kinda slow on these larger files, and doing an install for every file is overkill).
I then run a bash for loop that runs the command, passing in every dictionary in the dictionaries/ folder. The output is forwarded to a duplicates_log.txt, however this part could be removed to just show the output on the command line. It looks something like this:
And I got a lot of duplicated key warnings, in 12 different files.
bad-habits.json, code.json, condensed-strokes.json, currency.json, dict-en-AU-vocab.json, javascript.json, medical-suffixes.json, nouns.json, punctuation-di.json, and top-10000-project-gutenberg-words.json all had a small handful. modifiers.json has around 2000, numbers.json has around 70.
I've uploaded the output here: https://gist.github.com/JRJurman/ba259871c67f7e086fac01797a72f11a
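For anyone without node installed, the same kind of check can be sketched in Python with only the standard library. This is not the script used above and does not use find-duplicated-property-keys; the function names find_duplicate_keys and check_folder are made up for illustration, and the dictionaries folder name is simply the one mentioned in this issue.

```python
import json
from collections import Counter
from pathlib import Path


def find_duplicate_keys(text):
    """Return keys that occur more than once within any single JSON object in text."""
    duplicates = []

    def record_pairs(pairs):
        # json.loads passes every object's key/value pairs here before
        # duplicates are collapsed, so repeated keys are still visible.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_pairs)
    return duplicates


def check_folder(folder):
    """Map each .json file name in folder to its duplicated keys.

    Files without duplicates are left out of the result.
    """
    report = {}
    for path in sorted(Path(folder).glob("*.json")):
        dupes = find_duplicate_keys(path.read_text(encoding="utf-8"))
        if dupes:
            report[path.name] = dupes
    return report
```

Calling check_folder("dictionaries") from the repository root would give a per-file list of duplicated outlines. A CI job could fail a PR whenever that result is non-empty, though how to wire that up is left open here.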