didoesdigital / steno-dictionaries

Di's Plover-theory stenography dictionaries used by Typey Type for Stenographers.
GNU General Public License v2.0

Duplicate Keys in dictionaries ("sticks" / "statistics") #176

Closed JRJurman closed 3 months ago

JRJurman commented 4 years ago

Summary

When looking at the project files in VS Code, I noticed that it had highlighted a problem in top-10000-project-gutenberg-words.json. The problem is a duplicate key in the JSON, which, while technically valid, is probably not very useful in Plover's case. Most JSON parsers (and I'm guessing Plover's included) will ignore the first entry and keep only the second (and indeed, if I filter for sticks in the Plover dictionary editor, it does not show up).
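To illustrate the "last entry wins" behaviour, here is a minimal Python sketch. It uses Python's standard json module, not Plover's own loading code, and the two-entry snippet is a made-up stand-in for the real dictionary file, so treat it as a demonstration of typical parser behaviour rather than of Plover itself.

```python
import json

# Hypothetical excerpt: the same outline appears twice in one JSON object.
snippet = '{"STEUBGS": "sticks", "STEUBGS": "statistics"}'

# Python's json module keeps only the last value for a repeated key.
entries = json.loads(snippet)
print(entries)  # prints {'STEUBGS': 'statistics'}
```

The "sticks" entry is gone as soon as the file is parsed, which would explain why it cannot be found in a dictionary editor that works from the parsed data.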

Potential Solution

I'm not entirely sure how the chords are made for these words, but this feels like something that just needs to be updated, potentially by changing sticks to STEUBG/-S.

Futureproofing

There is a node module, find-duplicated-property-keys, that takes in a dictionary and prints any duplicated keys it finds. I ran it using the following script:

npm i -g find-duplicated-property-keys
for dictionary in dictionaries/*.json; do find-duplicated-property-keys -s "$dictionary"; done > duplicates_log.txt

I install find-duplicated-property-keys globally (this requires node on the machine; technically I could use npx, but the command is already fairly slow on these larger files, and doing an install for every file is overkill).

I then run a bash for loop that runs the command on every dictionary in the dictionaries/ folder. The output is redirected to duplicates_log.txt, but that part could be removed to show the output on the command line instead. It looks something like this:

The following duplicated property keys have been detected in dictionaries/top-10000-project-gutenberg-words.json:
<instance>.STEUBGS
No duplicated property keys found in dictionaries/top-1000-words.json.
No duplicated property keys found in dictionaries/top-100-words.json.
No duplicated property keys found in dictionaries/top-200-words-spoken-on-tv.json.

And I got a lot of duplicated key warnings, in 12 different files.

bad-habits.json, code.json, condensed-strokes.json, currency.json, dict-en-AU-vocab.json, javascript.json, medical-suffixes.json, nouns.json, punctuation-di.json, and top-10000-project-gutenberg-words.json all had a small handful.

modifiers.json has around 2000, numbers.json has around 70.

I've uploaded the output here: https://gist.github.com/JRJurman/ba259871c67f7e086fac01797a72f11a
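For anyone without node installed, roughly the same check can be sketched in Python with the standard json module. This is not the find-duplicated-property-keys implementation; the function name find_duplicate_keys and the sample string are made up for illustration.

```python
import json
from collections import Counter


def find_duplicate_keys(json_text):
    """Return a sorted list of keys that repeat within any single JSON object."""
    duplicates = set()

    def record_pairs(pairs):
        # pairs holds every (key, value) of one object, repeats included.
        counts = Counter(key for key, _ in pairs)
        duplicates.update(key for key, count in counts.items() if count > 1)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=record_pairs)
    return sorted(duplicates)


# Hypothetical dictionary excerpt with one repeated outline.
sample = '{"STEUBGS": "sticks", "STEUBG": "stick", "STEUBGS": "statistics"}'
print(find_duplicate_keys(sample))  # prints ['STEUBGS']
```

The object_pairs_hook parameter is what makes this possible: it hands over the raw key/value pairs before they are collapsed into a dict, so repeated keys are still visible.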

paulfioravanti commented 4 years ago

Nice work! For the "statistics" issue, it looks like dict.json is missing a ST*BGS stroke for "statistics", which is in the Plover dictionaries. Given that Plover says STEUBGS is for "sticks", I think you could potentially submit a PR that does the following:

JRJurman commented 4 years ago

I can make a PR with those changes later today 👍

It didn't even occur to me to look at the original Plover dictionary for a resolution ✨. Do you want me to investigate the other conflicts? I can at least give them a cursory look to see if there are other easy resolutions, although I feel like numbers and modifiers will be harder to tackle.

JRJurman commented 4 years ago

I can make those separate PRs too, since I wouldn't want to hold up this change.

paulfioravanti commented 4 years ago

Do you want me to investigate the other conflicts? I can at least give them a cursory look to see if there are other easy resolutions

I'd say go for it! Anything that makes the dictionaries better for all of us steno learners is a win!

didoesdigital commented 4 years ago

Thanks for putting this together @JRJurman 👏

Yes, while JSON permits duplicate keys, it's not ideal in practical use. I believe Plover will see every entry, but overwrite previous entries when it finds an outline that already exists, so it's ok to have globally duplicate keys across dictionaries, so long as you know what order to keep your dictionaries in. It's more of an issue to have duplicates within dictionaries as we have here.
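The difference between the two cases can be sketched as follows. This is a simplified model and not Plover's code: the lookup function, the assumption that dictionaries are searched in priority order, and the dictionary contents are all hypothetical.

```python
import json


def lookup(outline, dictionaries):
    """Return the first translation found; dictionaries are assumed to be in priority order."""
    for entries in dictionaries:
        if outline in entries:
            return entries[outline]
    return None


# Hypothetical contents of two separate dictionary files.
personal = json.loads('{"STEUBGS": "sticks"}')
main = json.loads('{"STEUBGS": "statistics", "ST*BGS": "statistics"}')

# Across files, both entries survive parsing and the order decides the result.
print(lookup("STEUBGS", [personal, main]))  # prints sticks
print(lookup("STEUBGS", [main, personal]))  # prints statistics

# Within one file, the first entry is lost at parse time, whatever the order.
within_file = json.loads('{"STEUBGS": "sticks", "STEUBGS": "statistics"}')
print(len(within_file))  # prints 1
```

With cross-dictionary duplicates the user still controls the outcome by reordering; with a duplicate inside one file there is nothing left to reorder.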

For the other dictionaries with duplicates, it will be handy to have separate issues and PRs for those to discuss how to resolve some of the duplicates and ship them one after another.

bad-habits.json could possibly continue to have duplicates. It's an accumulation of bad entries from different places, so both values for one key might be wrong and worth marking as bad habits. That doesn't deal with the ambiguity of the keys, but it's not super important to fix.

condensed-strokes.json would be great to fix soon. I've just pushed a branch for fixing the duplicates in numbers.json.

modifiers.json would be worth regenerating from the original script that built it the first time. I'd have to dig up where that came from so it can be updated and re-run.

Thanks @paulfioravanti for outlining the resolution to "sticks" and "statistics". These look good!

didoesdigital commented 4 years ago

On futureproofing, it might be nice to set up Travis CI to highlight introduced duplicates on PRs to prevent regressions.
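A check along these lines could be a small script that a CI job runs and that fails when any dictionary has a duplicate key. The sketch below is only one possible shape for it: the names reject_duplicates and check_directory are invented here, and no particular CI configuration is assumed.

```python
import json
from pathlib import Path


def reject_duplicates(pairs):
    """object_pairs_hook that raises as soon as a key repeats within one object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen[key] = value
    return seen


def check_directory(directory):
    """Return 1 if any .json file in directory fails the check, else 0."""
    status = 0
    for path in sorted(Path(directory).glob("*.json")):
        try:
            text = path.read_text(encoding="utf-8")
            json.loads(text, object_pairs_hook=reject_duplicates)
        except ValueError as error:
            # json.JSONDecodeError is a ValueError, so invalid JSON is reported too.
            print(f"{path}: {error}")
            status = 1
    return status
```

A CI step could then call sys.exit(check_directory("dictionaries")) so that a non-zero exit status marks the PR as failing. Note that this version reports only the first duplicate it meets in each file.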

didoesdigital commented 4 years ago

As a reminder to myself, here's the link to a convenient online tool for validating JSON that also highlights duplicates: Miscue-js -- JSON validation.

JRJurman commented 4 years ago

If there are scripts to generate some of these .json files, it might be worthwhile keeping them in this repository (under a scripts folder or something?). This would be useful if we wanted to include a duplicates checker or other dictionary generators in the project.

That being said, I do love that for the most part this is just a bunch of json files and it doesn't require reading or installing anything to get the dictionaries.

paulfioravanti commented 4 years ago

I would tend to agree with the latter part of your comment. Just keep this repo as it says on the tin: steno-dictionaries.

Unless @didoesdigital wants to continue to maintain and evolve the scripts under her account somewhere, perhaps in a separate repo, I'd say they'd have value as one of your personal projects, @JRJurman.

didoesdigital commented 4 years ago

Between https://github.com/didoesdigital/typey-type/issues/6 and https://github.com/didoesdigital/typey-type-data/issues/1, I hope to make the private static lesson generator redundant and therefore have fewer scripts using the json files. Most of the 'generated' .json files are stored in https://github.com/didoesdigital/typey-type-data/ at the moment. I'm still on the fence about where the bulk of the logic should live for checking the quality of dictionaries and lessons, and what should be generated and stored vs figured out on the fly (e.g. specific dictionaries could be built in app from lessons). For now, I think I want to keep this repo fairly basic.

timon commented 2 years ago

Hey @didoesdigital, would you mind a PR that removes duplicated entries from modifiers.json? I just ran jq over the file, because it was hard to figure out what's going on when looking up some modifier strokes with grep.
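The exact jq invocation is not given above. As a rough Python equivalent of the idea, parsing a file and writing it back out drops repeated keys, because the parser keeps only the last value for each one. The function name dedupe and the sample input are hypothetical.

```python
import json


def dedupe(json_text):
    """Re-serialise JSON so each key appears once, keeping the last value seen."""
    entries = json.loads(json_text)
    # indent=0 puts each entry on its own line; ensure_ascii=False keeps
    # non-ASCII characters readable instead of escaping them.
    return json.dumps(entries, ensure_ascii=False, indent=0)


# Hypothetical input with one repeated key.
sample = '{"A": "first", "B": "other", "A": "last"}'
print(dedupe(sample))
```

Since the earlier value is discarded silently, it is worth reviewing the diff to confirm that the surviving value is the intended one.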

didoesdigital commented 2 years ago

Sure! Let’s have a look at a PR and see what’s going on there.

didoesdigital commented 3 months ago

The PRs to remove duplicates were merged years ago, so I think we can close this. The Typey Type CLI is also open source now, showing all the scripts that use these dictionaries to build Typey Type lessons and support the app. Thanks!