JRJurman commented 4 years ago

Summary

In the current top-10000-project-gutenberg-words.json there are several multi-stroke fingerspelled outlines that are saved as a single word:

"KWR*/*E": "ye",
"T*/*E": "te",
"R*/*E": "re",
"TPH*/*EU": "ni",
...

However, this means that fingerspelling any word containing these letter combinations puts a space in it. For example, if I wanted to fingerspell the word "niceties", the word would appear as

ni ceties

I'm fairly new to steno, but I believe the expected output would be that fingerspelled words are not interrupted with a space, so it should just appear as

niceties

Again, I'm fairly new, so if I'm completely missing something feel free to correct me and close the issue 😄

Potential Solution

I'd hate to be misleading and remove some of the 10000 words in the gutenberg words list, but I think the solution might be to cut out any strokes that are just fingerspellings (if you're already fingerspelling each letter out, then I imagine most people are expecting to add a space anyways).

I can make a query and a PR that replaces these in the dictionaries (specifically in the top-10000, or others as well). There might be some that need review (like "Holt", which is capitalized), but we could go over those individually in the PR review.

JRJurman commented 4 years ago

Here is the list, produced by a simple parser that looks for entries that have the same number of strokes as letters, where each stroke has * in it.

"P*": "p"
"*E/R*": "er"
"S*P/TPH*": "Sn"
"KWR*/*E": "ye"
"*E/T*/KWR*": "ety"
"*U": "u"
"KR*": "c"
"H*/O*/TPH*/O*/*U/R*": "honour"
"KW*": "q"
"P*P/W*P/H*P": "PWH"
"S*P/TK*P": "SD"
"*E": "e"
"O*/PW*/S*": "obs"
"TP*/TPH*": "fn"
"*E/TPH*": "en"
"W*": "w"
"KWR*": "y"
"O*": "o"
"TPH*": "n"
"S*P/H*/A*/K*": "Shak"
"TK*": "d"
"PW*": "b"
"S*/P*/A*/K*/*E": "spake"
"A*/TK*/SR*": "adv"
"TPH*/A*": "na"
"TK*/*E/HR*": "del"
"PW*P/KR*P": "BC"
"PH*": "m"
"TP*": "f"
"P*/S*/*E/*U/TK*": "pseud"
"R*": "r"
"PW*/R*": "br"
"*E/*EU/TPH*": "ein"
"S*/H*/*E/W*": "shew"
"KW*/*U/*EU/T*/T*/*E/TK*": "quitted"
"TPH*P/PH*P": "NM"
"*EU/TPH*/TK*/*EU/*E/S*": "Indies"
"TKPW*": "g"
"TPH*/O*/SR*": "nov"
"K*": "k"
"A*/*U/TKPW*": "aug"
"*E/S*/KW*": "Esq"
"P*/TKPW*": "pg"
"TP*P/*E/PW*": "Feb"
"HR*/PW*": "lb"
"PH*/A*/R*/*E": "mare"
"TPH*P/A*/TPH*": "Nan"
"KR*P/HR*/A*/R*/*E": "Clare"
"S*P/P*/*E/TPH*/S*/*E/R*": "Spenser"
"SKWR*P/*U/TK*/A*/H*": "Judah"
"SKWR*/R*": "jr"
"W*/*E/A*/R*/*EU/*E/TK*": "wearied"
"W*/*EU/TK*": "wid"
"TPH*P/HR*P": "NL"
"P*/PH*": "pm"
"KWR*/O*/TPH*": "yon"
"KR*/*U/PH*": "cum"
"T*/*E": "te"
"W*/A*/TPH*": "wan"
"PH*P/A*/R*/*EU/A*/TPH*": "Marian"
"S*/*E/STKPW*": "sez"
"KR*/O*": "co"
"H*": "h"
"TK*/*E/KWR*": "dey"
"KR*/O*/TPH*/TPH*/*E/KP*/*EU/O*/TPH*": "connexion"
"A*P/K*P": "AK"
"A*P/KR*P": "AC"
"TPH*/*EU": "ni"
"PH*/A*/KR*/A*/*U/HR*/A*/KWR*": "Macaulay"
"PH*/A*/R*/*EU/*U/S*": "Marius"
"TPH*/O*/*U/S*": "nous"
"*E/HR*/*EU/STKPW*/A*": "Eliza"
"PH*/*EU": "mi"
"HR*/HR*": "ll"
"A*/TP*/T*": "aft"
"HR*": "l"
"H*/*E/T*/T*/KWR*": "Hetty"
"TK*/*U/R*/S*/T*": "durst"
"W*/*E/R*/T*": "wert"
"R*/A*/*EU/PH*/*E/TPH*/T*": "raiment"
"H*P/*E/TPH*/R*/*EU/*E/T*/T*/A*": "Henrietta"
"R*/*E": "re"
"H*/O*/HR*/T*": "Holt"
"KR*P/KWR*/R*/*EU/HR*": "Cyril"
"PH*P/TK*P": "MD"
"S*/TKPW*": "sg"
"A*/R*/*EU/TKPW*/H*/T*": "aright"
"TK*/*E/*U/KR*/*E": "deuce"
"PH*/*EU/*E/TPH*": "mien"
"KR*/O*/R*": "cor"
"S*P/O*/A*/PH*/*E/S*": "Soames"
"TP*P/T*P/P*P": "FTP"
"T*": "t"
"S*P/*E/*EU/TPH*/*E": "Seine"
"TK*P/A*/TPH*/*U/PW*/*E": "Danube"
"*URP/KRA*": "CA"

I have the changes locally to remove this from top-10000-project-gutenberg-words.json, and a script that can strip these out of any dictionary file (https://github.com/JRJurman/steno-scripts/blob/master/find-fingerspellings.js). So, I can make a PR if you like, for this and any other dictionary where it seems relevant.
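For anyone skimming, the heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the code in find-fingerspellings.js: the function names `looksFingerspelled` and `partitionFingerspellings` are hypothetical, and the "TPHAOEUS" entry is only an illustrative ordinary brief.

```javascript
// Hypothetical helper, not the code from find-fingerspellings.js.
// An entry looks fingerspelled when every stroke contains "*" and the
// number of strokes equals the number of characters in the translation.
function looksFingerspelled(outline, translation) {
  const strokes = outline.split("/");
  return (
    strokes.length === translation.length &&
    strokes.every((stroke) => stroke.includes("*"))
  );
}

// Split a dictionary object into entries to keep and entries to review.
function partitionFingerspellings(dictionary) {
  const kept = {};
  const flagged = {};
  for (const [outline, translation] of Object.entries(dictionary)) {
    const target = looksFingerspelled(outline, translation) ? flagged : kept;
    target[outline] = translation;
  }
  return { kept, flagged };
}

const sample = {
  "TPH*/*EU": "ni",
  "H*/O*/HR*/T*": "Holt",
  "KR*": "c",
  "TPHAOEUS": "nice", // illustrative brief: one stroke, no asterisk
};
const { kept, flagged } = partitionFingerspellings(sample);
console.log(Object.keys(flagged)); // entries that would need review
console.log(Object.keys(kept)); // entries left untouched
```

Flagged entries are returned rather than deleted so that cases like "Holt" can still be reviewed by hand.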

didoesdigital commented 4 years ago

Thanks for highlighting the issue here @JRJurman, as well as writing it up with a potential solution. Thanks as well for the script you made to efficiently make the adjustments. Great work!

I don't think you're missing much about steno here. This is definitely an issue. The only other steno factor is how the glue operator works (notes below). Otherwise, there are Typey Type-specific considerations (notes below).

I think the best approach for now is to:

Update the README to recommend that people turn off the top-10000-project-gutenberg-words.json dictionary in the Plover dictionary config and instead use the other dictionaries from which it is effectively assembled. This would prevent fingerspelled words like "ni" without glue operators causing unwanted spaces in truly fingerspelled words like "niceties" for people using the dictionaries in this repo and avoid Typey Type breaking.

Other Notes

I think the intention for the top-10000-project-gutenberg-words.json dictionary is that it curates preferred entries for the top 10000 words that also exist in other dictionaries such as dict.json or condensed-strokes.json. If you didn't use that dictionary or condensed-strokes.json, you should still be able to type every recommended outline for those words because they already exist in dict.json or can be assembled with prefixes and suffixes or fingerspelling or a combination, including Plover's orthography rules.

The main Plover dictionary itself defines things like the fingerspelled "i" character using the glue operator, for example, "*EU": "{>}{&i}",. The glue operator will only attach fingerspelled letters to other glue strokes. The top-10000-project-gutenberg-words.json includes entries for some fingerspelled letters like:

"KR*": "c",
"*E": "e",
"T*": "t",
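To make the spacing behaviour concrete, here is a small sketch of the glue rule. It is a deliberately simplified model that I'm assuming for illustration, not Plover's actual formatter: it only understands plain text and `{&...}` glue, and the function names `parseTranslation` and `render` are hypothetical.

```javascript
// Simplified spacing model, assumed for illustration only; Plover's real
// formatter handles many more operators (including "{>}").
function parseTranslation(translation) {
  const glue = translation.match(/^\{&(.*)\}$/);
  return glue
    ? { text: glue[1], glued: true }
    : { text: translation, glued: false };
}

function render(translations) {
  let output = "";
  let previousGlued = false;
  translations.forEach((translation, index) => {
    const { text, glued } = parseTranslation(translation);
    // The space is skipped only when a glued item follows another glued item.
    const attach = index === 0 || (glued && previousGlued);
    output += (attach ? "" : " ") + text;
    previousGlued = glued;
  });
  return output;
}

const letters = ["{&c}", "{&e}", "{&t}", "{&i}", "{&e}", "{&s}"];
console.log(render(["ni", ...letters])); // plain "ni" entry, then glued letters
console.log(render(["{&ni}", ...letters])); // glued "ni" entry, then glued letters
```

In this model the plain "ni" entry is not glue, so the following "c" cannot attach to it, which matches the "ni ceties" output reported in the summary.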

Specific to Typey Type, the top-10000-project-gutenberg-words.json is used to improve lookups by word and generate lessons.

To ensure the glue operator works correctly to make the fingerspelled letters attach to previous letters, I think we'd need to remove fingerspelled words like "TPH*/*EU": "ni", as well as fingerspelled letters like "KR*": "c",.

That would have knock-on effects. Off the top of my head, that might include:

The top-10000-project-gutenberg-words.json dictionary would not do what it says on the tin: have 10000 entries.
The Top 10000 Project Gutenberg words lesson itself would either not be 10000 words or be unable to find strokes for "c", "e", "t", etc. because it would search for "c" and only find {>}{&c} in the main dictionary and not match.
Any appearance of those letters in lessons as words or the two-letter fingerspelled words might not match in stroke lookups for generating lessons, e.g. in the Doctor Knowall sentence 'In the first place buy yourself an A B C book of the kind which has a …', words like "a" or "A" might not match.
Without words like "ni" or "mare" or "Eliza" in the top-10000-project-gutenberg-words.json, they wouldn't appear at all in lookups.

Most of these factors are problems that should be solved in the Typey Type app code or static lesson generator, which would then make it more practical to remove these entries here.
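As a rough idea of what solving this on the lookup side might look like, here is a sketch that normalises translations before comparing them to the word being looked up. This is not Typey Type's actual code; `plainText` and `lookupOutlines` are hypothetical names, and only the `{>}` and `{&...}` operators mentioned in this thread are handled.

```javascript
// Hypothetical normalisation, not Typey Type's actual lookup code.
// Strips the "{>}" operator and unwraps glue such as "{&c}" to "c".
function plainText(translation) {
  return translation.replace(/\{>\}/g, "").replace(/\{&([^}]*)\}/g, "$1");
}

// Return every outline whose normalised translation equals the word.
function lookupOutlines(dictionary, word) {
  return Object.entries(dictionary)
    .filter(([, translation]) => plainText(translation) === word)
    .map(([outline]) => outline);
}

const mainDictionary = {
  "KR*": "{>}{&c}",
  "*EU": "{>}{&i}",
};
console.log(lookupOutlines(mainDictionary, "c")); // finds the glued entry for "c"
```

With something like this in place, a search for "c" would no longer depend on a plain `"KR*": "c"` entry existing in the top-10000 dictionary.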

Alternatively, if we updated all of these entries to use glue operators, e.g. if "TPH*/*EU": "ni", were defined as "TPH*/*EU": "{&ni}", then adding the letter "c" using the entry "KR*": "{&c}", would attach properly, and be predictable with regular fingerspelling behaviour. The downside is that it would probably break some Typey Type behaviour until I've finished https://github.com/didoesdigital/typey-type/issues/6 and https://github.com/didoesdigital/typey-type-data/issues/1 to work around this, unless we add these versions to separate dictionaries that aren't used directly by Typey Type. In that case, we might:

Add a new dictionary called fingerspelled-words-without-glue-operators.json. This dictionary would be turned off in the Plover dictionary config, include fingerspelled words with translations without glue operators, and we'd move all fingerspelled entries from condensed-strokes.json as they are into this dictionary. This would ensure words like "ni" show up in Typey Type lookups and lesson and dictionary generation.
Add a new dictionary called fingerspelled-words-using-glue-operators.json. This dictionary would be turned on in the Plover dictionary config, include fingerspelled words with translations that include glue operators, and we'd copy all fingerspelled entries from condensed-strokes.json into this dictionary (with new glue operators). This would ensure words like "ni" show up in Plover lookups as {&ni}.

… but that would add a maintenance overhead and duplication. Is it worth the effort for handling rare edge cases?
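If we did go down the two-dictionary route, generating the glued copy from the unglued one could be scripted so the duplication is at least mechanical. A minimal sketch, assuming the unglued entries have already been collected into an object; `withGlue` is a hypothetical name and no such script exists in this repo:

```javascript
// Hypothetical sketch: derive glued translations from plain fingerspelled
// entries, e.g. "ni" becomes "{&ni}". Entries that already contain Plover
// operators (anything with "{") are copied through unchanged.
function withGlue(entries) {
  const glued = {};
  for (const [outline, translation] of Object.entries(entries)) {
    glued[outline] = translation.includes("{")
      ? translation
      : `{&${translation}}`;
  }
  return glued;
}

const withoutGlue = { "TPH*/*EU": "ni", "KR*": "c" };
console.log(JSON.stringify(withGlue(withoutGlue), null, 2));
```

That would keep the glued file a derived artifact, though both files would still need to be regenerated whenever the source entries change.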

didoesdigital commented 4 years ago

@JRJurman what do you think? @paulfioravanti, I'd also appreciate your take here, having worked with the dictionaries a lot and bumped into some of the edge cases, including fingerspelled entries in condensed-strokes.json.

JRJurman commented 4 years ago

It occurred to me after writing this up that this was in part used for typey-type, and so removing those would have some side-effects (although, clearly not as detailed as you've described here 😄)

When I started using this, I really wanted to ditch the main.json in plover, for something that's more organized, and so I opted to bring in more than not (I don't know if this is a strange use-case, or the standard for this repo).

I think clarifying which dictionary people should start out with (and which ones are good for plover vs other consumers) would be super valuable. This would make the additional dictionaries easier to introduce, ~although I'm not sure I understand the value of the fingerspelled-words-using-glue-operators.json, since having those words in plover wouldn't make a difference, right?~

A couple of general thoughts:

I wish json supported comments so we could detail the intention per file... I realize this is somewhat captured in the README, but just starting out I didn't know which dictionaries included what (and the plover syntax is still a little foreign to me)
Would it be too disruptive to have multiple folders, splitting out folders by intent or category. e.g. dictionaries you should add to plover vs dictionaries that are used for other tools OR sets of dictionaries, where you would realistically only pull one of a set in (100 vs 1000 vs 10000). This would be nice because each group could have a README that goes into a little more detail, without being daunting.

JRJurman commented 4 years ago

I just re-read what you wrote about having words in Plover's lookups... I'm not entirely sure that would be worth the effort... If a lookup is just going to tell me that I should fingerspell it, I almost feel like that should be part of plover (to display the strokes for any word as fingerspelling).

didoesdigital commented 4 years ago

@JRJurman

> When I started using this, I really wanted to ditch the main.json in plover, for something that's more organized, and so I opted to bring in more than not (I don't know if this is a strange use-case, or the standard for this repo).

You're not the first person to do this, which shows a growing use case for people choosing the Typey Type set over Plover's, which would suggest it's worth improving this experience. For a couple years it was only me using this set so I didn't need to think much about this. You might want to turn off some fingerspelling dictionaries (e.g. just use fingerspelling.json), symbol-currency-culled.json, top-100-words.json, and top-1000-words.json.

> I wish json supported comments so we could detail the intention per file... I realize this is somewhat captured in the README, but just starting out I didn't know which dictionaries included what (and the plover syntax is still a little foreign to me)

Me too. I also wish Plover supported TOML dictionaries or possibly even YAML, which support comments. I'd like to have comments per entry. I've mostly resorted to commenting via commit messages, which is not terribly accessible to the non-developer part of the Plover community.

> Would it be too disruptive to have multiple folders, splitting out folders by intent or category. e.g. dictionaries you should add to plover vs dictionaries that are used for other tools OR sets of dictionaries, where you would realistically only pull one of a set in (100 vs 1000 vs 10000). This would be nice because each group could have a README that goes into a little more detail, without being daunting.

This is an interesting idea but I do think it will be too disruptive to have multiple folders. At least at this point in time. Partly because of scripts, partly because of navigating commit history (e.g. commenting entries by commit messages), partly existing links to dictionaries, and partly the overlap in dictionaries (I might want the "vim" dictionary for "coding" and for "dictation").

I'd also like Plover to support "sets of dictionaries" too. For example, switching between "coding dictionaries" and "dictation dictionaries" or whatever.

@JRJurman, what if we added a "recommended dictionaries" section to the README under "Dictionaries"? Recommendations might link to the section of the README with the relevant blurb.

didoesdigital commented 4 years ago

@JRJurman, I'm curious how you found this repo and came to the decision to use these dictionaries? To help me understand more about how people find and use Plover so I can make Typey Type better and that sort of thing.

JRJurman commented 4 years ago

I'm just on the plover discord and noticed you had posted it - although I'm forgetting when now... I had bookmarked it as something to switch over to eventually, and just got around to trying it out this weekend.

As far as updating the README, I think that's perfect 👍

JRJurman commented 4 years ago

Looking now it appears that top-1000-words.json also has a couple of these entries:

{
  "P*": "p",
  "*E/R*": "er",
  "S*P/TPH*": "Sn",
  "KWR*/*E": "ye",
  "*E/T*/KWR*": "ety",
  "*U": "u",
  "KR*": "c",
  "H*/O*/TPH*/O*/*U/R*": "honour"
}

Maybe I'm still confused on which dictionaries should be recommended by default, but the list is getting narrower 😮

JRJurman commented 4 years ago

I'm tempted to use my script to just make a fingerspell-less version of the 10000 words, but I realize that isn't great for other people who want to be using this repo for dictionaries...

paulfioravanti commented 4 years ago

Just chiming in: I would agree that at this point it might be worth updating the README, rather than create any extra dictionaries.

I haven't actually encountered these kinds of problems yet as I am only using the out-of-the-box Plover dictionaries, and using them to inform issues/PRs I've been submitting (I'll probably switch over to using the dictionaries in this repo once I've hit 100% completion on Typey-Type).

didoesdigital commented 4 years ago

Thanks @JRJurman for the extra details and notes, and thanks @paulfioravanti for adding your voice to this.

I've updated the README here: https://github.com/didoesdigital/steno-dictionaries/blob/master/README.md#how-to-use-these-dictionaries. Does this make it clear enough which dictionaries to use to avoid hitting these issues?

JRJurman commented 4 years ago

@didoesdigital looks great!

I will point out that the condensed-strokes (which you recommend for lookups) does include some of these same issues.

I'm not entirely convinced that this is future-proofed in a way that others won't suffer the same issue, but that might be worth evaluating with a more specific use-case. For now, the suggested dictionaries are clear and make sense.

didoesdigital commented 4 years ago

Good catch, @JRJurman! Updated:

> … it can cause spacing issues in rare situations so you may want to add it to your Plover config in a certain order so that it is overwritten by the other dictionaries.

didoesdigital commented 4 years ago

Oh, that won't help. I need to re-think that advice. Maybe just, "turn it on for lookups"?

JRJurman commented 4 years ago

Is there a way to just have a dictionary for lookups? I was thinking last night that if this were an option in plover, it would solve a lot of issues... If it is possible, then I think that would be good.

It would also be nice (but certainly not required) if we could annotate in the README, next to the dictionary link something like

*: dictionary contains fingerspellings, so it should be used as a lookup only

JRJurman commented 4 years ago

I can also make a script to just tell us if a dictionary has a fingerspelling in it, so I can get you the full list of all dictionaries with fingerspellings

JRJurman commented 4 years ago

Okay, did a quick reworking of my script, and now you can test for fingerspellings (or at least, something that looks like a fingerspelling).

npm i -g steno-scripts
for dictionary in dictionaries/*.json; do test-fingerspellings "$dictionary"; done > fingerspellings_log.txt

Here is the output https://gist.github.com/JRJurman/57fc4d57efbac5195f4f1c595633e13a

It appears the following dictionaries have fingerspellings

abbreviations.json
code.json (symbols like < and + got caught here, so this might be wrong)
condensed-strokes.json
currency.json
dict.json
dict-en-AU-with-extra-stroke.json
misstrokes.json
nouns.json
punctuation.json
punctuation-di.json
punctuation-powerups.json
symbols-briefs.json
top-10000-project-gutenberg-words.json
top-1000-words.json

Some of these are single-letter strokes, which I don't believe actually cause the described issue... I can double check, but if that's the case I can modify my script to not catch these.

NOTE: it occurs to me that the fingerspelling dictionaries aren't caught here, that is because I'm checking the length of the string against the number of strokes. Since the fingerspelling dictionaries almost always have some curly braces, these get counted for the string length, and they therefore don't match.
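To illustrate the note above, here is the length comparison on its own. This is a sketch of the idea rather than the script's actual code, and `strokesMatchLength` is a hypothetical name:

```javascript
// Hypothetical sketch of the length check: the translation's string length
// is compared against the number of strokes in the outline.
function strokesMatchLength(outline, translation) {
  return outline.split("/").length === translation.length;
}

console.log(strokesMatchLength("TPH*/*EU", "ni")); // 2 strokes, 2 characters
console.log(strokesMatchLength("*EU", "{>}{&i}")); // 1 stroke, but braces count too
console.log("{>}{&i}".length); // characters counted for the glued "i" entry
```

The glued entry has one stroke but seven characters once the braces and operators are counted, so it never matches.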

didoesdigital commented 4 years ago

Neat! This is great, @JRJurman . I had a look through your other steno-scripts and there's a lot of handy stuff in there. Well done!

I've updated the README again.

I'm also making some progress on the Typey Type issue to look up briefs on the fly instead of using static files so it might reduce the need for certain dictionaries to have fingerspelling entries.

I'll keep this issue open for now as we chip away at it from different angles.

didoesdigital / steno-dictionaries

Issue with strokes that are letter combinations ("re", "ni", etc) #174