ds26gte closed this issue 1 year ago.
One additional reason: In our mission to support teachers building their own pathways, we'll need to provide these dictionaries in JSON format anyway so that glossary and standards pages can be dynamically computed. Dorai's build script can easily convert from s-exp to JSON, of course, but this is one less step it needs to take.
I should add that I will use one-off macros to pre-convert the existing data into JSON, so you're on your own only when you need to (i) modify existing entries or (ii) add new entries. Modifying an existing file by copying nearby entries should be less daunting.
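To make the one-off pre-conversion concrete, here is a toy sketch of the transformation in Python (not Dorai's actual macros; `parse_sexp` is a hypothetical helper, and this naive tokenizer does not handle quoted strings containing spaces):

```python
import json

def parse_sexp(s):
    """Parse a simple s-expression into nested Python lists of strings.

    Toy tokenizer: splits on whitespace after padding parens, so it
    cannot handle quoted strings that contain spaces.
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        if tokens[i] == "(":
            lst, i = [], i + 1
            while tokens[i] != ")":
                item, i = read(i)
                lst.append(item)
            return lst, i + 1
        return tokens[i].strip('"'), i + 1

    expr, _ = read(0)
    return expr

# The keywords field from the glossary source, converted to JSON:
keywords = parse_sexp('( ("mean" "means") ("average" "averages") )')
print(json.dumps({"keywords": keywords}))
```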
Please let me know what kind of JSON structure you'd like (what should be key-values vs arrays, what should be the keys' names, etc). My example form was just to get y'all thinking.
I quite like all of the examples, actually. @retabak and @flannery-denny may have other suggestions.
The keywords field in the glossary source is, in the most general case, a list of lists, e.g.,
( ("mean" "means") ("average" "averages") )
The sublists contain grammatical variations and resolve to the headword in the extracted glossary, which is the first element. Thus `@vocab{mean}` and `@vocab{means}` both resolve to a single entry in the extracted glossary, with headword `mean`. The reason we want grammatical variations is that the author typically writes the prose naturally and wraps chosen words with the `@vocab` marker. It is possible that only a plural is available to mark, but we still want the headword in the extracted glossary to be the singular.
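The resolution rule amounts to a variant-to-headword lookup; a minimal sketch (names hypothetical, not the actual build-script code):

```python
# Build a map from every grammatical variant to its headword
# (the first element of its sublist).
keywords = [["mean", "means"], ["average", "averages"]]

headword = {variant: sublist[0]
            for sublist in keywords
            for variant in sublist}

# Both @vocab{mean} and @vocab{means} resolve to the headword "mean".
print(headword["means"])     # mean
print(headword["averages"])  # average
```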
Multiple sublists support entirely different headwords for the same concept. This is useful because some contexts prefer one term over the other; e.g., some texts use `mean` exclusively, others `average`. In such cases, the presence of `@vocab{average}` or `@vocab{averages}` in a text that prefers `average` will use the headword `average` in the extracted glossary.
The proposal is to represent these in JSON also as lists of lists, i.e.,

```json
"keywords": [ [ "mean", "means" ], [ "average", "averages" ] ]
```
NOTE: I've used plurals as grammatical alternatives here, but the glossary extraction is smart about mainstream plurals (though not irregular plurals like *oxen* or *genera*), so you don't really need the `-s` plurals listed. This example is only meant to illustrate the sublisting of keywords. In practice, it can be compressed to:

```json
"keywords": [ [ "mean" ], [ "average" ] ]
```
However, "mean"
and "average"
can't be put in the same flat list [ "mean", "average" ]
, because then @vocab{average}
would use headword mean
in the extracted glossary, even though this is a text that explicitly avoids the term mean
. Hence the need for the nested listing, even when the individual sublists are singletons.
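The contrast between nested and flat lists can be made concrete with a toy illustration (again, not the actual build code):

```python
def headword_map(keywords):
    """Map each variant to the first element of its sublist."""
    return {v: sub[0] for sub in keywords for v in sub}

nested = [["mean"], ["average"]]   # the proposed nested form
flat   = [["mean", "average"]]     # the flat alternative

print(headword_map(nested)["average"])  # average -- distinct headwords preserved
print(headword_map(flat)["average"])    # mean    -- "average" wrongly folded into "mean"
```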
An alternative is to use multiple `keywords` fields with slight name variations:

```json
"keywords": [ "mean", "means" ],
"keywords2": [ "average", "averages" ]
```
This would work if there were only a small number *N* of distinct keyword categories. However, it is not general: the code requires more careful processing, and it must be changed, even more carefully, every time *N* needs to grow.
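For comparison, a consumer of the numbered-field alternative would have to probe for `keywords`, `keywords2`, `keywords3`, and so on until one is missing; a hypothetical sketch of that extra bookkeeping:

```python
entry = {"keywords": ["mean", "means"], "keywords2": ["average", "averages"]}

# Collect keywords, keywords2, keywords3, ... until one is missing.
groups, n = [], 1
while True:
    key = "keywords" if n == 1 else f"keywords{n}"
    if key not in entry:
        break
    groups.append(entry[key])
    n += 1

print(groups)  # [['mean', 'means'], ['average', 'averages']]
```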
Let's table this until @flannery-denny can weigh in
Only reading the email version of this, and am happy to take a deeper look when I'm back, but @ds26gte's explanation seems totally logical to me and addresses longstanding issues in our glossary build. My only concern is not about the logic of his solution, but rather about whether authors will keep track of the hierarchy of lists and consistently follow the structure he's come up with. What do you think @retabak? Is this intuitive for you?
It all makes sense to me, except that I can't really think of an instance where I would object to two synonymous terms (e.g., *mean* and *average*) showing up in the glossary.
I can think of three uses for vocabulary words in this discussion:
For uses 1 and 2, it seems like the solution is to list all of those words in the glossary dictionary, and have them show up in the glossary section of the lesson plan if we tag any of them. For use 3, it seems like the solution is just not to tag at all.
This added complexity suggests another use-case not listed above, and it would be good to document it in this issue.
Dorai walked me through a fourth use-case, which relies heavily on the complexity of English pluralization:
Axis/axes: these forms are sufficiently different that Dorai's auto-pluralization and auto-gerundization won't catch them, so the authors have to specify them both in the dictionary. The same is true for many words in English: child/children, ox/oxen, etc.
That means these vocab lists can become uncomfortably long and messy in certain cases, which means that the "have them show up in the glossary section of the lesson plan if we tag any of them" (mentioned above) will be problematic.
For that reason, we need to introduce build system complexity to deal with the pluralization complexity.
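To see why these forms slip through, here is a rough sketch of the kind of mainstream-plural heuristic involved (hypothetical; not Dorai's actual code):

```python
def mainstream_plural(word):
    """Naive English pluralization: covers -s/-es/-ies, nothing irregular."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"

print(mainstream_plural("average"))  # averages -- correct
print(mainstream_plural("axis"))     # axises   -- wrong; must be listed explicitly
print(mainstream_plural("ox"))       # oxes     -- wrong; must be listed explicitly
```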
Future Emmanuel, take note.
It's not just archaic plurals; we also have phrases where the plural modifies a non-final word, e.g., *circle of evaluation* / *circles of evaluation*. It would be super annoying for the extracted glossary to list them both, certainly...
Commit 970a0b7e4 completes all the code for this conversion.
@ds26gte ok to close?
Yes, try it out for an independent check, and then close.
We're considering moving how we keep our standards and glossaries from s-exp to JSON format. This is not an under-the-hood change, as these files are edited by the authors, and they have to be comfortable doing so.
The JSON will be more verbose but hopefully more helpful and less fragile. @schanzer has always been for JSON; others, please weigh in with your concerns. The argument is that s-exp, with its spare syntax, is easy to mistype, and erroneous entries take a long time to debug, whereas JSON, with its explicit keyword signposts, its clear distinction between association-lists and simple arrays, and rigorous automatic linting in text editors, will surface authoring errors immediately. (My text editor, for instance, syntax-highlights bad JSON in in-my-face fashion, and I expect the same is true of Sublime.)
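The fail-fast behavior is easy to demonstrate: any standard JSON parser rejects a malformed entry and reports exactly where it went wrong. A quick Python illustration:

```python
import json

good = '{ "keywords": [ [ "mean" ], [ "average" ] ] }'
bad  = '{ "keywords": [ [ "mean" ] [ "average" ] ] }'   # missing comma

json.loads(good)  # parses fine

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    # The parser pinpoints the error rather than silently mis-reading.
    print(f"caught at line {e.lineno}, column {e.colno}: {e.msg}")
```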
Our current glossary looks like this:
It will become:
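(The before/after file excerpts were attached in the original issue. Purely as a hedged illustration, based on the `keywords` structure discussed earlier in this thread, a converted JSON entry might be shaped like the following; all field names other than `keywords` are hypothetical, and the definition text is elided.)

```json
{
  "mean": {
    "keywords": [ [ "mean", "means" ], [ "average", "averages" ] ],
    "definition": "..."
  }
}
```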
Similarly, our current standards look like:
It will become:
In terms of implementation, our Lua scripts can just as easily read s-exp as JSON, so there is no downside either way. It is all up to what the authors find comfortable and useful.