bootstrapworld / curriculum

Moving our standards/textbooks/practices and glossaries from s-exp to JSON format #1291

Closed (ds26gte closed this issue 1 year ago)

ds26gte commented 1 year ago

We're considering moving how we keep our standards and glossaries from s-exp to JSON format. This is not an under-the-hood change, as these files are edited by the authors, and they have to be comfortable doing so.

The JSON will be more verbose but hopefully more helpful and less fragile. @schanzer has always been for JSON; others, please weigh in with your concerns. The argument is that s-exp, with its spare syntax, is easy to mistype, and erroneous entries can take a long time to debug, whereas JSON, with its explicit keyword signposts, its clear distinction between association lists (objects) and simple arrays, and rigorous automatic linting in text editors, should surface authoring errors immediately. (In my text editor, for instance, bad JSON is syntax-highlighted in in-my-face fashion, and I expect the same is the case for Sublime.)

Our current glossary looks like this:

(
  ((en-us ("absolute value") "the (positive) distance of a number from zero, annotated +|+ x +|+")
   (es-mx ("valor absoluto" "modulo de un numero") "la distancia (positiva) de un número al cero, anotada +|+ x +|+"))

  ...
)

It will become:

[ 
   {
     "en-us": {
        "keywords": [ "absolute value" ],
        "description":  "the (positive) distance of a number from zero, annotated +|+ x +|+"
      },
      "es-mx": {
        "keywords": [ "valor absoluto", "modulo de un numero" ],
        "description":  "la distancia (positiva) de un número al cero, anotada +|+ x +|+"
      }
   },

  ...
]

Similarly, our current standards look like:

(
  ("3B-IC-28"
     "Debate laws and regulations that impact the development and use of software."
     "ethics-privacy-and-bias"
     )
  ...
)

It will become:

{ 
  "3B-IC-28": {
     "description": "Debate laws and regulations that impact the development and use of software.",
     "lessons": [
       "ethics-privacy-and-bias"
     ]
  },
  ...
}

In terms of implementation, our Lua scripts can just as easily read s-exp as JSON, so there is no downside either way. It is all up to what the authors find comfortable and useful.
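
To make the "errors surface immediately" argument concrete, here is a minimal sketch in Python. Python is used purely for illustration (the project's build scripts are in Lua), and `check_glossary` is a hypothetical helper, not something in the repository. It parses a glossary in the proposed shape and reports either where the syntax error is or which entry has a field of the wrong type.

```python
import json

# The glossary example from this issue, in the proposed JSON shape.
GLOSSARY_JSON = """
[
  {
    "en-us": {
      "keywords": ["absolute value"],
      "description": "the (positive) distance of a number from zero, annotated +|+ x +|+"
    },
    "es-mx": {
      "keywords": ["valor absoluto", "modulo de un numero"],
      "description": "la distancia (positiva) de un número al cero, anotada +|+ x +|+"
    }
  }
]
"""

def check_glossary(text):
    """Return a list of problems found in a glossary document (empty if none)."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as err:
        # The parser pinpoints the first syntax error by line and column.
        return [f"syntax error at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(entries, list):
        return ["top level must be an array of entries"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be an object keyed by language")
            continue
        for lang, fields in entry.items():
            if not isinstance(fields, dict):
                problems.append(f"entry {i} ({lang}): must be an object")
                continue
            if not isinstance(fields.get("keywords"), list):
                problems.append(f"entry {i} ({lang}): 'keywords' must be an array")
            if not isinstance(fields.get("description"), str):
                problems.append(f"entry {i} ({lang}): 'description' must be a string")
    return problems

print(check_glossary(GLOSSARY_JSON))  # []
```

Dropping a single comma from the sample yields one "syntax error at line ..." message instead of a silently misread entry, which is the kind of early feedback the s-exp files lack.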

schanzer commented 1 year ago

One additional reason: In our mission to support teachers building their own pathways, we'll need to provide these dictionaries in JSON format anyway so that glossary and standards pages can be dynamically computed. Dorai's build script can easily convert from s-exp to JSON, of course, but this is one less step it needs to take.
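
As a rough illustration of "dynamically computed" standards pages, the sketch below (Python, for illustration only) picks out the standards covered by a teacher-chosen list of lessons. The function name `standards_for_pathway` and the second standard, `EXAMPLE-1`, are hypothetical and exist only to show the filtering; the first entry is the one quoted in this issue.

```python
import json

STANDARDS_JSON = """
{
  "3B-IC-28": {
    "description": "Debate laws and regulations that impact the development and use of software.",
    "lessons": ["ethics-privacy-and-bias"]
  },
  "EXAMPLE-1": {
    "description": "Hypothetical placeholder standard, not from the real data.",
    "lessons": ["some-other-lesson"]
  }
}
"""

def standards_for_pathway(standards, lessons):
    """Return {standard id: description} for standards touched by any given lesson."""
    wanted = set(lessons)
    return {
        std_id: info["description"]
        for std_id, info in standards.items()
        if wanted & set(info["lessons"])  # non-empty intersection means coverage
    }

standards = json.loads(STANDARDS_JSON)
for std_id, description in standards_for_pathway(standards, ["ethics-privacy-and-bias"]).items():
    print(std_id, "-", description)
```

Because the proposed format keys each standard by its ID and lists its lessons explicitly, this lookup needs no positional conventions about which string in a list means what.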

ds26gte commented 1 year ago

I should add that I will use one-off macros to pre-convert the existing data into JSON, so you're on your own only when you need to (i) modify existing entries or (ii) add new entries. Modifying an existing file by copying nearby entries should be less daunting.
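
For anyone curious what such a pre-conversion involves, here is a small Python sketch. It is not the one-off macros mentioned above, just an illustration under stated assumptions: quoted strings use JSON-compatible escapes, and in a standards entry every string after the description is a lesson name (the example in this issue shows only one). The names `parse_sexp`, `standards_to_json`, and `glossary_to_json` are hypothetical.

```python
import json
import re

# Tokens: parentheses, double-quoted strings (which may contain parentheses),
# and bare symbols such as en-us.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')

def parse_sexp(text):
    """Parse one s-expression into nested Python lists and strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if tok.startswith('"'):
            return json.loads(tok)  # assumes JSON-compatible string escapes
        return tok  # bare symbol

    return read()

def standards_to_json(sexp_text):
    """Convert (id description lesson...) entries to the proposed object form."""
    converted = {}
    for std_id, description, *lessons in parse_sexp(sexp_text):
        converted[std_id] = {"description": description, "lessons": lessons}
    return json.dumps(converted, indent=2, ensure_ascii=False)

def glossary_to_json(sexp_text):
    """Convert ((lang (keywords...) description) ...) entries to the proposed array form."""
    converted = [
        {lang: {"keywords": keywords, "description": description}
         for lang, keywords, description in entry}
        for entry in parse_sexp(sexp_text)
    ]
    return json.dumps(converted, indent=2, ensure_ascii=False)

STANDARDS_SEXP = '''
(
  ("3B-IC-28"
     "Debate laws and regulations that impact the development and use of software."
     "ethics-privacy-and-bias"
     )
)
'''
print(standards_to_json(STANDARDS_SEXP))
```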

ds26gte commented 1 year ago

Please let me know what kind of JSON structure you'd like (what should be key-values vs arrays, what should be the keys' names, etc). My example form was just to get y'all thinking.

schanzer commented 1 year ago

I quite like all of the examples, actually. @retabak and @flannery-denny may have other suggestions.

ds26gte commented 1 year ago

The keywords field in the glossary source is, in the most general case, a list of lists, e.g.,

( ("mean" "means") ("average" "averages") )

The sublists contain grammatical variations and resolve the headword in the extracted glossary to the first element. Thus @vocab{mean} and @vocab{means} both resolve to a single entry in the extracted glossary, with headword mean. The reason we want grammatical variations is that the author typically writes the prose naturally and wraps chosen words with the @vocab marker. It is possible that only a plural is available to mark, but we still want the headword in the extracted glossary to be the singular.

Multiple sublists support entirely different headwords for the same concept. This is useful because some contexts prefer one or the other. E.g., some texts use mean exclusively, others average. In such cases, the presence of @vocab{average} or @vocab{averages} in the text that prefers average will use the headword average in the extracted glossary.

The proposal is to represent these in JSON also as lists of lists, i.e.,

"keywords": [ [ "mean", "means" ], [ "average", "averages" ] ]

NOTE: I've used plurals as grammatical alternatives here, but the glossary extraction is smart about mainstream plurals (but not plurals like oxen, genera), so you don't really need the -s plurals listed. However, this example is used to illustrate the sublisting of keywords. In practice, the example can be compressed to

"keywords": [ [ "mean" ], [ "average" ] ]

However, "mean" and "average" can't be put in the same flat list [ "mean", "average" ], because then @vocab{average} would use headword mean in the extracted glossary, even though this is a text that explicitly avoids the term mean. Hence the need for the nested listing, even when the individual sublists are singletons.
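
The difference between the nested and flat forms can be seen in a few lines of Python. This is an illustrative sketch of the "first element of the sublist is the headword" rule described above, not the project's Lua implementation; `build_headword_index` is a hypothetical name.

```python
def build_headword_index(keywords):
    """Map every listed form to the first element of the sublist it appears in."""
    index = {}
    for forms in keywords:
        headword = forms[0]
        for form in forms:
            index[form] = headword
    return index

# Nested form: each sublist has its own headword.
nested = build_headword_index([["mean", "means"], ["average", "averages"]])
# Flat form, treated as one big sublist: everything resolves to "mean".
flat = build_headword_index([["mean", "means", "average", "averages"]])

print(nested["averages"])  # average
print(flat["averages"])    # mean
```

With the flat list, a text that only ever tags @vocab{average} would still get the headword mean in its extracted glossary, which is exactly what the nesting prevents.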

An alternative is to use multiple keywords fields with slight name variations.

"keywords": [ "mean", "means" ],
"keywords2": [ "average", "averages" ]

This would work if there were only a small number N of different keyword categories. However, it is not general: the code requires more careful processing, and we would need to change that code even more carefully every time N changes.

schanzer commented 1 year ago

Let's table this until @flannery-denny can weigh in

flannery-denny commented 1 year ago

Only reading the email version of this, and am happy to take a deeper look when I’m back, but @ds26gte's explanation seems totally logical to me and addresses longstanding issues in our glossary build. My only concern is not about the logic of his solution, but rather about whether authors will keep track of the hierarchy of lists and consistently follow the structure he’s come up with. What do you think @retabak? Is this intuitive for you?

Flannery

(Sent by email in reply to @schanzer's comment of Mar 10, 2023, 10:04 AM.)


retabak commented 1 year ago

It all makes sense to me, except that I can't really think of an instance where I would object to two synonymous terms (e.g., mean and average) both showing up in the glossary.

schanzer commented 1 year ago

I can think of three uses for vocabulary words in this discussion:

  1. Terms that we want displayed in the glossary, and tagged when we use them in a lesson
  2. Terms that are synonyms that we'd want in the glossary if we tag the preferred form in a lesson
  3. Terms that are synonyms that we do NOT want to have show up in the glossary if we tag the preferred form in a lesson

For 1+2, it seems like the solution is to list all of those words in the glossary dictionary, and have them show up in the glossary section of the lesson plan if we tag any of them. For 3, it seems like the solution is just not to tag at all.

This added complexity suggests another use-case not listed above, and it would be good to document it in this issue.

schanzer commented 1 year ago

Dorai walked me through a fourth use-case, which relies heavily on the complexity of English pluralization:

Axis/Axes -- these forms are sufficiently different that Dorai's auto-pluralization and auto-gerundization won't catch them, so the authors have to specify them both in the dictionary. The same is true for lots of words in English: child/children, ox/oxen, etc.
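
A toy Python sketch shows why regular plurals can be inferred while irregular ones must be listed. The suffix rule here is a deliberately naive stand-in and is not Dorai's actual auto-pluralization; `candidates` and `resolve` are hypothetical names.

```python
def candidates(word):
    """Yield the word itself, then guesses at its singular by stripping -es or -s."""
    yield word
    if word.endswith("es"):
        yield word[:-2]
    if word.endswith("s"):
        yield word[:-1]

def resolve(tagged, keywords):
    """Return the headword for a tagged word, or None if nothing matches."""
    index = {form: forms[0] for forms in keywords for form in forms}
    for candidate in candidates(tagged):
        if candidate in index:
            return index[candidate]
    return None

print(resolve("means", [["mean"]]))          # mean
print(resolve("axes", [["axis"]]))           # None
print(resolve("axes", [["axis", "axes"]]))   # axis
```

Stripping a suffix from "axes" gives "ax" or "axe", never "axis", so the irregular form only resolves once it is listed explicitly in the sublist.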

That means these vocab lists can become uncomfortably long and messy in certain cases, so the "have them show up in the glossary section of the lesson plan if we tag any of them" approach (mentioned above) will be problematic.

For that reason, we need to introduce build system complexity to deal with the pluralization complexity.

Future Emmanuel, take note.

ds26gte commented 1 year ago

It's not just archaic plurals; we also have phrases where the plural falls on a non-final word, e.g., circle of evaluation, circles of evaluation. It would be super annoying for the extracted glossary to list them both, certainly...

ds26gte commented 1 year ago

Commit 970a0b7e4 completes all the code for this conversion

schanzer commented 1 year ago

@ds26gte ok to close?

ds26gte commented 1 year ago

Yes, try it out for an independent check, and then close.