Add a script to update json schema strings from one file

bdarcus commented 4 years ago

What would that look like, and how would it solve the problem with schema verbosity? Would it?

A Python script in GitHub Actions could compile the csl-data.json at commit time. It would have a list of all types and variables (separated by category) and dynamically construct the JSON. The main benefit would be ease of maintenance and updating: no need to manually keep four nearly identical lists aligned.

Originally posted by @bwiernik in https://github.com/citation-style-language/schema/issues/167#issuecomment-639037322
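For illustration, a minimal sketch of the kind of build step described above. It assumes a hypothetical in-script table of variables grouped by category; the names `VARIABLES`, `CATEGORY_SCHEMAS`, and `build_item_schema` are invented for this example, and the output is far simpler than the real csl-data.json.

```python
import json

# Hypothetical sample data: a few CSL variables grouped by category.
VARIABLES = {
    "dates": ["accessed", "issued"],
    "names": ["author", "editor"],
    "strings": ["title", "publisher"],
}

# Assumed (simplified) JSON Schema fragment for each category.
CATEGORY_SCHEMAS = {
    "dates": {"$ref": "#/definitions/date-variable"},
    "names": {"type": "array", "items": {"$ref": "#/definitions/name-variable"}},
    "strings": {"type": "string"},
}


def build_item_schema(variables):
    """Expand each category list into per-variable JSON Schema properties."""
    properties = {}
    for category, names in variables.items():
        for name in names:
            properties[name] = CATEGORY_SCHEMAS[category]
    return {
        "type": "object",
        "properties": properties,
        "additionalProperties": False,
    }


schema = build_item_schema(VARIABLES)
print(json.dumps(schema, indent=2))
```

With this layout, adding a variable means appending one string to one list, and the per-variable schema entries follow from the category.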

bdarcus commented 4 years ago

Regardless of the details, the goal is that any commit to the rnc files that hold the variable strings (see this, for example) would automatically update the json schema files, ideally as part of the same commit, so we could simply do:

$ git commit -m 'Add "foo" variable'

Also, we can move to having these maintained in a separate YAML file if that would be easier.

PR welcome.

Also, happy to restructure those files a bit to make the task easier.

denismaier commented 4 years ago

I am not sure this is what @bwiernik was talking about in his post:

For the data schema, an option might be to split out the schemas into separate files that match the RNC type and variable structure and use a build script to compile them at commit time.

Originally posted by @bwiernik in https://github.com/citation-style-language/schema/issues/167#issuecomment-638839010

I don't read this as building the json schema from the rnc schema.

denismaier commented 4 years ago

Also, variables in json and rnc are not necessarily identical.

bdarcus commented 4 years ago

Doh!

In any case, if he wants to create an issue from that comment, with the correct description, he should feel free.

bdarcus commented 4 years ago

I will add, however, compared to rnc, I despise json schema. It might be worth considering a simpler list format that a script could convert into the json?

bwiernik commented 4 years ago

No, something like this is totally what I was thinking about. It might be easier to do a script that produces both the RNC and JSON, but this is in the right direction.

denismaier commented 4 years ago

Ok, sorry for the noise.

bdarcus commented 4 years ago

We just need a script that will parse the rnc file(s) into patterns and strings, that would match what would be in the output target; something like:

def parse(rnc_text):
    patterns = ['variables.dates', 'variables.names', 'variables.numbers', 'variables.strings', 'variables.titles']
    for pattern in patterns:
        # parse into list of tokens
        # iterate through to convert to JSON schema
        pass  # placeholder: the steps above are not implemented yet
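A rough sketch of how that parsing step might look. It assumes each pattern in the rnc file is written as a dotted name, an equals sign, and quoted strings separated by `|`; the sample text, the regular expression, and the function names are illustrative only, and comments or nested patterns in a real rnc file would need extra handling.

```python
import re

# Hypothetical excerpt in the assumed RNC layout.
RNC_SAMPLE = '''
variables.dates =
  "accessed"
  | "issued"

variables.names =
  "author"
  | "editor"
'''

# One pattern definition: a dotted name, "=", then a run of quoted strings
# optionally separated by "|".
DEFINITION = re.compile(r'^\s*([\w.]+)\s*=((?:\s*\|?\s*"[^"]*")+)', re.MULTILINE)


def parse_rnc(text, patterns):
    """Return {pattern name: [string values]} for the requested patterns."""
    found = {}
    for name, body in DEFINITION.findall(text):
        if name in patterns:
            found[name] = re.findall(r'"([^"]*)"', body)
    return found


def to_json_schema_enum(values):
    """Convert a list of strings into a JSON Schema enum fragment."""
    return {"type": "string", "enum": values}


parsed = parse_rnc(RNC_SAMPLE, ["variables.dates", "variables.names"])
for name, values in parsed.items():
    print(name, to_json_schema_enum(values))
```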

denismaier commented 4 years ago

We just need a script that will parse the rnc file(s) into patterns and strings, that would match what would be in the output target; something like:

Unless we have variables that occur in the rnc files, but not in the json schema; or the other way round.

bdarcus commented 4 years ago

Right, but we can adjust the patterns so they line up, and/or add some additional logic to adjust where that's easier or better?

bdarcus commented 4 years ago

If we move to a non rnc source format, I vote for YAML, with lists based on output datatypes.

bdarcus commented 4 years ago

Took me a minute or two to convert the rnc file to yaml.

There's not much difference, other than YAML not requiring customized parsing.

But beyond that, another advantage here might be that we could also add other things, like documentation?

variables.dates:
  - accessed:
      description: Some short textual description that is a bit of a longer line to see if there's any issues with longer descriptions.
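As a sketch of how such descriptions could be used, the following assumes the YAML has already been loaded (for example with `yaml.safe_load`) into the structure shown in `data`. The description strings are placeholders and the helper names are hypothetical.

```python
# Assumed shape of the loaded YAML: each entry is either a plain string or a
# one-key mapping from variable name to its metadata.
data = {
    "variables.dates": [
        {"accessed": {"description": "Placeholder description for accessed."}},
        {"issued": {"description": "Placeholder description for issued."}},
        "submitted",  # an entry without any metadata
    ]
}


def variable_names(entries):
    """Collect the bare variable names, ignoring metadata."""
    names = []
    for entry in entries:
        if isinstance(entry, dict):
            names.extend(entry.keys())
        else:
            names.append(entry)
    return names


def render_docs(entries):
    """Produce one 'name: description' line per variable."""
    lines = []
    for entry in entries:
        if isinstance(entry, dict):
            for name, meta in entry.items():
                description = (meta or {}).get("description", "")
                lines.append(f"{name}: {description}".rstrip())
        else:
            lines.append(f"{entry}:")
    return "\n".join(lines)


print(variable_names(data["variables.dates"]))
print(render_docs(data["variables.dates"]))
```

The same list feeds both the schema (names only) and the documentation (names plus descriptions).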

denismaier commented 4 years ago

That looks good. How did you do that? Manually?

There's not much difference, other than YAML not requiring customized parsing.

Well, I think the main advantage is that we are not restricted by what is possible in rnc. We can then probably do things like this:

variables.names:
  - author:
      variants:
        - container
        - reviewed
        - original
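A small sketch of how a script might expand such a list, assuming `variants` is nested under the variable name and each variant becomes a hyphenated prefix (so `container` plus `author` gives `container-author`). The function name and the sample data are hypothetical.

```python
# Assumed result of loading the YAML above, with "variants" nested under the
# variable name.
names = [
    {"author": {"variants": ["container", "reviewed", "original"]}},
    {"editor": None},  # a variable without variants
]


def expand_variants(entries):
    """Return each variable followed by its prefixed variants."""
    expanded = []
    for entry in entries:
        for name, meta in entry.items():
            expanded.append(name)
            for variant in (meta or {}).get("variants", []):
                expanded.append(f"{variant}-{name}")
    return expanded


print(expand_variants(names))
```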

bdarcus commented 4 years ago

Manually; I used vim search/replace.

bdarcus commented 4 years ago

I was playing a bit, just to see how it could work.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import yaml
from string import Template

# Define the RNG template for a single <value> element; the assembled RNG
# file can then be passed to Trang.
rng_template = Template('<value>$variable</value>')

# Load the variable lists once, closing the file when done.
with open('csl-variables.yaml', 'r') as f:
    variables = yaml.safe_load(f)

dates = variables['variables.dates']

for var in dates:
    print(rng_template.substitute(variable=var))

All it does currently is output this:

<value>accessed</value>
<value>available-date</value>
<value>container</value>
<value>event-date</value>
<value>issued</value>
<value>original-date</value>
<value>submitted</value>

denismaier commented 4 years ago

So that would mean switching to rng xml syntax? Can this be used together with the main schema being in compact syntax?

bdarcus commented 4 years ago

So that would mean switching to rng xml syntax?

I was only thinking about the xml syntax internally; as in, pipe it to trang, to convert to the rnc.

Not sure it's needed; just an idea.
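To make the idea concrete, here is a hedged sketch that wraps the `<value>` elements in a RELAX NG `define`/`choice` structure using the standard library. The function name and sample data are hypothetical, and whether Trang accepts a fragment like this without a `start` pattern has not been checked here.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"


def build_rng(definitions):
    """Build a RELAX NG grammar with one <define> per list of values."""
    ET.register_namespace("", RNG_NS)  # serialize with a default namespace
    grammar = ET.Element(f"{{{RNG_NS}}}grammar")
    for name, values in definitions.items():
        define = ET.SubElement(grammar, f"{{{RNG_NS}}}define", name=name)
        choice = ET.SubElement(define, f"{{{RNG_NS}}}choice")
        for value in values:
            ET.SubElement(choice, f"{{{RNG_NS}}}value").text = value
    return ET.tostring(grammar, encoding="unicode")


# Hypothetical sample input; in practice this would come from the YAML file.
rng = build_rng({"variables.dates": ["accessed", "issued"]})
print(rng)
```

The resulting string could be written to an `.rng` file and converted to compact syntax in a separate step.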

denismaier commented 4 years ago

I've been toying around a bit with this here:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import yaml
import json
from string import Template

# Define the RNG template, to assemble the complete RNG file, to pass to Trang.
rng_template = Template('<value>$variable</value>')

# Parse the YAML file once instead of once per category.
with open('csl-variables.yaml', 'r') as f:
    variables = yaml.safe_load(f)

dates = variables['variables.dates']
names = variables['variables.names']
numbers = variables['variables.numbers']
strings = variables['variables.strings']
titles = variables['variables.titles']

def create_variable_variants(variables, variants):
    # Placeholder: not implemented yet.
    pass

def create_variable_variant(variables, modifier, affix):
    """Return the given variables plus an affixed variant of each one."""
    if modifier == "prefix":
        new_vars = [affix + "-" + var for var in variables]
    else:
        new_vars = [var + "-" + affix for var in variables]
    return variables + new_vars

strings_with_short = create_variable_variant(variables=strings, modifier="suffix", affix="short")
strings_with_reviewed = create_variable_variant(variables=strings_with_short, modifier="prefix", affix="reviewed")
strings_with_original = create_variable_variant(variables=strings_with_short, modifier="prefix", affix="original")
# Each list above repeats strings_with_short, so drop duplicates while keeping order.
strings = list(dict.fromkeys(strings_with_short + strings_with_original + strings_with_reviewed))
print(strings)

Takes a list of variables and builds the other variants.

dhimmel commented 4 years ago

As far as the CSL JSON Data schema goes, I think creating it from a script is adding unnecessary complexity. It requires contributors to not only learn JSON Schema but also additional languages / infrastructure for generating the schema.

In https://github.com/citation-style-language/schema/pull/271#issuecomment-649540638, we discuss JSON Schema's pattern properties, which can provide much of the flexibility we need:

{
  "type": "object",
  "patternProperties": {
    "^title(-long|-sub|-main|-short)?$": { "type": "string" },
    "^container-title(-long|-sub|-main|-short)?$": { "type": "string" },
    "^collection-title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}
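To show what this schema accepts and rejects, here is a hand-rolled sketch that mimics `patternProperties` with `additionalProperties: false` using only Python's `re` module. It is not a JSON Schema validator, and the `validate` helper is hypothetical. Python's regex dialect differs from the one JSON Schema specifies, though these simple patterns should behave the same in both.

```python
import re

# The three patterns from the schema above, each mapped to the expected type.
PATTERN_PROPERTIES = {
    r"^title(-long|-sub|-main|-short)?$": str,
    r"^container-title(-long|-sub|-main|-short)?$": str,
    r"^collection-title(-long|-sub|-main|-short)?$": str,
}


def validate(item):
    """Return a list of error messages; an empty list means the item passes."""
    errors = []
    for key, value in item.items():
        expected = [t for p, t in PATTERN_PROPERTIES.items() if re.search(p, key)]
        if not expected:
            # additionalProperties: false rejects keys matching no pattern.
            errors.append(f"unexpected property: {key}")
        elif not all(isinstance(value, t) for t in expected):
            errors.append(f"wrong type for property: {key}")
    return errors


print(validate({"title": "A", "container-title-short": "B"}))
print(validate({"title-tiny": "x", "title": 3}))
```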

Sorry, I'm not totally up to date on the motivations here. How much busywork/boilerplate is there on the RNC side to justify a creation script?

In my opinion, there is considerable overhead to auto-generating these files. For example, it's pretty daunting for an unfamiliar contributor to read through build-variable-schemas.py proposed in https://github.com/citation-style-language/schema/pull/263.

bdarcus commented 4 years ago

How much busywork/boilerplate is there on the RNC side to justify a creation script?

RNC has no cruft at all; it's the xml equivalent of yaml.

bdarcus commented 2 years ago

@bwiernik @denismaier - just calling your attention to this issue, as it's related to what we discussed at the end.

I haven't reread Dan's concerns about this (edit: did just now; it's basically that it adds too much complexity for minimal benefit), but my original thought was: what if you had a file like this:

types:
  - name: book
    description: Some description.

... and then a diff to add a new type was just something like this ...

>   - name: foo
>     description: Some other description.
>

... and it could be auto-populated to rnc, json schema, AND documentation?

If there was a typo in the description, it could then be fixed in one place.
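A minimal sketch of that single-source idea, assuming the types file has been loaded into a list of name/description mappings. The pattern name `item-types` and all three helper functions are hypothetical, and the outputs are deliberately simplified.

```python
import json

# Assumed result of loading the types file shown above.
TYPES = [
    {"name": "book", "description": "Some description."},
    {"name": "foo", "description": "Some other description."},
]


def to_rnc(types, pattern_name="item-types"):
    """Render the type names as an RNC choice of quoted strings."""
    choices = "\n  | ".join(f'"{t["name"]}"' for t in types)
    return f"{pattern_name} =\n  {choices}\n"


def to_json_schema(types):
    """Render the type names as a JSON Schema enum fragment."""
    return {"type": "string", "enum": [t["name"] for t in types]}


def to_docs(types):
    """Render one 'name: description' line per type."""
    return "\n".join(f'{t["name"]}: {t["description"]}' for t in types)


print(to_rnc(TYPES))
print(json.dumps(to_json_schema(TYPES), indent=2))
print(to_docs(TYPES))
```

Because all three outputs read from `TYPES`, a corrected description or a new type only has to be edited once.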

If this is a bad idea, though, maybe we should close this. But I thought worth re-asking the question.

Edit: we already merged a script from Denis, but this is partially what Dan was objecting to.

Add a script to update json schema strings from one file #223