exercism / problem-specifications

Shared metadata for exercism exercises.
MIT License
326 stars 541 forks source link

json schema validation: enforce lowerCamelCase keys #987

Open Insti opened 6 years ago

Insti commented 6 years ago

There is a json schema definition for the canoncial-data.json file. https://github.com/exercism/problem-specifications/blob/master/canonical-schema.json

However this still allows for arbitrary keys containing spaces. These keys are often used for specifying test input arguments.

https://github.com/exercism/problem-specifications/blob/637cca615dbbb52d61d0a7a74d71c09360bdebf4/exercises/food-chain/canonical-data.json#L21

Task

Update the https://github.com/exercism/problem-specifications/blob/master/canonical-schema.json to require all keys to be lowerCamelCase

Insti commented 6 years ago

Is this a good idea?

NobbZ commented 6 years ago

I'd not say that parsing is easier or harder, but usage.

In some languages keys are transformed in a way that we can use the keys name as an accessor of a struct/object/record like thing (eg. case.startVerse). If though they contain spaces they need to be often treated differently by using array like accessors (eg case["start verse"]).

I'm not affected by that in erlang anyway. But I have to match on the keys using atoms, and usually atoms in erlang are alphanumerics + underscore, beginning with lowercase alpha. I can though construct atoms that contain all kind of characters (in modern erlang even unicode characters) as long as they do not exceed 256 (or was it 255) bytes in their UTF8 representation. But I have to quote them then using single-quote character: (startVerse vs. 'start verse'). Even though quoted atoms are possible, a rule of thumb is to avoid them, since the quoting often leads to problems in IDEs and when talking with people from the outside world. They often consider them plain strings which is wrong.

So even here I'd prefer a way that avoids quoting the key.

ErikSchierboom commented 6 years ago

So even here I'd prefer a way that avoids quoting the key.

Agreed. I'm also in favor of enforcing lowerCamelCase keys.

rbasso commented 6 years ago

Nice change, @Insti. We didn't enforced that in the schema in #602 at the time because my json-schema-fu [is|was] too weak.

ErikSchierboom commented 2 years ago

@Insti Are you still interested in working on this issue?

ErikSchierboom commented 2 years ago

Looking at the current canonical-data schema, it looks like this has been implemented.

petertseng commented 2 years ago

As you can see in https://github.com/exercism/problem-specifications/pull/1925, what has been asked for in this issue has not been implemented, because https://github.com/exercism/problem-specifications/pull/1925 passed CI.

So if this issue is to be closed, the reason can't be "it has already been implemented" (because it hasn't); the reason has to be "we chose not to implement it".

This comment is neither a recommendation for or against closing this issue.

ErikSchierboom commented 2 years ago

I misread the issue and assumed this was about the property key, and not the input/expected keys. I've just ran the following bash script to figure out if any keys have a space:

for file in exercises/*/canonical-data.json; do
    if [ $(jq -e -r '[paths | join(".") | select(. | contains(" "))] | length > 0' $file) == "true" ]; then
        echo $file
    fi
done

This did not output any file, so it looks like we don't yet have any keys with spaces. The question then becomes: how can we update the schema file to enforce this? What must be considered is that the input/expected keys can be objects, but could also be literal values like booleans, integers or strings.

rbasso commented 2 years ago

The question then becomes: how can we update the schema file to enforce this?

Hi, @ErikSchierboom .

I think you could use propertyNames to enforce a pattern to the keys...

{
  "type": "object",
  "propertyNames": {
    "pattern": "^[a-z]+([A-Z][a-z]+)*[A-Z]?$"
  }
}

...but that would not enforce the pattern inside nested objects automatically, so the following would be valid:

{
  "validKey": {
    "Invalid Key": null
  }
}

Edit - Below is the simplest solution I was able to write to restrict all keys to a regexp. I divided it in three definions, instead of just one, to make usage more flexible:

$defs:

  lowerCamelCaseArray:
    type: array
    items:
      $ref: "#/$defs/lowerCamelCaseValue"

  lowerCamelCaseObject:
    type: object
    patternProperties:
      "^[a-z]+([A-Z][a-z]+)*[A-Z]?$":
        $ref: "#/$defs/lowerCamelCaseValue"
    additionalProperties: false

  lowerCamelCaseValue:
    anyOf:
    - type: "boolean"
    - type: "null"
    - type: "number"
    - type: "string"
    - $ref: "#/$defs/lowerCamelCaseArray"
    - $ref: "#/$defs/lowerCamelCaseObject"

$ref: "#/$defs/lowerCamelCaseObject"

Please, check if it works as expected!

Right now, I don't have the time to write the PR, but I hope this helps. 😄

rbasso commented 2 years ago

I just wrote #1929 to check the possible impact of the change.

The following exercises have nonLowerCamelCase keys:

alphametics
circular-buffer
clock
complex-numbers
custom-set
etl
hamming
list-ops
nucleotide-count
queen-attack
rational-numbers
react
rest-api
sgf-parsing
word-count

At first sight, there are different problems:

ps: We should probably consider upgrading ajv to a newer version and the JSON Schema to Draft 2020-12.

Edit:

I just saw the README.md, and I'm a little bit confused. If test cases are immutable, how could we change an offending key?

Test cases are immutable, which means that once a test case has been added, it never changes.

It's been a while since the time I was active here, so maybe I'm getting something wrong, but I think this design rules out enforcing new schema restrictions on input and expected.

ErikSchierboom commented 2 years ago

Wow, so many keys in a different format. That is problematic.

It's been a while since the time I was active here, so maybe I'm getting something wrong, but I think this design rules out enforcing new schema restrictions on input and expected.

You are totally correct. The only option we have is to add new test cases that reimplement existing ones, but such a change shouldn't change the actual keys as it will still break things that way. I think we'll have to consider not enforcing lowerCamelCase for input and expected values, unless someone has a great idea?

petertseng commented 2 years ago

Ah, right, because in word-count a track's test generator is never meant to read the key and values of the expected; it's just meant to treat the expected as an object to compare the answer to. As opposed to something like zipper where a track's test generator does want to read the keys and values of expected.

What is the minimum we could do? Something like:

Only enforce the first level of keys under input. Do not descend into objects nested underinput to validate their keys, nor enforce anything in expected.

Would that still have value?

ErikSchierboom commented 2 years ago

Only enforce the first level of keys under input. Do not descend into objects nested underinput to validate their keys, nor enforce anything in expected.

Would that still have value?

I think so, as the keys on the first level of input are usually directly converted to parameter names. This is not the case for expected.

@rbasso Could you check to see if there are any first-level input keys that don't use lowerCamelCase?

rbasso commented 2 years ago

@rbasso Could you check to see if there are any first-level input keys that don't use lowerCamelCase?

@ErikSchierboom , changing the schema to enforce lowerCamelCaseObject the rule (edit) on input we get the following errors:

clock           : property name 'clock1'      is invalid
complex-numbers : property name 'z1'          is invalid
custom-set      : property name 'set1'        is invalid
hamming         : property name 'strand1'     is invalid
list-ops        : property name 'list1'       is invalid
queen-attack    : property name 'white_queen' is invalid
rational-numbers: property name 'r1'          is invalid

Anyway, I think this PR should be closed for now, because the inability to enforce new schema rules is a limitation implied by design choices, which cannot be simply fixed without a new design.

If I remember correctly, when the schema was created (#602) - after a long and heated discussion (#336) I partially regret 😬 - we allowed nested groups of tests forming a tree-like structure, to model the existing data, and the entire file was versioned to allow test generators to detect API-breaking and content-only changes.

If I understand correctly #1674, test cases now have UUIDs and are treated as immutable entities, which tracks may implement.

I think #602 and #1674 have underlying conflicting views, because, in #602, the canonical-data.json represent the latest version of a test tree, but, after #1674, it is also a storage of all test versions.

I think the current canonical-data.json format does both tasks badly:

I didn't read all the discussion behind #1674, so take anything I wrote with a grain of salt, but I guess this is a "design smell"

ErikSchierboom commented 2 years ago

Thanks for the writeup. We'll need to consider these things in more detail.