canonical-data.json needs standardisation

catb0t commented 8 years ago

Hello,

I maintain the Factor track, and I'd like to automate generation of unit tests for exercises in my language.

Looking at exercises/leap/canonical-data.json it would seem to be quite simple. However, many of the canonical-data.json files don't use the set of keys found in leap's JSON, and this makes them difficult to automate around.

There are, as far as I can tell, two solutions to the problems introduced by the inconsistencies.

1. Rather than hardcoding the description, input and expected keys, use a regex / fuzzy find to group keys into description, input and output. The disadvantage of this is twofold: not only must my code be flimsy, but so must everyone else's, and all of it is liable to break on anyone's whim.
2. Standardise on a fixed, predictable set of keys and what their values represent. This makes the jobs of track maintainers easier, simplifies interacting code, and future-proofs the API and the code.
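
To illustrate why the first option is fragile, here is a minimal Python sketch of fuzzy key grouping. The patterns and the names `ROLE_PATTERNS`, `guess_role` and `group_case` are hypothetical, made up for this example; they are not taken from any track's generator.

```python
import re

# Hypothetical patterns: a guess at how keys might be named, nothing more.
ROLE_PATTERNS = {
    "description": re.compile(r"desc|name", re.I),
    "output": re.compile(r"expect|output|result", re.I),
    "input": re.compile(r"input|year|number", re.I),
}


def guess_role(key):
    """Return the first role whose pattern matches the key, or None."""
    for role, pattern in ROLE_PATTERNS.items():
        if pattern.search(key):
            return role
    return None


def group_case(case):
    """Sort the keys of one test case into guessed roles."""
    grouped = {"description": {}, "input": {}, "output": {}, "unknown": {}}
    for key, value in case.items():
        grouped[guess_role(key) or "unknown"][key] = value
    return grouped
```

This works for a leap-style case, but a key such as output_base (an input in all-your-base) would be guessed as an output, which is the kind of breakage the first option invites.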

I think standardisation would be greatly beneficial, and if we make an API more accessible, perhaps more tracks will automate generation / regeneration of tests, which would be positive.

But before I open a pull request with structural changes to hundreds of lines of data, I'd like some feedback.

First, does anyone object to changing the names of the keys? They're rather haphazard (nearly as if it had been written for humans to read ): ), and some exercises are missing canonical-data.json altogether; consequently I have difficulty believing there are programs reading this stuff. Second, what keys should be used? I'm thinking something like:

For exercises with one input translating to one output, description, input and output.
For exercises with multiple inputs / multiple outputs, description, input_N, output_N.

Note that it would be disadvantageous to use an array for multiple inputs / outputs where an array is not part of the exercise because it would be hard or impossible to tell the difference between multiple inputs and an actual array. We could have keys like input_multi which is an array of inputs, I suppose?
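
A rough Python sketch of how a generator might read the proposed keys, assuming the input_N naming above. The helper names `collect_numbered` and `arguments` are hypothetical. Because each argument has its own key, an array value is always a single argument and never a list of arguments.

```python
import re


def collect_numbered(case, prefix):
    """Collect values of keys like input_1, input_2 in numeric order."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    found = []
    for key, value in case.items():
        match = pattern.match(key)
        if match:
            found.append((int(match.group(1)), value))
    # Sort on the number only, so the values themselves are never compared.
    return [value for _, value in sorted(found, key=lambda pair: pair[0])]


def arguments(case):
    """Return the positional arguments for one test case."""
    if "input" in case:
        # A single input, even if its value happens to be an array.
        return [case["input"]]
    return collect_numbered(case, "input")
```

With this layout, `{"input": [1, 2, 3]}` yields one argument (the array), while `{"input_1": 2, "input_2": [1], "input_3": 10}` yields three.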

Thoughts?

catb0t commented 8 years ago

Also while I'm talking about this API, are the canonical-data.json hosted somewhere other than https://raw.githubusercontent.com/exercism/x-common/master/exercises/${EXERCISE}/canonical-data.json, or is that where I should grab it from?

catb0t commented 8 years ago

@kytrinyx Idk if you get these notifications :(

petertseng commented 8 years ago

Duplicates #336

> Also while I'm talking about this API, are the canonical-data.json hosted somewhere other than https://raw.githubusercontent.com/exercism/x-common/master/exercises/${EXERCISE}/canonical-data.json, or is that where I should grab it from?

I believe that is the place; at least I'm not aware of any other places!
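
For what it's worth, a small Python sketch of building that URL and fetching the file, using only the standard library. The URL pattern is the one quoted above; the function names are hypothetical, and the sketch assumes the file is reachable at that location and contains valid JSON.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base of the raw URL quoted in this thread.
RAW_BASE = "https://raw.githubusercontent.com/exercism/x-common/master/exercises"


def canonical_data_url(exercise):
    """Build the raw URL for one exercise's canonical-data.json."""
    return f"{RAW_BASE}/{quote(exercise, safe='')}/canonical-data.json"


def fetch_canonical_data(exercise):
    """Download and parse the file; raises if it is missing or not JSON."""
    with urlopen(canonical_data_url(exercise)) as response:
        return json.load(response)
```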

> nearly as if it had been written for humans to read

I think this may not be that far from the truth, though I will argue later on: "why not both?"

> I have difficulty believing there are programs reading this stuff.

I documented a few examples in https://github.com/exercism/x-api/issues/113

Go:

An example_gen.go in each exercise directory e.g. https://github.com/exercism/xgo/blob/master/exercises/leap/example_gen.go - it defines the structure that the file is expected to have.
Common code at https://github.com/exercism/xgo/blob/master/gen/gen.go

Ruby:

A small script in the bin directory e.g. https://github.com/exercism/xruby/blob/master/bin/generate-leap
A file in lib adding convenience functions on each case e.g. https://github.com/exercism/xruby/blob/master/lib/leap_cases.rb
An example.tt in each exercise directory e.g. https://github.com/exercism/xruby/blob/master/exercises/leap/example.tt

Together, the fields referenced in example.tt and lib define the structure that the JSON is supposed to have.

Scala: https://github.com/exercism/xscala/tree/master/testgen/src/main/scala

There are case classes defining the expected structure.

So what does this all mean? It means that currently, these tracks have to define the expected structure on a per-exercise basis! Standardisation could allow them to have less custom logic per exercise. I'm not sure it's avoidable for some statically typed languages, though, since they may still have to define the types of the values (some exercises have integer inputs, some exercises have string inputs, etc.).

> For exercises with multiple inputs / multiple outputs, description, input_N, output_N.

I see that this is easy for a machine to read. Can we simultaneously make it easy for a human to read as well? Consider e.g. https://github.com/exercism/x-common/blob/master/exercises/all-your-base/canonical-data.json: I imagine that many tracks will pass in three inputs (input_base, input_digits, output_base) and then check that the output digits are as specified in output_digits. If the data then simply looked like "input_1": 2, "input_2": [1], "input_3": 10, "output": [1], I think it might not be clear to a human what the difference between input_1 and input_3 is, and I consider this important for being able to understand PRs that propose to change the test cases.
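
As one possible middle ground, here is a Python sketch in which the inputs keep their descriptive names but live under a single fixed key. The layout (an "input" object plus "expected") is an assumption for illustration, not an agreed format, and `rebase` is only a stand-in solution so the example runs.

```python
def rebase(input_base, input_digits, output_base):
    """Stand-in solution: convert digits from one base to another."""
    value = 0
    for digit in input_digits:
        value = value * input_base + digit
    digits = []
    while value > 0:
        value, remainder = divmod(value, output_base)
        digits.append(remainder)
    # Representing zero as [0] is an assumption of this sketch.
    return digits[::-1] or [0]


def run_case(case, solution):
    """Pass the named inputs as keyword arguments and compare the result."""
    return solution(**case["input"]) == case["expected"]


# Hypothetical case layout: named inputs grouped under one fixed key.
case = {
    "description": "single bit one to decimal",
    "input": {"input_base": 2, "input_digits": [1], "output_base": 10},
    "expected": [1],
}
```

A generator only needs to know the fixed keys "description", "input" and "expected", while a human reading a PR still sees input_base and output_base by name.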

catb0t commented 8 years ago

I didn't think this really was a dupe of #336 because I'd read that before, but perhaps you're right.

I believe we can simultaneously make the JSON easier for humans and programs to read, but the way it is now makes it very hard to make a generalising program.

The examples you've linked to share a common problem: because each exercise has a different structure, each exercise needs its own separate, different test generator program.

This is, to me, to put it plainly, an insane amount of unnecessary work -- my goal with exercism.autogen-exercises is to generate all the tests for all the exercises at once, which should be trivially possible. I don't want a different ${exercisename}-testgen.factor for each different JSON structure.
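
A rough sketch of what I mean, in Python rather than Factor for brevity. It assumes every exercise uses the same description / input / output keys with a single input; the function `render_tests` and the shape of the emitted assertions are hypothetical.

```python
def render_tests(exercise, cases):
    """Render one assertion per case, using the same code for every exercise."""
    function_name = exercise.replace("-", "_")
    lines = []
    for case in cases:
        lines.append(f"# {case['description']}")
        lines.append(
            f"assert {function_name}({case['input']!r}) == {case['output']!r}"
        )
    return "\n".join(lines)
```

With a fixed structure, the only per-exercise input is the exercise name and its list of cases; no ${exercisename}-specific generator is needed.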

kytrinyx commented 8 years ago

> Idk if you get these notifications :(

I do, but I've been traveling for the past week and am only just catching up.

I would like this to be merged with #336 which is the same topic. The goal is the same in both of these threads: to be able to generate the test suites.

@catb0t would you mind collecting your arguments and observations from this thread and adding them to the other one? That would let you and @zenspider and @devonestes get on the same page about what the problem and potential solutions are, and others could chime in to help sort it out as well.

exercism / problem-specifications

canonical-data.json needs standardisation #376