bioimage-io / spec-bioimage-io

Specification for the bioimage.io model description file.
https://bioimage-io.github.io/spec-bioimage-io/
MIT License
18 stars 17 forks source link

single source of truth for specification #74

Closed k-dominik closed 3 years ago

k-dominik commented 3 years ago

Right now there are two souces of truth for the specification that I know of. The Readme in this Repo as well as the validator. Those two are out of sync. It would be nice to generate one out of the other - something like swagger would do for rest-apis.

oeway commented 3 years ago

Good idea @k-dominik

We had some discussions a while ago about generating json schema from the python code. I just did a quick search on documentation generator for json schemas and find this python library which can potentially use: https://coveooss.github.io/json-schema-for-humans/#/

They can generate markdown files like this one: https://github.com/coveooss/json-schema-for-humans/blob/master/docs/examples/examples_md_with_badges/additional_properties.md

See here for more examples: https://coveooss.github.io/json-schema-for-humans/#/Examples

If we go for that, the json schema can also be used by the website where we can render a UI form for users to generate model.yml file.

constantinpape commented 3 years ago

Yes, having a json-schema would be awesome. I think we made quite a big mistake when we didn't just start out from using it... It's also fairly easy to write a custom doc generator; I did this for a recent spec I worked on: https://github.com/mobie/mobie.github.io/blob/master/scripts/schema_to_markdown.py

FynnBe commented 3 years ago

we might want to take a close look at apispec which has built-in support for marshmallow, which we use in the validator. We can generate a json schema from it, but it basically looses a bit of detail already...

constantinpape commented 3 years ago

we might want to take a close look at apispec which has built-in support for marshmallow, which we use in the validator.

That looks pretty straightfoward. As an alternative, I found https://github.com/fuhrysteve/marshmallow-jsonschema.

We can generate a json schema from it, but it basically looses a bit of detail already...

What is lost? jsonschema is pretty powerful, so I think it should be able to express most of our static constraints in the spec. Of course dynamic checks, like whether a file exists or not cannot be validated with it.

What worries me a bit about this is the level of abstraction: to change something in the spec, we would need to change it in the marshmallow code, which would then be turned into jsonschema and which would then be rendered into the description. If we go that road, we would need to describe the marhsmallow validator (which I find has a pretty confusing code base) much better.

FynnBe commented 3 years ago

As an alternative, I found https://github.com/fuhrysteve/marshmallow-jsonschema.

We actually already use that! I just updated it in https://github.com/bioimage-io/python-bioimage-io/pull/58 I see the jsonschema more as a less detailed spec description to render a web form, etc. but imho it cannot replace the more thorough marshmallow validation...

What is lost? jsonschema is pretty powerful, so I think it should be able to express most of our static constraints in the spec. Of course dynamic checks, like whether a file exists or not cannot be validated with it.

jsonschema is much less powerfull than marshmallow. Afaik, checking if two fields 'match', or if a field is optional dependent on another one, cannot be represented in json schema. In marshmallow we have for examle this check:

    @validates_schema
    def language_and_framework_match(self, data, **kwargs):
        field_names = ("language", "framework")
        valid_combinations = {
            ("python", "scikit-learn"): {"requires_source": False},
            ("python", "pytorch"): {"requires_source": True},
            ("python", "tensorflow"): {"requires_source": False},
            ("java", "tensorflow"): {"requires_source": False},
        }
        combination = tuple(data[name] for name in field_names)
        if combination not in valid_combinations:
            raise PyBioValidationException(f"invalid combination of {dict(zip(field_names, combination))}")

        if valid_combinations[combination]["requires_source"] and data.get("source") is None:
            raise PyBioValidationException(
                f"{dict(zip(field_names, combination))} require source code to be specified."
            )

What worries me a bit about this is the level of abstraction: to change something in the spec, we would need to change it in the marshmallow code, which would then be turned into jsonschema and which would then be rendered into the description. If we go that road, we would need to describe the marhsmallow validator (which I find has a pretty confusing code base) much better.

I share your concerns, thus I am looking into a low tech solution to generate the spec from the marshmallow specr source code. Better documentation and improvements to the marshmallow validator would certainly be a good idea independently!

constantinpape commented 3 years ago

jsonschema is much less powerfull than marshmallow. Afaik, checking if two fields 'match', or if a field is optional dependent on another one, cannot be represented in json schema. I

That's not true, there are at least two patterns to check a field conditional on another field: https://json-schema.org/understanding-json-schema/reference/conditionals.html https://stackoverflow.com/questions/58169098/json-schema-required-field-depending-on-other-field-value

FynnBe commented 3 years ago

That's not true, there are at least two patterns to check a field conditional on another field:

I didn't know, that's really neat! Wondering if we could represent the whole think purely in json schema, I am looking at the other validations we apply. e. g

    @validates_schema
    def matching_lengths(self, data, **kwargs):
        scale = data["scale"]
        offset = data["offset"]
        if len(scale) != len(offset):
            raise PyBioValidationException(f"scale {scale} has to have same length as offset {offset}!")

Even this might be possible to validate with json schema (with "AllOf" somhow..?), but would probably generate very cryptic error messages...

Maybe this is beside the point anyway, as we do want additional dynamic checks as well.

edit: some sources: can't compare array lengths (but it's old json 0.4): https://stackoverflow.com/questions/38624299/json-schema-compare-two-properties-to-be-equal-length-array no arrays, but interdependent field checks are somehow possible (but very complicated): https://stackoverflow.com/questions/53206547/json-schema-property-dependent-on-value-of-previous-property

there might definitely be some more json schema functionality I am not aware of...

constantinpape commented 3 years ago

Even this might be possible to validate with json schema (with "AllOf" somhow..?), but would probably generate very cryptic error messages...

There are some subtleties to this. It's certainly possible if you know the allowed length values (which would be the case for us). You would use oneOf and then specify the exact length via minElements and maxElements for both scale and offset for all length values. There might even be a more elegant way if you predefine your list types.

Maybe this is beside the point anyway, as we do want additional dynamic checks as well.

In general I agree, there are things that become very complicated, so it's better to do them with dynamic checks rather than using jsonschema.

But a simple if property A exists we must also have property B is straighforward using the oneOf keyword. What helps here to keep specs organized is to use references for more complex types, so that these don't need to be repeated all the time.

oeway commented 3 years ago

Hi, I think it might be a good idea that for us to try to express our spec completely in json schema, that's the way for ensuring cross-compatibility between different consumer software, i.e. every consumer software will be able to use a standard json schema validation library to validate the files.

constantinpape commented 3 years ago

I see two ways moving forward here:

I also lean towards option 1, but let's discuss this a bit more here and in the next meeting.

FynnBe commented 3 years ago

I favor option 2, because

joshmoore commented 3 years ago

Apologies for piping up, but I wanted to add that I've been looking into https://github.com/common-workflow-language/schema_salad for the NGFF work. It is the tool which underlies the CWL language. What I find interesting is the multiple representation pipeline. The schema gets written in a re-usable way (usuably YAML) and that can be used to JSON-LD contexts (which is what I'm focused on) or more static representations like Avro IDL, jsonschema, or even source code. I have a vague hope that we will be able to combine the metadata graph from an OME-Zarr with, e.g., a CWL workflow, and I've just started toying with the idea of doing that with an bioimage model :wink:

FynnBe commented 3 years ago

I've been looking into https://github.com/common-workflow-language/schema_salad

Interesting solution! Does this framework allow for validating inputs depending on other inputs? To stay with their quick start example, does it allow you to specify a rule like "if more than 5kg bananas are in the basket, cucumbers are not allowed"?

constantinpape commented 3 years ago

I've been looking into https://github.com/common-workflow-language/schema_salad for the NGFF work.

Glad you're thinking in similar terms for NGFF already; I wanted to bring up jsonschema in that context eventually.

I hadn't heard of salad before, I will have a look.

constantinpape commented 3 years ago

Ok, we decided in the meeting now, that we will use marshmallow as the reference source for the model.yaml and also use it to generate the description (https://github.com/bioimage-io/python-bioimage-io/pull/61), that will be rendered on https://bioimage.io/docs/#/.

constantinpape commented 3 years ago

The spec validator is now the source of truth and has to moved into this repository in #80.