cue-lang / cue

The home of the CUE language! Validate and define text-based and dynamic configuration
https://cuelang.org
Apache License 2.0

encoding/jsonschema: metaschema URI `http://json-schema.org/schema#` not recognized #3461

Open andrewthauer opened 2 months ago

andrewthauer commented 2 months ago

What version of CUE are you using (cue version)?

$ cue version
cue version v0.10.0

Does this issue reproduce with the latest stable release?

Yes. I've also tried 0.11.0-alpha.1 and the same problem exists. This seems to be a regression introduced in v0.10.0.

What did you do?

I'm trying to run cue import on various third-party JSON Schema files that contain the $schema URI http://json-schema.org/schema#

For example:

FAIL

exec cue import

-- schema.json --
{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "foobar": {
      "type": "string",
      "title": "Foobar"
    }
  }
}

PASS in >= v0.10.0:

exec cue import

-- schema.json --
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "foobar": {
      "type": "string",
      "title": "Foobar"
    }
  }
}

What did you expect to see?

cue import imports the JSON Schema without an error, as it does in v0.9.2.

What did you see instead?

It fails with the following error in >= v0.10.0:

invalid $schema URL "http://json-schema.org/schema#": $schema URI not recognized:

When playing around with URIs, it seems this URI is redirected to https://json-schema.org/draft/2020-12/schema. Perhaps there is an issue with the redirect and/or the # fragment handling.

rogpeppe commented 2 months ago

Thanks for the report!

The issue with the metaschema URL http://json-schema.org/schema# in particular is that its meaning changes over time, according to which version is redirected to.

Currently the jsonschema.Extract logic knows about the specific schemas mentioned in the specifications, but it does not know about extended metaschemas, and it does not do any runtime URL fetching in order to acquire metaschemas for schemas it does not know about.

In general, a schema URI need not correspond to a "retrievable resource", so it's not necessarily even possible to do the right thing here in general.

The problem is that because the semantics of JSON Schema vary in a backwardly incompatible way from version to version, without knowing the actual version in play, we can't do a correct job of converting the schema.
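To make the offline lookup concrete, here is a minimal sketch in Python. It is not the jsonschema.Extract implementation: the table lists the metaschema URIs published by the JSON Schema specifications, and the function name resolve_version, as well as the assumption that exactly these URIs are the "known" ones, are hypothetical.

```python
# Metaschema URIs published by the JSON Schema specifications.
# Assumption: which of these a given CUE release accepts may differ.
KNOWN_METASCHEMAS = {
    "http://json-schema.org/draft-04/schema#": "draft-04",
    "http://json-schema.org/draft-06/schema#": "draft-06",
    "http://json-schema.org/draft-07/schema#": "draft-07",
    "https://json-schema.org/draft/2019-09/schema": "2019-09",
    "https://json-schema.org/draft/2020-12/schema": "2020-12",
}


def resolve_version(schema_uri: str) -> str:
    """Map a $schema URI to a draft name using only the local table (no network)."""
    try:
        return KNOWN_METASCHEMAS[schema_uri]
    except KeyError:
        # An unversioned alias such as http://json-schema.org/schema# ends up here,
        # because its meaning can only be discovered by following a redirect.
        raise ValueError(f"$schema URI not recognized: {schema_uri!r}") from None
```

With a table like this, http://json-schema.org/schema# is rejected simply because it is not a key, regardless of where the URL currently redirects.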

I see a few possibilities for addressing this:

  1. reject such schemas (https://json-everything.net/json-schema/ is an example validator that takes this approach)
  2. ignore unknown metaschema IDs (https://www.jsonschemavalidator.net/ is an example validator that takes this approach), falling back to the default metaschema.
  3. provide an option to explicitly enable the above behaviour
  4. provide a way of overriding the metaschema ID in the configuration
  5. provide a way of mapping from arbitrary metaschema URIs to known metaschema URIs in the configuration
  6. add code to fetch and interpret metaschema URLs that are not known. Such code could interpret the $vocabulary keyword to determine which subsets of the metaschema are known and map those to the versions known by encoding/jsonschema.

ISTM that the last approach (6) is both complex to implement and involves side effects that might not be desirable. I'd prefer to avoid doing that for now at least.

2 is easiest to do, but runs the risk of hiding legitimate errors.

4 is awkward to define, as $schema directives can be nested: would the override apply to all nested $schema keywords, or just the outer level?

5 is going to be bulky and rather error-prone to specify - it makes more sense to do this in the Go API than on the command line, I think.

3 seems like a reasonable possibility, but the risk here is that the default metaschema version is not the version that the schema is actually written to.
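The nesting question raised for option 4 is easier to reason about with the affected locations in view. The following Python sketch lists every $schema keyword in a document together with a path-like location. The helper name find_schema_keywords and the sample document are hypothetical, and the walk treats any string-valued "$schema" member as a keyword, which is a simplification.

```python
def find_schema_keywords(node, path="#"):
    """Yield (path, uri) for every string-valued "$schema" member in a JSON value.

    The path is only path-like: keys are joined with "/" and are not escaped
    the way a real JSON Pointer would require.
    """
    if isinstance(node, dict):
        if isinstance(node.get("$schema"), str):
            yield path, node["$schema"]
        for key, value in node.items():
            yield from find_schema_keywords(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_schema_keywords(item, f"{path}/{index}")


# Hypothetical document with an outer and a nested $schema keyword.
doc = {
    "$schema": "http://json-schema.org/schema#",
    "$defs": {
        "inner": {
            "$id": "inner",
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "string",
        }
    },
}

for location, uri in find_schema_keywords(doc):
    print(location, uri)
```

An override that applies only to the outer level would touch the entry at "#", while one that applies everywhere would also replace the nested draft-07 declaration.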

For now, my gut feeling is to go with 1: rejecting unknown metaschema URIs. That way, you as the invoker of cue import will be forced to decide which actual version should apply, and could apply it by writing a little script before invoking cue import, for example:

jq 'walk(
    if type == "object" and ."$schema" == "http://json-schema.org/schema#"
    then ."$schema" = "https://json-schema.org/draft/2020-12/schema"
    else .
    end
)'
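For pipelines where jq is not available, roughly the same rewrite can be sketched in Python. The REPLACEMENTS table and the function name rewrite_schema_uris are hypothetical, and the target version is a choice the invoker has to make: 2020-12 is used here only because it mirrors the jq example.

```python
import json

# Hypothetical mapping chosen by the invoker. Pick the draft the schema was
# actually written against, which is not necessarily 2020-12.
REPLACEMENTS = {
    "http://json-schema.org/schema#": "https://json-schema.org/draft/2020-12/schema",
}


def rewrite_schema_uris(node, replacements=REPLACEMENTS):
    """Return a copy of node with every matching "$schema" value replaced."""
    if isinstance(node, dict):
        out = {key: rewrite_schema_uris(value, replacements) for key, value in node.items()}
        uri = out.get("$schema")
        if isinstance(uri, str) and uri in replacements:
            out["$schema"] = replacements[uri]
        return out
    if isinstance(node, list):
        return [rewrite_schema_uris(item, replacements) for item in node]
    return node


schema = json.loads('{"$schema": "http://json-schema.org/schema#", "type": "object"}')
print(json.dumps(rewrite_schema_uris(schema), indent=2))
```

Like the jq version, this rewrites nested $schema keywords as well as the outer one and leaves every other URI untouched.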

@andrewthauer I'd be interested in your thoughts on the above.

andrewthauer commented 1 month ago

Thanks for the detailed breakdown of the issues @rogpeppe!

First, let me provide some context. We are using cue to validate Helm values files. For our internal Helm charts we use an embedded cue file, and we run cue import on third-party charts that provide a JSON Schema. This validation is fully automated via a custom script we've written to process everything provided by Helm release tarballs.

I'm no expert on JSON Schema by any stretch, so I can't really comment on the deeper implications of these choices. From my perspective, it would be ideal not to require any custom pre-processing of JSON Schema files. That said, ignoring errors or making assumptions about the schema version could also have strange and hard-to-track-down side effects.

What I'm not sure about is why the schemas would be unknown. When a schema isn't versioned, would that not mean it maps to something akin to "latest"? The URI in question does map to 2020-12, as resolved from the redirect. I'm also a little confused and curious why these same jsonschema files work fine in v0.9.2 and what might have changed?

rogpeppe commented 1 month ago

When a schema isn't versioned, would that not mean it maps to something akin to "latest"?

The problem is that we can only tell that by doing an HTTP GET to the URL, and our current logic does not use any network connections at all. I'd suggest that anyone relying on "latest" logic for their JSON Schema version is asking for trouble, as semantics can and do vary from version to version.

That said, we could potentially encode the "latest" logic into the CUE binary, so a given CUE binary will always interpret http://json-schema.org/schema# as the latest known version. However...

That said, ignoring errors or making assumptions about the schema version could also have strange and hard-to-track-down side effects.

the above is what concerns me. I wonder if there's any way to make it clear to your upstreams that they would be better off using an actual fixed schema version that corresponds to the version they're writing against. I wonder if these schemas actually are written against 2020-12 - it seems quite likely they might be written against another commonly used variant such as draft-07.

I'm also a little confused and curious why these same jsonschema files work fine in v0.9.2 and what might have changed?

The thing that's changed is that we now understand JSON Schema versions properly, according to the specification(s). Beforehand, the schema version was essentially ignored.

There is another possibility here: we could potentially provide a way of saying "if you encounter an unknown schema version, use this known schema version instead", but that feature would need some design work.
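As a starting point for that design work, the fallback idea could look roughly like the Python sketch below. Everything here is hypothetical, including the function name effective_schema_uri and its fallback_uri parameter: nothing like it exists in CUE today. The second return value reports whether a substitution happened, so that a caller could warn instead of silently hiding a mismatch.

```python
from typing import Optional, Tuple

# Metaschema URIs published by the JSON Schema specifications.
KNOWN_URIS = frozenset({
    "http://json-schema.org/draft-04/schema#",
    "http://json-schema.org/draft-06/schema#",
    "http://json-schema.org/draft-07/schema#",
    "https://json-schema.org/draft/2019-09/schema",
    "https://json-schema.org/draft/2020-12/schema",
})


def effective_schema_uri(schema_uri: str, fallback_uri: Optional[str] = None) -> Tuple[str, bool]:
    """Return (uri_to_use, substituted) for a $schema value.

    Known URIs are returned unchanged. Unknown URIs are replaced by
    fallback_uri when one is given, and rejected otherwise.
    """
    if schema_uri in KNOWN_URIS:
        return schema_uri, False
    if fallback_uri is None:
        raise ValueError(f"$schema URI not recognized: {schema_uri!r}")
    if fallback_uri not in KNOWN_URIS:
        raise ValueError(f"fallback is not a known metaschema URI: {fallback_uri!r}")
    return fallback_uri, True
```

Without a fallback this behaves like outright rejection, so the stricter behaviour stays the default and the substitution is always an explicit decision by the invoker.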

For now, I'd suggest that a preprocessing step probably remains the best option.

andrewthauer commented 1 month ago

Thanks again for breaking this down even more @rogpeppe. I can see why we would not want to rely on network access when resolving the schema versions, so this makes sense.

That said, we could potentially encode the "latest" logic into the CUE binary, so a given CUE binary will always interpret http://json-schema.org/schema# as the latest known version.

I kind of think this might be the ideal case, since it's more or less what is being asked for, right? Even though, to your point, the author might not know that. I'm always in favour of less work for the consumers of tools, so longer-term, hands-off solutions are more appealing to me, assuming they don't hide hard-to-debug issues when things fail.

I wonder if there's any way to make it clear to your upstreams that they would be better off using an actual fixed schema version that corresponds to the version they're writing against. I wonder if these schemas actually are written against 2020-12 - it seems quite likely they might be written against another commonly used variant such as draft-07.

The Helm charts we are using are open source, so I could either nudge the maintainers about this in GH issues or open PRs. Luckily we don't use a lot of them for this purpose, so that might be ok. I think the problem here is that a lot of folks copy examples from other places and don't give much thought to the implications. So this could potentially be a very widespread problem that would be hard to squash.

For now, I'd suggest that a preprocessing step probably remains the best option.

Yeah, this shouldn't be too much trouble to work around for our current use case.