CesiumGS / wetzel

Generate Markdown documentation from JSON Schema
Apache License 2.0

Either $ref resolution doesn't work, or $id is ignored. #81

Open JC3 opened 2 years ago

JC3 commented 2 years ago

Wetzel version: Whatever it is in git right now.
OS: Windows 10
Node: 16.13.1

Given the following two schemas placed in a subdirectory named schemas:

schemas\a.json:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "http://example.com/schema/schema_a",
    "title": "schema a",
    "type": "object",
    "properties": {
        "something": { "type": "string" }
    }
}

schemas\b.json:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "http://example.com/schema/schema_b",
    "$ref": "http://example.com/schema/schema_a",
    "title": "schema b",
    "description": "schema a with an additional property",
    "type": "object",
    "properties": {
        "something_else": { "type": "string" }
    }
}

When I run:

node bin\wetzel.js -p schemas -i "[\"a.json\"]" schemas\b.json

Wetzel fails with:

Error: Unable to find $ref http://example.com/schema/schema_a
    at replaceRef (C:\...\wetzel\lib\replaceRef.js:54:19)

Why isn't it loading a.json and how do I make it find the references? Is my understanding of the -i option incorrect?

I tried hand-wavily adding -s schemas as well, but the result was the same.

Thanks!

JC3 commented 2 years ago

Couple new points:

  1. The following setup also does not work:
    • Change a.json's id to "$id": "http://example.com/schema/a.json"
    • Change b.json's id to "$id": "http://example.com/schema/b.json"
    • Change b.json's ref to "$ref": "http://example.com/schema/a.json"
  2. Neither does this:
    • Keep a.json's id as "$id": "http://example.com/schema/schema_a"
    • Keep b.json's id as "$id": "http://example.com/schema/schema_b"
    • Change b.json's ref to "$ref": "schema_a"
  3. The following setup does work:
    • Keep a.json's id as "$id": "http://example.com/schema/schema_a"
    • Keep b.json's id as "$id": "http://example.com/schema/schema_b"
    • Change b.json's ref to "$ref": "a.json"

(Note: Neither -p nor -s are relevant here. Also, as a sanity check to make sure it wasn't trying to resolve network resources, I tried all of the above with the xri scheme instead of http, and it did not affect the outcomes.)

In other words, the only working scenario is if the $ref is relative and matches the local filename, and the $ids are ignored.

This is very much not correct resolution behavior.
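To make the mismatch concrete, here is a minimal sketch of what each `$ref` variant resolves to when b.json's `$id` is used as the base URI. It assumes only the built-in WHATWG `URL` class, whose two-argument constructor performs RFC 3986 reference resolution; the helper name `resolveRef` is made up for illustration and is not part of wetzel.

```javascript
// Base URI of b.json, taken from its $id.
const baseOfB = "http://example.com/schema/schema_b";
// Canonical URI of a.json, taken from its $id.
const idOfA = "http://example.com/schema/schema_a";

// Hypothetical helper: resolve a $ref value against a base URI.
function resolveRef(ref, base) {
    return new URL(ref, base).href;
}

// Original report: absolute $ref.
const absolute = resolveRef("http://example.com/schema/schema_a", baseOfB);
// Point 2: relative $ref that names the last segment of a.json's $id.
const relativeId = resolveRef("schema_a", baseOfB);
// Point 3: relative $ref that names the local file.
const fileName = resolveRef("a.json", baseOfB);

console.log(absolute === idOfA);   // true
console.log(relativeId === idOfA); // true
console.log(fileName === idOfA);   // false
console.log(fileName);             // http://example.com/schema/a.json
```

Under `$id`-based resolution, the first two forms identify schema a, while the one form that currently works points at a URI that no schema has declared.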

From 8.2.1 (emphasis mine; first two quoted paras included to give context to third):

The "$id" keyword identifies a schema resource with its canonical [RFC6596] URI. ... If present, the value for this keyword ... MUST represent a valid URI-reference [RFC3986] ... and MUST resolve to an absolute-URI. ... The absolute-URI also serves as the base URI for relative URI-references in keywords within the schema resource, in accordance with RFC 3986 section 5.1.1 [RFC3986] regarding base URIs embedded in content.

And 9.1.1 echoes this:

Unless the "$id" keyword described in an earlier section is present in the root schema, this base URI SHOULD be considered the canonical URI of the schema document's root schema resource.

Also, from 9.1.2 (emphasis mine):

The use of URIs to identify remote schemas does not necessarily mean anything is downloaded, but instead JSON Schema implementations SHOULD understand ahead of time which schemas they will be using, and the URIs that identify them. ... Implementations SHOULD be able to associate arbitrary URIs with an arbitrary schema and/or automatically associate a schema's "$id"-given URI, depending on the trust that the validator has in the schema. Such URIs and schemas can be supplied to an implementation prior to processing instances, or may be noted within a schema document as it is processed, producing associations as shown in appendix A.

Furthermore, from 9.2 (emphasis mine):

Schemas can be identified by any URI that has been given to them, including a JSON Pointer or their URI given directly by "$id". In all cases, dereferencing a "$ref" reference involves first resolving its value as a URI reference against the current base URI per RFC 3986 [RFC3986].

If the resulting URI identifies a schema within the current document, or within another schema document that has been made available to the implementation, then that schema SHOULD be used automatically.

In other words, unless there is a valid reason in some specific circumstance, resolution (for local files) is supposed to work like this:

  1. If there is an $id, that's the canonical URI, otherwise the local file URI is the canonical URI. This canonical URI is also the base URI.
  2. When $ref is encountered, if it's relative, then it is resolved to an absolute URI using the base URI.
  3. The schema whose canonical URI is the resolved $ref URI is used, regardless of what the canonical URI actually is -- i.e. not necessarily related to the local filename.

And the implication is that the documents should be loaded first, before $ref'ed schemas are resolved (or at least, in an appropriate order, circular refs notwithstanding), so that the canonical URIs can be determined and mapped to the appropriate [sub]schemas.
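The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not how wetzel works: `buildRegistry` and `lookupRef` are hypothetical names, the `file:///work/...` retrieval URIs are invented, and fragments in `$ref` are not handled at all.

```javascript
// Load every available document first and index it by canonical URI.
function buildRegistry(documents) {
    const registry = new Map();
    for (const { retrievalUri, schema } of documents) {
        // Step 1: $id (resolved against the retrieval URI) is the canonical
        // URI; without $id, the retrieval URI itself is canonical.
        const canonical = schema.$id
            ? new URL(schema.$id, retrievalUri).href
            : new URL(retrievalUri).href;
        registry.set(canonical, schema);
    }
    return registry;
}

function lookupRef(ref, baseUri, registry) {
    // Step 2: resolve the (possibly relative) $ref against the base URI.
    const target = new URL(ref, baseUri).href;
    // Step 3: use whichever schema declared that canonical URI.
    const schema = registry.get(target);
    if (schema === undefined) {
        throw new Error(`Unable to find $ref ${target}`);
    }
    return schema;
}

const schemaA = {
    $id: "http://example.com/schema/schema_a",
    title: "schema a",
};
const schemaB = {
    $id: "http://example.com/schema/schema_b",
    $ref: "http://example.com/schema/schema_a",
    title: "schema b",
};

// Assumed retrieval URIs; only the $id values matter for the lookup.
const registry = buildRegistry([
    { retrievalUri: "file:///work/schemas/a.json", schema: schemaA },
    { retrievalUri: "file:///work/schemas/b.json", schema: schemaB },
]);

const found = lookupRef(schemaB.$ref, schemaB.$id, registry);
console.log(found.title); // schema a
```

Because the registry is filled before any lookup happens, the order in which the files are listed does not matter, which is also what handling circular references would require.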

This means, then, that:

So either Wetzel is doing something wrong on the resolution end, or it's just plain ignoring $id. Not sure, but whatever it is, it appears to be non-compliant.

javagl commented 2 years ago

So either Wetzel is doing something wrong on the resolution end, or it's just plain ignoring $id. Not sure, but whatever it is, it appears to be non-compliant.

It's both. As far as I know, the $id is not accessed anywhere in the codebase of wetzel at all, and even if it was, it would certainly not be used for any sort of resolution.

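I cannot point my finger at "the" reason. And I agree that we could consider to make wetzel more compliant to the specification in this regard. But some aspects to keep in mind:

  • It's complicated. The quoted statements like "implementations SHOULD understand ahead of time which schemas they will be using" and "Implementations SHOULD be able to associate arbitrary URIs with an arbitrary schema and/or automatically associate a schema's "$id"-given URI" would still leave me with the question: "What do 'understand' or 'associate' mean here, exactly, on the implementation level?".
  • Schemas are usually not published at the ID path. The overly naive way of phrasing this is "http://example.com/schema/a.json yields a 404, so why should that work, exactly?". In order to actually work and be resolvable, each schema has to be associated with a "base URI" from where 'ref' schemas actually can be resolved. And this base URI can basically never be obtained from the $id anyhow.
  • Things change rapidly and arbitrarily. Wetzel was started with JSON schema draft-03 or draft-04, and many things have changed (or been clarified) in the meantime. There have been considerable changes in the mechanisms behind $id and $ref even between draft 2019-09 and draft 2020-12 (and this makes you wonder whether there will ever be a 'JSON Schema 1.0.0 (final)' ...). Keeping up with the subtle changes between these drafts is challenging.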

These points may appear to be a bit shallow and handwaving. But maybe some background is relevant here: wetzel was mainly intended for generating the property reference for the glTF schema. The glTF schema uses IDs like "$id": "accessor.schema.json". So there wasn't so much effort put into implementing a 'JSON schema spec compliant resolution mechanism'. The focus is that it should "Work In Practice®". And at this point, the most important use case is that a $ref contains a file name (like in your "Case 3"), and this is resolved against whatever that file is supposed to refer to.

It may not be perfect in terms of spec compliance. But it works for glTF and other schemas.

An aside: In the refactored state that I pointed to in another issue, I tried to at least carry along some information about the 'base URI' together with the schema. This 'base URI' still consists of a 'directory name' in the current state, but at least, there is a structure for carrying that sort of information, which could either be derived from the $id or from the local file name. While still faaar from being perfect, it might be possible to come closer to the spec based on this state - see SchemaEntry.

JC3 commented 2 years ago

So either Wetzel is doing something wrong on the resolution end, or it's just plain ignoring $id. Not sure, but whatever it is, it appears to be non-compliant.

It's both. As far as I know, the $id is not accessed anywhere in the codebase of wetzel at all, and even if it was, it would certainly not be used for any sort of resolution.

I cannot point my finger at "the" reason. And I agree that we could consider to make wetzel more compliant to the specification in this regard. But some aspects to keep in mind:

  • It's complicated. The quoted statements like "implementations SHOULD understand ahead of time which schemas they will be using" and "Implementations SHOULD be able to associate arbitrary URIs with an arbitrary schema and/or automatically associate a schema's "$id"-given URI" would still leave me with the question: "What do 'understand' or 'associate' mean here, exactly, on the implementation level?".

Maybe you are overthinking it :). The meaning is straightforward and sensible, I think: Ahead of time = before resolving $ref references; understand = know what schemas are available already (e.g. Wetzel's -i option), associate = have the ability to look up an available schema given its canonical URI.

Really, it's pretty much the same set of informational requirements that would be needed to enable handling of circular references, except it would also include this information from some explicit list of available schemas (like -i) in addition to the schema-of-interest.


  • Schemas are usually not published at the ID path. The overly naive way of phrasing this is "http://example.com/schema/a.json yields a 404, so why should that work, exactly?". In order to actually work and be resolvable, each schema has to be associated with a "base URI" from where 'ref' schemas actually can be resolved. And this base URI can basically never be obtained from the $id anyhow.

I believe this may be the source of your reservations. The spec is very explicit about this matter:

From 8.2.1 (emphasis mine):

The "$id" keyword identifies a schema resource with its canonical [RFC6596] URI.

Note that this URI is an identifier and not necessarily a network locator. In the case of a network-addressable URL, a schema need not be downloadable from its canonical URI.

From 8.2.3 regarding resolution of references (emphasis mine):

The resolved URI produced by these keywords is not necessarily a network locator, only an identifier. A schema need not be downloadable from the address if it is a network-addressable URL, and implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI.

Therefore the premise of the question, "if retrieving a URL indicated by the $id yields a 404 then why should it work?", is not valid: no attempt to access a URL indicated by the $id should ever have been made, and attempts to access $refs by URL are only an optional plan B (see below). That aside, the answer is: because the spec is very clear that the $id defines the canonical URI and that schemas should be identifiable by said URI, and that these URIs (which may not even have addressable schemes like http/file) aren't required to be the retrieval URLs.

I am not sure which draft made that explicitly clear but it would've been around draft 7. Draft 4 is where the role of ID was clarified and the idea of a "resolution scope" was introduced. The resolution scope was never linked to the retrieval URL; clarification on the "network operation" point as well as the idea of internal vs. external references was added later, but the intent was there in 04.

Note also these are URIs, not URLs, after all. The difference is that a URI (uniform resource identifier) names a resource without necessarily giving it a location, while a URL (uniform resource locator) provides a path to obtain the resource.

As an aside, I think the most confusing bit is that it's just become almost ubiquitous to use the "http" scheme in arbitrary URIs, and so examples become misleading. Personally, I think that they should've registered e.g. a "schema" URI scheme and stuck with that. It's for that reason that I actually prefer to use "xri" for identifiers where possible rather than "http". In fact, I may make a proposal along those lines, or to at least switch some of the examples over to a different scheme.

Additionally, you write:

And this base URI can basically never be obtained from the $id anyhow.

In fact, $id is specifically where the base URI comes from (8.2); noting, with the above in mind, that the base URI need not be an addressable URL. :) The key point to remember there is that there is this assumption that the implementation "understands" where to obtain the referenced schemas (for example, from a map of the canonical URIs of a user-provided list of additional schemas [potentially falling back on an actual filesystem/network request, more on that below]).


  • Things change rapidly and arbitrarily. Wetzel was started with JSON schema draft-03 or draft-04, and many things have changed (or been clarified) in the meantime. There have been considerable changes in the mechanisms behind $id and $ref even between draft 2019-09 and draft 2020-12 (and this makes you wonder whether there will ever be a 'JSON Schema 1.0.0 (final)' ...). Keeping up with the subtle changes between these drafts is challenging.

Heh, yeah; and it doesn't help that they seem to be huge fans of "SHOULD" instead of "MUST"...

Still, the specification and behavior of "id" has been present since draft-03 and the basic definition "id identifies the schema, ref uses those ids in resolution" has never changed:

So, really, the modern behavior dates back to draft-04 or draft-05 (depending on what was taken to be implied in 04), and ids themselves go back to 03.

In other words, $id has always been supposed to identify a schema, and schemas have always been required to be addressable by their $id.


These points may appear to be a bit shallow and handwaving. But maybe some background is relevant here: wetzel was mainly intended for generating the property reference for the glTF schema. The glTF schema uses IDs like "$id": "accessor.schema.json". So there wasn't so much effort put into implementing a 'JSON schema spec compliant resolution mechanism'. The focus is that it should "Work In Practice®". And at this point, the most important use case is that a $ref contains a file name (like in your "Case 3"), and this is resolved against whatever that file is supposed to refer to.

And that would normally be totally fine -- Wetzel isn't obligated to do anything for strangers, heh -- except (and this is a lot of the motivation behind my post) that json-schema.org includes Wetzel in its documentation generator implementation list. Additionally, it states:

Wetzel: Generates Markdown and AsciiDoc. With some limitations, supports draft-3, draft-4, draft-7, and 2020-12.

Of course "some limitations" is completely reasonable, but schema IDs are a fundamental feature of JSON Schema. In my opinion, having a lack of ID support is somewhat (granted this is an exaggeration) like saying "Wetzel supports all versions of the spec, with some limitations" where "some limitations" includes inability to parse JSON. :)

IDs are in fact so fundamental that "identifier" is listed as one of the primitive keyword categories, described in 7.4 as:

Identifiers define URIs for a schema, or affect how such URIs are resolved in references (Section 8.2.3), or both. The Core vocabulary defined in this document defines several identifying keywords, most notably "$id".

So e.g. in the Wetzel readme: "Currently it accepts JSON Schema drafts 3, 4, 7, and 2020-12" could be seen as misleading: Since it's missing a fundamental feature, it could be compellingly argued that it doesn't accept any of those drafts, even "with limitations".

To be clear, my intent isn't to dish out negative criticism or make demands. What I mean is: Wetzel can obviously do whatever you want it or need it to do, but I strongly feel that if it's not going to be given some more compliant behavior, then it at least ought to be removed from json-schema.org's front page given its current level of compliance.


It may not be perfect in terms of spec compliance. But it works for glTF and other schemas.

An aside: In the refactored state that I pointed to in another issue, I tried to at least carry along some information about the 'base URI' together with the schema. This 'base URI' still consists of a 'directory name' in the current state, but at least, there is a structure for carrying that sort of information, which could either be derived from the $id or from the local file name. While still faaar from being perfect, it might be possible to come closer to the spec based on this state - see SchemaEntry.

That's definitely helpful. Essentially, the following URIs are defined:

And resolution is performed by:

  1. Resolve a given reference URI to an absolute URI using the current context's base URI.
  2. Look up the associated schema in the map of available canonical URIs (and of course handle fragments as paths into those schemas here).
  3. And finally: Technically just a suggestion -- If no schema with the given canonical URI is available, and the URI happens to be some addressable URL (e.g. file, http), then treat it as an external reference and attempt to retrieve the schema (with appropriate security considerations of course).
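As a rough sketch of those three steps, including fragments and the optional fallback: everything here is hypothetical (`followPointer`, `resolveReference`, and the synchronous `loadMissing` callback are invented names, and the `https://example.com/...` URIs are placeholders). Only JSON Pointer fragments are handled; plain-name fragments are rejected rather than guessed at.

```javascript
// Walk a JSON Pointer such as "/definitions/example" into a schema.
function followPointer(schema, pointer) {
    if (pointer !== "" && !pointer.startsWith("/")) {
        throw new Error(`Unsupported fragment: #${pointer}`);
    }
    return pointer
        .split("/")
        .slice(1)
        // Undo URI percent-encoding, then the JSON Pointer escapes (~1, ~0).
        .map((t) => decodeURIComponent(t).replace(/~1/g, "/").replace(/~0/g, "~"))
        .reduce((node, token) => (node == null ? undefined : node[token]), schema);
}

function resolveReference(ref, baseUri, registry, loadMissing) {
    // Step 1: resolve the reference to an absolute URI.
    const target = new URL(ref, baseUri);
    const fragment = target.hash.slice(1);
    target.hash = "";

    // Step 2: look the document up by canonical URI.
    let document = registry.get(target.href);

    // Step 3 (optional): let the caller decide how to fetch missing schemas.
    if (document === undefined && loadMissing !== undefined) {
        document = loadMissing(target.href);
        if (document !== undefined) {
            registry.set(target.href, document);
        }
    }
    if (document === undefined) {
        throw new Error(`Unable to find $ref ${target.href}`);
    }
    return followPointer(document, fragment);
}

const registry = new Map([
    ["https://example.com/foo", { definitions: { example: { type: "string" } } }],
]);
const loaded = [];
const loadMissing = (uri) => {
    loaded.push(uri);
    return { title: "loaded on demand" };
};

const viaRegistry = resolveReference(
    "foo#/definitions/example", "https://example.com/bar", registry, loadMissing);
const viaLoader = resolveReference(
    "other", "https://example.com/bar", registry, loadMissing);

console.log(viaRegistry.type); // string
console.log(viaLoader.title);  // loaded on demand
console.log(loaded);           // [ 'https://example.com/other' ]
```

Keeping retrieval behind a callback means the lookup itself never touches the filesystem or network; whether a missing URI is ever fetched stays a decision of the caller.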

Much of the above is actually defined in the URI RFC.

Incidentally, AJV's loadSchema callback is specifically provided to address step 3 in a compliant way: It allows implementations to determine how to resolve missing references.

Also note that the specific basic case of ...

... still works fine under the full resolution scheme, as all reference URIs would resolve to absolute URIs in the filesystem since root schema base URIs are their retrieval URIs when no $id is specified. This also still works even without a list of additional schemas: the implementation can choose to attempt to retrieve missing references, which here would just mean a perfectly fine filesystem read.
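A small sketch of that fallback, assuming Node's `pathToFileURL` from the `url` module for turning a local path into a retrieval URI; `baseUriOf` and the `schemas/b.json` path are made up for illustration.

```javascript
const { pathToFileURL } = require("node:url");

// With no $id, the retrieval URI of the root schema doubles as its base URI.
function baseUriOf(schema, filePath) {
    const retrievalUri = pathToFileURL(filePath).href;
    return schema.$id
        ? new URL(schema.$id, retrievalUri).href
        : retrievalUri;
}

// A schema without $id that refers to a sibling file by name.
const schemaB = { title: "schema b", $ref: "a.json" };
const base = baseUriOf(schemaB, "schemas/b.json");
const target = new URL(schemaB.$ref, base).href;

// Prints a file: URI ending in /schemas/a.json (the prefix depends on the
// current working directory).
console.log(target);
```

The relative filename reference ends up as an ordinary `file:` URI next to the referring schema, so a filesystem read of that location is all the fallback needs.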


glTF

As for glTF, those schemas are technically non-compliant with the current draft.

In particular, note that 8.2.1.1 specifically calls for all root schema documents to not only contain an $id, but one that is itself an absolute URI:

The root schema of a JSON Schema document SHOULD contain an "$id" keyword with an absolute-URI [RFC3986] (containing a scheme, but no fragment).

Not only is that technically recommended, but in practice it becomes important in complex environments: If the glTF schema exists on a system with many other schemas and applications, then it is important for the glTF schemas to have absolute identifiers -- that is, those schemas can be referenced by IDs that do not depend on their location. There are many reasons for this that I won't go into since this is pretty long already. Also, in addition to location-independence, absolute URIs also of course provide namespacing.

Now, the thing here is: While you can currently say, "well, it works" (and that's fine), I could very reasonably go over to the glTF issue page and request that their schemas be given absolute URIs (don't worry, I won't, that'd be kind of a dick move given the current conversation, 😂). This request would be entirely justified and theoretically easy to implement; but it would not be possible given Wetzel's compliance level. The thing is: the glTF schema in its current form works with Wetzel and if it works, it works; but, otoh, the glTF schema will always remain in a form that coincides with Wetzel's limitations, because it would be silly to break a working system. That is, if you were to say "the primary motivation to update Wetzel is to keep up with glTF schema support", that equates to not being a motivation to update Wetzel, as the glTF schema is unlikely to change in a way that would require Wetzel to be updated (the path of least resistance is to just force glTF into Wetzel-compliant form).


Anyways... the TL;DR is that IDs are pretty fundamental and have been in the spec for a while, and even though Wetzel + glTF can work together without them, it would greatly improve Wetzel's usability outside of glTF and the most basic schemas.

javagl commented 2 years ago

And I thought that my issue comments were long 😌

Maybe you are overthinking it :). The meaning is straightforward and sensible

I might be overthinking this, but I have seen too many effects of 'underthinking', and this may just be a countermeasure. If you think that you can implement "The Right Solution®", then feel free to open a pull request. As long as the updated state is still generating the same output for glTF, the repository maintainers will probably be willing to merge it.

But if you try, just a word of warning:

know what schemas are available already (e.g. Wetzel's -i option),

The -i option is totally unrelated to the question of which schemas are 'known'. Its sole purpose is to not include these schemas (i.e. their types) in the 'Table Of Contents'. The -s option might be more closely related to that: It contains a 'search path'. But ... it's difficult (and you will have a hard time convincing me otherwise). There is a DRAFT PR at https://github.com/CesiumGS/wetzel/pull/71 for supporting multiple search paths, but this raises a bunch of questions (most obviously, how to deal with ambiguities, related to the fact that the $id is not really used as an 'ID' in wetzel...)


Therefore the premise of the question, "if retrieving a URL indicated by the $id yields a 404 then why should it work?", is not valid: .... for example, from a map of the canonical URIs of a user-provided list of additional schemas [potentially falling back on an actual filesystem/network request, more on that below])

I'm roughly (!) aware of some of these caveats. I occasionally looked at https://json-schema.org/understanding-json-schema/structuring.html , which explains some of these concepts in a slightly less formal way than the specs that you linked to (but I won't claim to have thoroughly understood all that, and admit that I did not read the technical version of the specs and all the RFCs that are necessary to really understand that).

My (somewhat shallow) understanding seems to be in line with what you said in a more profound and elaborate form. Roughly:

So this still leaves the question open about where and how exactly a $ref should be resolved. What is the actual URI for resolving a $ref? Yes, implementations SHOULD 'know' that....

const baseUrl = MagicalUrlFairy.whereAreWe();

but doing that in a 'spec-compliant' way that works in all cases that are covered by the spec, and (!) in all cases that appear in the real world can be difficult. Imagine you find a real-world schema that contains a $ref like

"$ref": "example.json"

You could argue that this is wrong, and it should use a proper ID (and that's correct). But that's not what's happening. So where, exactly, is the example.json schema file? That depends on where you found the schema that contains this $ref, and in the case of wetzel, this may have been found in one of the 'search paths' that have been given at the command line...

An aside: All this does not yet address the issue of fragments in $refs. Covering the cases of

"$ref": "#example"
"$ref": "#/definitions/example"
"$ref": "foo.schema.json#/definitions/example"
"$ref": "https://example.com/foo#/definitions/example"

is not entirely trivial. (Some related code is in some branch, but again: This is faaar from perfect - it just 'worked for me', as far as I needed it...)
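For illustration only, the four forms can at least be split into "which document" and "which part of it" with the built-in `URL` class. This is a sketch, not the code from the branch mentioned above: `describeRef` is a hypothetical helper, and the base URI `https://example.com/schemas/base.schema.json` is an assumption. Actually locating a plain-name fragment (which draft 2020-12 declares via `$anchor`) is left out.

```javascript
// Split a $ref into the target document URI and the fragment kind.
function describeRef(ref, baseUri) {
    const resolved = new URL(ref, baseUri);
    const fragment = decodeURIComponent(resolved.hash.slice(1));
    resolved.hash = "";

    let kind;
    if (fragment === "") {
        kind = "whole document";
    } else if (fragment.startsWith("/")) {
        kind = "JSON Pointer";
    } else {
        kind = "plain-name fragment";
    }
    return { document: resolved.href, fragment, kind };
}

// Assumed base URI of the schema that contains the references.
const base = "https://example.com/schemas/base.schema.json";
const refs = [
    "#example",
    "#/definitions/example",
    "foo.schema.json#/definitions/example",
    "https://example.com/foo#/definitions/example",
];

const described = refs.map((ref) => describeRef(ref, base));
for (const entry of described) {
    console.log(`${entry.kind}: ${entry.document} [${entry.fragment}]`);
}
```

The first two forms stay within the base document, the third resolves to a sibling of it, and the fourth ignores the base entirely; only the fragment kind then decides how to navigate inside the document.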


You seem to read the specs on a more detailed level than I do. So maybe I can throw in that random question here, which I carved out as some sort of "quiz". Consider the following schema:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "base.schema.json",
    "title": "Base",
    "type": "object",

    "definitions": {
        "example": {
            "type": "string"
        }
    },

    "additionalProperties": {
        "$ref": "#/definitions/example"
    }

}

It defines definitions/example to be of type string.

Now consider this one, "extending" it (even though there is no real 'inheritance' going on) :

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "extended.schema.json",
    "title": "Extended",
    "type": "object",
    "$ref": "base.schema.json",

    "definitions": {
        "example": {
            "type": "number"
        }
    },

    "additionalProperties": {
        "$ref": "#/definitions/example"
    }
}

It defines definitions/example to be of type number.

Question 1: What type may the additional properties have so that they conform to the second schema?

Question 2: Are you sure about your answer to question 1?


Still, the specification and behavior of "id" has been present since draft-03

I have read this 'change log', but admittedly, I will not read through each link. But a big 👍 for that nevertheless, because I might take a closer look at the links when this becomes immediately relevant for my work, and in any case, it is a useful overview (maybe for the case that someone wants to support multiple draft versions).

To summarize it, subjectively: The id/$id (sic!) was always present, but important details have changed considerably (or at least, been clarified) throughout the draft versions. If somebody was supposed to implement something like wetzel from scratch, 'on a green field', it would be far easier to look at the latest spec and follow it diligently. Or to directly address the bottom line:

In other words, $id has always been supposed to identify a schema, and schemas have always been required to be addressable by their $id.

This has never been followed in glTF, and it was never implemented in wetzel.


Of course "some limitations" is completely reasonable, but schema IDs are a fundamental feature of JSON Schema. ... To be clear, my intent isn't to dish out negative criticism or make demands. What I mean is: Wetzel can obviously do whatever you want it or need it to do, but I strongly feel that if it's not going to be given some more compliant behavior, then it at least ought to be removed from json-schema.org's front page given its current level of compliance.

That's all fine for me. I'm also only a user of wetzel. It does what it was intended for, but there are many aspects of the JSON schema that it never handled correctly, and many aspects that it never handled at all (roughly: because it wasn't necessary for glTF).

Or to put it that way: Wetzel MUST SHOULD be improved in many ways.


That's definitely helpful. Essentially, the following URIs are defined: ... And resolution is performed by: ...

I went through some of these steps/approaches while I tried to use wetzel for a more complex schema. I originally tried to do these changes incrementally, in a somewhat backward-compatible way. But at some point, I had to 'burn some bridges', because the necessary changes completely changed the original implementation, and of course, the refactored state is still far from perfect, and vastly different from something that one could do when...

I also considered using the $id for actual lookups (i.e. as a real identifier), but considering that this is not sufficient for actually resolving a $ref, one still has to carry along the "actual base URL" together with the $id (i.e. one of the "search paths"). Some of that is addressed in the SchemaRepository of the refactored state, but not in a deeply spec-compliant way.


Incidentally, AJV's loadSchema callback ...

I occasionally looked at AJV. It is a project with ~11000 stars, ~2600 commits, ~150 releases, billion-dollar companies as sponsors, 180 contributors, (and still, 169 open issues and 29 pending pull requests). It's an entirely different category of project than wetzel. One may find some "inspiration" there, in terms of spec-compliant handling of details like $id and $ref. But carving out the relevant parts (and translating them to JavaScript) does not seem to be a reasonable approach - and even if someone did that: If the result was something that couldn't re-generate the glTF spec, verbatim, then it would be moot...

As for glTF, those schemas are technically non-compliant with the current draft. ... I could very reasonably go over to the glTF issue page and request that their schemas be given absolute URIs (don't worry, I won't, that'd be kind of a dick move given the current conversation, 😂).

I just did that dick move: https://github.com/KhronosGroup/glTF/issues/2182 . It is a valid point, so why not. The fact that glTF and wetzel are somewhat "coupled" should not prevent improvements on either side. But even when the $id values in glTF are changed, this will not immediately affect wetzel. As I said: The $id is not used at all right now, so any way of taking it into account would require a considerable refactoring.

JC3 commented 2 years ago

I used to be self-conscious about my long comments, but now it's just 🤷‍♂️, haha. I actually edited it down.

Anyways, I will 100% read your reply and address what I can; at the moment I accidentally went down a bit of a rabbit hole. You might be interested in the active conversations at https://github.com/orgs/json-schema-org/discussions/197 -- in particular the thread starting from here.

javagl commented 2 years ago

I skimmed over that thread, but may have to re-read it (and some of the spec references mentioned here and there) to get a clearer picture.

A very high-level recommendation seems to be: "Rely on the $id (and sort out the "retrieval" independently of that)".

That certainly could simplify some structures and the implementation tremendously (sorting out actual responsibilities of code paths - divide et impera). And as I mentioned above: I tried that, to some extent - essentially, to populate the SchemaRepository from the refactored state, and then only use the $id for lookups in this repository.

But (with a bit of handwaving: ) given the lack of 'proper' IDs in real schemas, and the existing lookup mechanisms in wetzel, and the difficulties of sorting out and carrying along actual 'retrieval URIs', possible ambiguities for definitions, and (of course, the blanket excuse: ) a lack of time, I ended up with a state that is not ideal (as it only uses a real URL instead of the $id for lookups), but that worked for my purposes...