Open JC3 opened 2 years ago
Couple new points:

Case 1:
"$id": "http://example.com/schema/a.json"
"$id": "http://example.com/schema/b.json"
"$ref": "http://example.com/schema/a.json"

Case 2:
"$id": "http://example.com/schema/schema_a"
"$id": "http://example.com/schema/schema_b"
"$ref": "schema_a"

Case 3:
"$id": "http://example.com/schema/schema_a"
"$id": "http://example.com/schema/schema_b"
"$ref": "a.json"

(Note: Neither -p nor -s are relevant here. Also, as a sanity check to make sure it wasn't trying to resolve network resources, I tried all of the above with the xri scheme instead of http, and it did not affect the outcomes.)
In other words, the only working scenario is if the $ref is relative and matches the local filename, and the $ids are ignored.
This is very much not correct resolution behavior.
From 8.2.1 (emphasis mine; first two quoted paras included to give context to third):
The "$id" keyword identifies a schema resource with its canonical [RFC6596] URI. ... If present, the value for this keyword ... MUST represent a valid URI-reference [RFC3986] ... and MUST resolve to an absolute-URI. ... The absolute-URI also serves as the base URI for relative URI-references in keywords within the schema resource, in accordance with RFC 3986 section 5.1.1 [RFC3986] regarding base URIs embedded in content.
And 9.1.1 echoes this:
Unless the "$id" keyword described in an earlier section is present in the root schema, this base URI SHOULD be considered the canonical URI of the schema document's root schema resource.
Also, from 9.1.2 (emphasis mine):
The use of URIs to identify remote schemas does not necessarily mean anything is downloaded, but instead JSON Schema implementations SHOULD understand ahead of time which schemas they will be using, and the URIs that identify them. ... Implementations SHOULD be able to associate arbitrary URIs with an arbitrary schema and/or automatically associate a schema's "$id"-given URI, depending on the trust that the validator has in the schema. Such URIs and schemas can be supplied to an implementation prior to processing instances, or may be noted within a schema document as it is processed, producing associations as shown in appendix A.
Furthermore, from 9.2 (emphasis mine):
Schemas can be identified by any URI that has been given to them, including a JSON Pointer or their URI given directly by "$id". In all cases, dereferencing a "$ref" reference involves first resolving its value as a URI reference against the current base URI per RFC 3986 [RFC3986].
If the resulting URI identifies a schema within the current document, or within another schema document that has been made available to the implementation, then that schema SHOULD be used automatically.
In other words, unless there is a valid reason in some specific circumstance, resolution (for local files) is supposed to work like this:
And the implication is that the documents should be loaded first, before $ref'ed schemas are resolved (or at least, in an appropriate order, circular refs notwithstanding), so that the canonical URIs can be determined and mapped to the appropriate [sub]schemas.
This means, then, that "$ref": "a.json" (which should resolve to http://example.com/schema/a.json) does not identify the URI of that file (which is in fact http://example.com/schema/schema_a despite the filename). So either Wetzel is doing something wrong on the resolution end, or it's just plain ignoring $id. Not sure, but whatever it is, it appears to be non-compliant.
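The reference resolution described above can be checked mechanically. This is a minimal sketch, assuming Node's built-in WHATWG URL class is an acceptable stand-in for RFC 3986 reference resolution; resolveRef and knownIds are names made up for illustration and are not part of Wetzel.

```javascript
// Resolve a $ref value against the base URI of the schema that contains it.
// `resolveRef` is a hypothetical helper, not a Wetzel function.
function resolveRef(ref, baseUri) {
  return new URL(ref, baseUri).href;
}

// Both schemas carry an $id that differs from their file name.
const knownIds = new Set([
  "http://example.com/schema/schema_a",
  "http://example.com/schema/schema_b",
]);
const containingSchemaBase = "http://example.com/schema/schema_b";

// A $ref that names the file resolves to a URI nobody declared.
const byFileName = resolveRef("a.json", containingSchemaBase);
// A $ref that names the $id resolves to a declared canonical URI.
const byId = resolveRef("schema_a", containingSchemaBase);

console.log(byFileName, knownIds.has(byFileName)); // http://example.com/schema/a.json false
console.log(byId, knownIds.has(byId)); // http://example.com/schema/schema_a true
```

Under these assumptions, a spec-following resolver would accept the reference by $id and reject the reference by file name, which is the opposite of the behavior reported here.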
So either Wetzel is doing something wrong on the resolution end, or it's just plain ignoring $id. Not sure, but whatever it is, it appears to be non-compliant.
It's both. As far as I know, the $id is not accessed anywhere in the codebase of wetzel at all, and even if it was, it would certainly not be used for any sort of resolution.
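I cannot point my finger at "the" reason. And I agree that we could consider making wetzel more compliant with the specification in this regard. But some aspects to keep in mind:
- It's complicated. The quoted statements like "implementations SHOULD understand ahead of time which schemas they will be using" and "Implementations SHOULD be able to associate arbitrary URIs with an arbitrary schema and/or automatically associate a schema's "$id"-given URI" would still leave me with the question: "What do 'understand' or 'associate' mean here, exactly, on the implementation level?".
- Schemas are usually not published at the ID path. The overly naive way of phrasing this is " http://example.com/schema/a.json yields a 404, so why should that work, exactly?". In order to actually work and be resolvable, each schema has to be associated with a "base URI" from where 'ref' schemas actually can be resolved. And this base URI can basically never be obtained from the $id anyhow.
- Things change rapidly and arbitrarily. Wetzel was started with JSON schema draft-03 or draft-04, and many things have changed (or been clarified) in the meantime. There have been considerable changes in the mechanisms behind $id and $ref even between draft 2019-09 and draft 2020-12 (and this makes you wonder whether there will ever be a 'JSON Schema 1.0.0 (final)' ...). Keeping up with the subtle changes between these drafts is challenging.
These points may appear to be a bit shallow and handwaving. But maybe some background is relevant here: wetzel was mainly intended for generating the property reference for the glTF schema. The glTF schema uses IDs like "$id": "accessor.schema.json". So there wasn't so much effort put into implementing a 'JSON schema spec compliant resolution mechanism'. The focus is that it should "Work In Practice®". And at this point, the most important use case is that a $ref contains a file name (like in your "Case 3"), and this is resolved against whatever that file is supposed to refer to.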
It may not be perfect in terms of spec compliance. But it works for glTF and other schemas.
An aside: In the refactored state that I pointed to in another issue, I tried to at least carry along some information about the 'base URI' together with the schema. This 'base URI' still consists of a 'directory name' in the current state, but at least, there is a structure for carrying that sort of information, which could either be derived from the $id or from the local file name. While still faaar from being perfect, it might be possible to come closer to the spec based on this state - see SchemaEntry.
So either Wetzel is doing something wrong on the resolution end, or it's just plain ignoring $id. Not sure, but whatever it is, it appears to be non-compliant.
It's both. As far as I know, the $id is not accessed anywhere in the codebase of wetzel at all, and even if it was, it would certainly not be used for any sort of resolution.
I cannot point my finger at "the" reason. And I agree that we could consider to make wetzel more compliant to the specification in this regard. But some aspects to keep in mind:
- It's complicated. The quoted statements like "implementations SHOULD understand ahead of time which schemas they will be using" and "Implementations SHOULD be able to associate arbitrary URIs with an arbitrary schema and/or automatically associate a schema's "$id"-given URI" would still leave me with the question: "What do 'understand' or 'associate' mean here, exactly, on the implementation level?".
Maybe you are overthinking it :). The meaning is straightforward and sensible, I think: Ahead of time = before resolving $ref references; understand = know what schemas are available already (e.g. Wetzel's -i option), associate = have the ability to look up an available schema given its canonical URI.
Really, it's pretty much the same set of informational requirements that would be needed to enable handling of circular references, except it would also include this information from some explicit list of available schemas (like -i) in addition to the schema-of-interest.
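Read that way, "understand" and "associate" boil down to a lookup table that is filled before any $ref is followed. A rough sketch, assuming the schemas arrive as already-parsed objects with absolute $id values; buildRegistry and lookup are hypothetical names, and nothing here reflects how Wetzel's command-line options behave.

```javascript
// Phase 1 ("ahead of time"): register every available schema under its canonical URI.
function buildRegistry(schemas) {
  const registry = new Map();
  for (const schema of schemas) {
    if (typeof schema.$id !== "string") continue; // no canonical URI to register
    if (registry.has(schema.$id)) {
      throw new Error(`Duplicate $id: ${schema.$id}`);
    }
    registry.set(schema.$id, schema);
  }
  return registry;
}

// Phase 2 ("associate"): look up an already-resolved, absolute URI.
// No network or filesystem access happens here.
function lookup(registry, absoluteUri) {
  return registry.get(absoluteUri);
}

const availableSchemas = buildRegistry([
  { $id: "http://example.com/schema/schema_a", type: "string" },
  { $id: "http://example.com/schema/schema_b", $ref: "schema_a" },
]);
const schemaA = lookup(availableSchemas, "http://example.com/schema/schema_a");
console.log(schemaA.type); // string
```

Because the table is complete before resolution starts, circular references need no special handling at lookup time: both ends of the cycle are already registered.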
- Schemas are usually not published at the ID path. The overly naive way of phrasing this is " http://example.com/schema/a.json yields a 404, so why should that work, exactly?". In order to actually work and be resolvable, each schema has to be associated with a "base URI" from where 'ref' schemas actually can be resolved. And this base URI can basically never be obtained from the $id anyhow.
I believe this may be the source of your reservations. The spec is very explicit about this matter:
From 8.2.1 (emphasis mine):
The "$id" keyword identifies a schema resource with its canonical [RFC6596] URI.
Note that this URI is an identifier and not necessarily a network locator. In the case of a network-addressable URL, a schema need not be downloadable from its canonical URI.
From 8.2.3 regarding resolution of references (emphasis mine):
The resolved URI produced by these keywords is not necessarily a network locator, only an identifier. A schema need not be downloadable from the address if it is a network-addressable URL, and implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI.
Therefore the premise of the question, "if retrieving a URL indicated by the $id yields a 404 then why should it work?", is not valid: no attempt to access a URL indicated by the $id should ever have been made, and attempts to access $refs by URL are only an optional plan B (see below). That aside, the answer is: because the spec is very clear that the $id defines the canonical URI and that schemas should be identifiable by said URI, and that these URIs (which may not even have addressable schemes like http/file) aren't required to be the retrieval URLs.
I am not sure which draft made that explicitly clear but it would've been around draft 7. Draft 4 is where the role of ID was clarified and the idea of a "resolution scope" was introduced. The resolution scope was never linked to the retrieval URL; clarification on the "network operation" point as well as the idea of internal vs. external references was added later, but the intent was there in 04.
Note also these are URIs, not URLs, after all. The difference is that a URI (uniform resource identifier) names a resource without necessarily giving it a location, while a URL (uniform resource locator) provides a path to obtain the resource.
As an aside, I think the most confusing bit is that it's just become almost ubiquitous to use the "http" scheme in arbitrary URIs, and so examples become misleading. Personally, I think that they should've registered e.g. a "schema" URI scheme and stuck with that. It's for that reason that I actually prefer to use "xri" for identifiers where possible rather than "http". In fact, I may make a proposal along those lines, or to at least switch some of the examples over to a different scheme.
Additionally, you write:
And this base URI can basically never be obtained from the $id anyhow.
In fact, $id is specifically where the base URI comes from (8.2); noting, with the above in mind, that the base URI need not be an addressable URL. :) The key point to remember there is that there is this assumption that the implementation "understands" where to obtain the referenced schemas (for example, from a map of the canonical URIs of a user-provided list of additional schemas [potentially falling back on an actual filesystem/network request, more on that below]).
- Things change rapidly and arbitrarily. Wetzel was started with JSON schema draft-03 or draft-04, and many things have changed (or been clarified) in the meantime. There have been considerable changes in the mechanisms behind $id and $ref even between draft 2019-09 and draft 2020-12 (and this makes you wonder whether there will ever be a 'JSON Schema 1.0.0 (final)' ...). Keeping up with the subtle changes between these drafts is challenging.
Heh, yeah; and it doesn't help that they seem to be huge fans of "SHOULD" instead of "MUST"...
Still, the specification and behavior of "id" has been present since draft-03 and the basic definition "id identifies the schema, ref uses those ids in resolution" has never changed:
So, really, the modern behavior dates back to draft-04 or draft-05 (depending on what was taken to be implied in 04), and ids themselves go back to 03.
In other words, $id has always been supposed to identify a schema, and schemas have always been required to be addressable by their $id.
These points may appear to be a bit shallow and handwaving. But maybe some background is relevant here: wetzel was mainly intended for generating the property reference for the glTF schema. The glTF schema uses IDs like "$id": "accessor.schema.json". So there wasn't so much effort put into implementing a 'JSON schema spec compliant resolution mechanism'. The focus is that it should "Work In Practice®". And at this point, the most important use case is that a $ref contains a file name (like in your "Case 3"), and this is resolved against whatever that file is supposed to refer to.
And that would normally be totally fine -- Wetzel isn't obligated to do anything for strangers, heh -- except (and this is a lot of the motivation behind my post) that json-schema.org includes Wetzel in its documentation generator implementation list. Additionally, it states:
Wetzel: Generates Markdown and AsciiDoc. With some limitations, supports draft-3, draft-4, draft-7, and 2020-12.
Of course "some limitations" is completely reasonable, but schema IDs are a fundamental feature of JSON Schema. In my opinion, having a lack of ID support is somewhat (granted this is an exaggeration) like saying "Wetzel supports all versions of the spec, with some limitations" where "some limitations" includes inability to parse JSON. :)
IDs are in fact so fundamental the "identifier" is listed as one of the primitive keyword categories. Described in 7.4 as:
Identifiers define URIs for a schema, or affect how such URIs are resolved in references (Section 8.2.3), or both. The Core vocabulary defined in this document defines several identifying keywords, most notably "$id".
So e.g. in the Wetzel readme: "Currently it accepts JSON Schema drafts 3, 4, 7, and 2020-12" could be seen as misleading: Since it's missing a fundamental feature, it could be compellingly argued that it doesn't accept any of those drafts, even "with limitations".
To be clear, my intent isn't to dish out negative criticism or make demands. What I mean is: Wetzel can obviously do whatever you want it or need it to do, but I strongly feel that if it's not going to be given some more compliant behavior, then it at least ought to be removed from json-schema.org's front page given its current level of compliance.
It may not be perfect in terms of spec compliance. But it works for glTF and other schemas.
An aside: In the refactored state that I pointed to in another issue, I tried to at least carry along some information about the 'base URI' together with the schema. This 'base URI' still consists of a 'directory name' in the current state, but at least, there is a structure for carrying that sort of information, which could either be derived from the $id or from the local file name. While still faaar from being perfect, it might be possible to come closer to the spec based on this state - see SchemaEntry.
That's definitely helpful. Essentially, the following URIs are defined:
- [...] $id is present, then its $id resolved against the retrieval URI (which is significant if $id is not absolute). If $id isn't present then it's just the retrieval URI (btw, 2020-12 8.2.1.1 recommends all root schemas to contain an $id, with an absolute URI no less, although historically that wasn't always recommended).
- [...] $id, the subschema's base URI is its $id resolved against the parent schema's base URI. Note that subschemas with an explicit $id should then be treated as distinct schema documents (i.e. also added to the implementation's URI -> available schema mapping).
And resolution is performed by:
Much of the above is actually defined in the URI RFC.
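The base-URI rules above can be sketched as a recursive walk. This is a simplified illustration under stated assumptions: it descends into every object value and treats it as a potential subschema (a real implementation would follow keyword semantics and skip instance data such as enum or const values), it uses Node's URL class for resolution, and collectSchemaResources is an invented name.

```javascript
// Walk a schema document and record every schema resource under its canonical URI.
// A (sub)schema's base URI is its $id resolved against the parent's base URI,
// or simply the parent's base URI when no $id is present.
function collectSchemaResources(rootSchema, retrievalUri) {
  const resources = new Map();

  function visit(node, parentBaseUri, isRoot) {
    if (node === null || typeof node !== "object") return;
    let baseUri = parentBaseUri;
    if (typeof node.$id === "string") {
      baseUri = new URL(node.$id, parentBaseUri).href;
      resources.set(baseUri, node);
    } else if (isRoot) {
      // Root schema without $id: it is identified by its retrieval URI.
      resources.set(baseUri, node);
    }
    for (const child of Object.values(node)) {
      visit(child, baseUri, false);
    }
  }

  visit(rootSchema, retrievalUri, true);
  return resources;
}

const resources = collectSchemaResources(
  {
    $id: "root.json",
    $defs: { item: { $id: "item.json", type: "string" } },
  },
  "file:///schemas/root.json"
);
// Registered keys: file:///schemas/root.json and file:///schemas/item.json
console.log([...resources.keys()]);
```

The returned map is what a lookup-by-canonical-URI step would consult; the subschema with its own $id is registered as a separate resource, as described above.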
Incidentally, AJV's loadSchema callback is specifically provided to address step 3 in a compliant way: It allows implementations to determine how to resolve missing references.
Also note that the specific basic case of no $id in the root schema (or an $id that is the document's relative filename) still works fine under the full resolution scheme, as all reference URIs would resolve to absolute URIs in the filesystem since root schema base URIs are their retrieval URIs when no $id is specified. This also still works even without a list of additional schemas: the implementation can choose to attempt to retrieve missing references, which here would just mean a perfectly fine filesystem read.
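The "registry first, retrieval as an optional plan B" idea can be sketched as follows. The names resolveSchema and loadMissing are hypothetical, and this is not AJV's API, only an illustration of the same idea; the loader is a stub backed by an in-memory object so the example runs without touching the disk.

```javascript
// Look up a $ref target: known schemas first, retrieval only as an optional fallback.
function resolveSchema(ref, baseUri, registry, loadMissing) {
  const uri = new URL(ref, baseUri);
  uri.hash = ""; // fragments are handled separately
  const key = uri.href;
  if (registry.has(key)) return registry.get(key);
  if (typeof loadMissing !== "function") {
    throw new Error(`Unknown schema: ${key}`);
  }
  const loaded = loadMissing(key); // e.g. a filesystem read for file: URIs
  registry.set(key, loaded);
  return loaded;
}

// Stand-in for the filesystem: b.json has no $id, so its base URI is its retrieval URI.
const fakeDisk = { "file:///schemas/a.json": { type: "string" } };
const requestedUris = [];
const stubLoader = (uri) => {
  requestedUris.push(uri);
  if (!(uri in fakeDisk)) throw new Error(`Not found: ${uri}`);
  return fakeDisk[uri];
};

const schemaCache = new Map();
const loadedA = resolveSchema("a.json", "file:///schemas/b.json", schemaCache, stubLoader);
console.log(loadedA.type); // string
```

A second call for the same reference is served from the registry, so the loader is only consulted for URIs that are genuinely unknown.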
As for glTF, those schemas are technically non-compliant with the current draft.
In particular, note that 8.2.1.1 specifically calls for all root schema documents to not only contain an $id, but one that is itself an absolute URI:
The root schema of a JSON Schema document SHOULD contain an "$id" keyword with an absolute-URI [RFC3986] (containing a scheme, but no fragment).
Not only is that technically recommended, but in practice it becomes important in complex environments: If the glTF schema exists on a system with many other schemas and applications, then it is important for the glTF schemas to have absolute identifiers -- that is, those schemas can be referenced by IDs that do not depend on their location. There are many reasons for this that I won't go into since this is pretty long already. Also, in addition to location-independence, absolute URIs also of course provide namespacing.
Now, the thing here is: While you can currently say, "well, it works" (and that's fine), I could very reasonably go over to the glTF issue page and request that their schemas be given absolute URIs (don't worry, I won't, that'd be kind of a dick move given the current conversation, 😂). This request would be entirely justified and theoretically easy to implement; but it would not be possible given Wetzel's compliance level. The thing is: the glTF schema in its current form works with Wetzel and if it works, it works; but, otoh, the glTF schema will always remain in a form that coincides with Wetzel's limitations, because it would be silly to break a working system. That is, if you were to say "the primary motivation to update Wetzel is to keep up with glTF schema support", that equates to not being a motivation to update Wetzel, as the glTF schema is unlikely to change in a way that would require Wetzel to be updated (the path of least resistance is to just force glTF into Wetzel-compliant form).
Anyways... the TL;DR is that IDs are pretty fundamental and have been in the spec for a while, and even though Wetzel + glTF can work together without them, it would greatly improve Wetzel's usability outside of glTF and the most basic schemas.
And I thought that my issue comments were long 😌
Maybe you are overthinking it :). The meaning is straightforward and sensible
I might be overthinking this, but I have seen too many effects of 'underthinking', and this may just be a countermeasure. If you think that you can implement "The Right Solution®", then feel free to open a pull request. As long as the updated state is still generating the same output for glTF, the repository maintainers will probably be willing to merge it.
But if you try, just a word of warning:
know what schemas are available already (e.g. Wetzel's -i option),
The -i option is totally unrelated to the question which schemas are 'known'. Its sole purpose is to not include these schemas (i.e. their types) in the 'Table Of Contents'. The -s option might be closer related to that: It contains a 'search path'. But ... it's difficult (and you will have a hard time convincing me otherwise). There is a DRAFT PR at https://github.com/CesiumGS/wetzel/pull/71 for supporting multiple search paths, but this raises a bunch of questions (most obviously, how to deal with ambiguities, related to the fact that the $id is not really used as an 'ID' in wetzel...)
Therefore the premise of the question, "if retrieving a URL indicated by the $id yields a 404 then why should it work?", is not valid: .... for example, from a map of the canonical URIs of a user-provided list of additional schemas [potentially falling back on an actual filesystem/network request, more on that below])
I'm roughly (!) aware of some of these caveats. I occasionally looked at https://json-schema.org/understanding-json-schema/structuring.html , which explains some of these concepts in a slightly less formal way than the specs that you linked to (but I won't claim to have thoroughly understood all that, and admit that I did not read the technical version of the specs and all the RFCs that are necessary to really understand that).
My (somewhat shallow) understanding seems to be in line with what you said in a more profound and elaborate form. Roughly:
- $id is not really 'the place where the schema file can be found'
- $id can not be used as a basis for resolving $refs
- $id might be used as a real identifier in wetzel (i.e. actually the key of a dictionary for something = dictionary[schema.$id]), because it should be unique
- [...] $ref that does not use an ID, but a filename. So this still leaves the question open about where and how exactly a $ref should be resolved. What is the actual URI for resolving a $ref? Yes, implementations SHOULD 'know' that....
const baseUrl = MagicalUrlFairy.whereAreWe();
but doing that in a 'spec-compliant' way that works in all cases that are covered by the spec, and (!) in all cases that appear in the real world can be difficult. Imagine you find a real-world schema that contains a $ref like
"$ref": "example.json"
You could argue that this is wrong, and it should use a proper ID (and that's correct). But that's not what's happening. So where, exactly, is the example.json schema file? That depends on where you found the schema that contains this $ref, and in the case of wetzel, this may have been found in one of the 'search paths' that have been given at the command line...
An aside: All this does not yet address the issue of fragments in $refs. Covering the cases of
"$ref": "#example"
"$ref": "#/definitions/example"
"$ref": "foo.schema.json#/definitions/example"
"$ref": "https://example.com/foo#/definitions/example"
is not entirely trivial. (Some related code is in some branch, but again: This is faaar from perfect - it just 'worked for me', as far as I needed it...)
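For what it's worth, the four $ref shapes listed above can be told apart with a small amount of code once the reference has been resolved against a base URI. A hedged sketch: splitRef, fragmentKind and evaluatePointer are invented names, the base URI is an arbitrary example, and the lookup of plain-name fragments (declared via "$anchor" in recent drafts, or via a fragment-only id in older ones) is deliberately left out.

```javascript
// Split a $ref into the absolute URI of the target document and its fragment.
function splitRef(ref, baseUri) {
  const uri = new URL(ref, baseUri);
  const fragment = decodeURIComponent(uri.hash.slice(1));
  uri.hash = "";
  return { documentUri: uri.href, fragment };
}

// An empty fragment or one starting with "/" is a JSON Pointer;
// anything else is a plain-name anchor that needs a separate lookup.
function fragmentKind(fragment) {
  return fragment === "" || fragment.startsWith("/") ? "pointer" : "anchor";
}

// Evaluate a JSON Pointer (RFC 6901) such as "/definitions/example".
function evaluatePointer(document, pointer) {
  if (pointer === "") return document;
  return pointer
    .split("/")
    .slice(1)
    .map((token) => token.replace(/~1/g, "/").replace(/~0/g, "~"))
    .reduce((node, token) => (node == null ? undefined : node[token]), document);
}

const refBase = "https://example.com/schemas/base.schema.json"; // assumed base URI
for (const ref of [
  "#example",
  "#/definitions/example",
  "foo.schema.json#/definitions/example",
  "https://example.com/foo#/definitions/example",
]) {
  const { documentUri, fragment } = splitRef(ref, refBase);
  console.log(ref, "->", documentUri, fragmentKind(fragment), JSON.stringify(fragment));
}
```

The document part then goes through whatever schema lookup is in place, and only the fragment part needs the pointer or anchor handling.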
You seem to read the specs on a more detailed level than I do. So maybe I can throw in that random question here, which I carved out as some sort of "quiz". Consider the following schema:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "base.schema.json",
  "title": "Base",
  "type": "object",
  "definitions": {
    "example": {
      "type": "string"
    }
  },
  "additionalProperties": {
    "$ref": "#/definitions/example"
  }
}
It defines definitions/example to be of type string.
Now consider this one, "extending" it (even though there is no real 'inheritance' going on) :
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "extended.schema.json",
  "title": "Extended",
  "type": "object",
  "$ref": "base.schema.json",
  "definitions": {
    "example": {
      "type": "number"
    }
  },
  "additionalProperties": {
    "$ref": "#/definitions/example"
  }
}
It defines definitions/example to be of type number.
Question 1: What type may the additional properties have so that they conform to the second schema?
- string
- number
Question 2: Are you sure about your answer to question 1?
Still, the specification and behavior of "id" has been present since draft-03
I have read this 'change log', but admittedly, I will not read through each link. But a big 👍 for that nevertheless, because I might take a closer look at the links when this becomes immediately relevant for my work, and in any case, it is a useful overview (maybe for the case that someone wants to support multiple draft versions).
To summarize it, subjectively: The id/$id (sic!) was always present, but important details have changed considerably (or at least, been clarified) throughout the draft versions. If somebody was supposed to implement something like wetzel from scratch, 'on a green field', it would be far easier to look at the latest spec and follow it diligently. Or to directly address the bottom line:
In other words, $id has always been supposed to identify a schema, and schemas have always been required to be addressable by their $id.
This has never been followed in glTF, and it was never implemented in wetzel.
Of course "some limitations" is completely reasonable, but schema IDs are a fundamental feature of JSON Schema. ... To be clear, my intent isn't to dish out negative criticism or make demands. What I mean is: Wetzel can obviously do whatever you want it or need it to do, but I strongly feel that if it's not going to be given some more compliant behavior, then it at least ought to be removed from json-schema.org's front page given its current level of compliance.
That's all fine for me. I'm also only a user of wetzel. It does what it was intended for, but there are many aspects of the JSON schema that it never handled correctly, and many aspects that it never handled at all (roughly: because it wasn't necessary for glTF).
Or to put it that way: Wetzel MUST SHOULD be improved in many ways.
That's definitely helpful. Essentially, the following URIs are defined: ... And resolution is performed by: ...
I went through some of these steps/approaches while I tried to use wetzel for a more complex schema. I originally tried to do these changes incrementally, in a somewhat backward-compatible way. But at some point, I had to 'burn some bridges', because the necessary changes completely changed the original implementation, and of course, the refactored state is still far from perfect, and vastly different from something that one could do when...
I also considered using the $id for actual lookups (i.e. as a real identifier), but considering that this is not sufficient for actually resolving a $ref, one still has to carry along the "actual base URL" together with the $id (i.e. one of the "search paths"). Some of that is addressed in the SchemaRepository of the refactored state, but not in a deeply spec-compliant way.
Incidentally, AJV's loadSchema callback ...
I occasionally looked at AJV. It is a project with ~11000 stars, ~2600 commits, ~150 releases, billion-dollar companies as sponsors, 180 contributors, (and still, 169 open issues and 29 pending pull requests). It's an entirely different category of project than wetzel. One may find some "inspiration" there, in terms of spec-compliant handling of details like $id and $ref. But carving out the relevant parts (and translating them to JavaScript) does not seem to be a reasonable approach - and even if someone did that: If the result was something that couldn't re-generate the glTF spec, verbatim, then it would be moot...
As for glTF, those schemas are technically non-compliant with the current draft. ... I could very reasonably go over to the glTF issue page and request that their schemas be given absolute URIs (don't worry, I won't, that'd be kind of a dick move given the current conversation, 😂).
I just did that dick move: https://github.com/KhronosGroup/glTF/issues/2182 . It is a valid point, so why not. The fact that glTF and wetzel are somewhat "coupled" should not prevent improvements on either side. But even when the $id values in glTF are changed: This will not immediately affect wetzel. As I said: The $id is not used at all right now, so any way of taking it into account would require a considerable refactoring.
I used to be self-conscious about my long comments, but now it's just 🤷♂️, haha. I actually edited it down.
Anyways, I will 100% read your reply and address what I can; at the moment I accidentally went down a bit of a rabbit hole. You might be interested in the active conversations at https://github.com/orgs/json-schema-org/discussions/197 -- in particular the thread starting from here.
I skimmed over that thread, but may have to re-read it (and some of the spec references mentioned here and there) to get a clearer picture.
A very high-level recommendation seems to be: "Rely on the $id (and sort out the "retrieval" independently of that)".
That certainly could simplify some structures and the implementation tremendously (sorting out actual responsibilities of code paths - divide et impera). And as I mentioned above: I tried that, to some extent - essentially, to populate the SchemaRepository from the refactored state, and then only use the $id for lookups in this repository.
But (with a bit of handwaving: ) given the lack of 'proper' IDs in real schemas, and the existing lookup mechanisms in wetzel, and the difficulties of sorting out and carrying along actual 'retrieval URIs', possible ambiguities for definitions, and (of course, the blanket excuse: ) a lack of time, I ended up with the state that is not ideal (as it only uses a real URL instead of the $id for lookups), but worked for my purposes...
Wetzel version: Whatever it is in git right now. OS: Windows 10 Node: 16.13.1
Given the following two schemas placed in a subdirectory named schemas:
schemas\a.json:
schemas\b.json:
When I run:
Wetzel fails with:
Why isn't it loading a.json and how do I make it find the references? Is my understanding of the -i option incorrect?
I tried hand-wavily adding -s schemas as well, but the result was the same.
Thanks!