add notes on schema generation strategies

gregsdennis commented 4 months ago

I've added a bunch of notes regarding the decisions made for this POC. Mostly they boil down to:

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.
It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.
The true schema is preferred over the empty schema {} (especially as a subschema for additionalProperties, which historically has accepted booleans even before booleans were valid schemas).

There are a few other more targeted notes and corrections as well.

I don't intend for this PR to merge. It's just a vehicle for notes.

eiriktsarpalis commented 4 months ago

Thanks for the feedback. A few remarks on my side:

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.

If I'm understanding the issue correctly, it's not about the generated schema not being correct, but rather that the validation tests haven't been configured correctly?

It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.

That wouldn't address recursive types though, right?

The true schema is preferred over the empty schema {} (especially as a subschema for additionalProperties, which historically has accepted booleans even before booleans were valid schemas).

I did this intentionally because I wanted to ensure that the schema corresponding to each generated type is JsonObject to allow user modifications and additions via the OnSchemaGenerated callback. So for particular types {} is used instead of true and { "not" : "true" } is used instead of false.

gregsdennis commented 4 months ago

format isn't a validating keyword by default. For the purposes of the tests, they can be treated as such in my lib by setting RequireFormatValidation = true. Other validators might have similar settings, but they'd be opt-in.

If I'm understanding the issue correctly, it's not about the generated schema not being correct, but rather that the validation tests haven't been configured correctly?

Not just the validation tests. Any validator would be expected to ignore assertions for format. This is of primary concern when a generated schema is shared with another team which may use a different validator implementation.

It's considered bad practice to $ref directly into the schema structure. It's recommended to extract common subschemas to a $defs keyword.

That wouldn't address recursive types though, right?

Recursive types work just fine. Ideally, you can reference the root schema $ref: # or a subschema of $defs.

Linked list

{
  "type": "object",
  "properties": {
    "value": { "type": "integer" },
    "next": { "$ref": "#" }
  }
}

Object with linked list property

{
  "type": "object",
  "properties": {
    "list": { "$ref": "#/$defs/linked-list" }
  },
  "$defs": {
    "linked-list": {
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#/$defs/linked-list" }
      }
    }
  }
}

You could also give the linked list an $id which would define that subschema as a resource. Then the $ref pointer could stay relative to that resource.

Object with identified linked list property

{
  "$id": "https://dotnet.org/schemas/linked-list-container",
  "type": "object",
  "properties": {
    "list": { "$ref": "linked-list" }  // resolves to https://dotnet.org/schemas/linked-list
  },
  "$defs": {
    "linked-list": {
      "$id": "linked-list",  // https://dotnet.org/schemas/linked-list
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#" }  // resolves back to https://dotnet.org/schemas/linked-list
      }
    }
  }
}

eiriktsarpalis commented 4 months ago

You could also give the linked list an $id which would define that subschema as a resource.

One concern I had with using $id or $anchor is that from the perspective a general-purpose type-to-schema exporter there's no good way to generate non-conflicting identifiers without effectively embedding the fully qualified name of the type. I discussed that possibility with @captainsafia and we came to the conclusion that such an approach would be leaking implementation detail from the .NET type system into the generated schema.

gregsdennis commented 4 months ago

You could take the fully-qualified type name, hash it to a 128-bit int, and render it as a GUID URI. That would abstract it sufficiently from .Net while still providing a consistent map. Collisions could still occur, but would be extremely rare.

eiriktsarpalis commented 4 months ago

You could take the fully-qualified type name, hash it to a 128-bit int, and render it as a GUID URI.

Wouldn't that make it look like a solution file? Wouldn't get many points for human readability.

captainsafia commented 4 months ago

At least for the purposes of OpenAPI, there's a requirement for reference IDs to be human readable and semantically meaningful. The reference ID also has to be unique to the schema, and not the .NET type. For example, sequence-types that effectively map to the same schema (e.g. all types that would be represented by { type: "array", items: { type: "string" } } would be identified by the same reference ID in the top-level schema store (List<string>, IList<string>)

gregsdennis commented 4 months ago

there's a requirement for reference IDs to be human readable and semantically meaningful

OpenAPI v3.0- actually doesn't allow identifiers in schemas.

I've searched through 3.1, and I see no such requirement.

gregsdennis commented 4 months ago

The reference ID also has to be unique to the schema, and not the .NET type.

Then you'd have to identify unique subschemas and extract them. That's actually what I do in JsonSchema.Net.Generation. I find this approach works better because a type on its own will have a schema, but a property of that type could have additional requirements imposed by property-level attributes which would yield a different schema.

Scanning the generated schema structure (or tracking it during generation) for common subschemas (not common JSON data; they must be subschemas) will yield a good result.

captainsafia commented 4 months ago

OpenAPI v3.0- actually doesn't allow identifiers in schemas.

The "reference IDs" I'm referring to here are specifically in reference to the identifier used when constructing the url in the $ref property in a schema (e.g. ref: "#/components/schemas/ThisIsTheIdToConstruct).

gregsdennis commented 4 months ago

I recommend you have two processes:

generate the schema, with all references intact
embed the schema in some other kind of document, which performs pointer transformations as necessary

For this effort, don't worry about (2).

FWIW, OpenAPI is the reason my SchemaRegistry accepts and returns IBaseDocument instead of just JsonSchema. It recognizes that the root document may not itself be a schema. Everything needs an ID, though. If I get a schema that doesn't have one, I generate one.

The thing is, you're not going to get around needing to use $ref. It's the only way to solve recursion. $anchor isn't a bad option, actually.

{
  "type": "object",
  "properties": {
    "list": { "$ref": "#linked-list" }
  },
  "$defs": {
    "linked-list": {
      "$anchor": "linked-list",
      "type": "object",
      "properties": {
        "value": { "type": "integer" },
        "next": { "$ref": "#linked-list" }
      }
    }
  }
}

You just need to be sure your anchor is unique. Again, this is the problem you're facing, which is why I recommend less-readable GUIDs. (Editors generally have "find" functions.)

eiriktsarpalis / stj-schema-mapper

add notes on schema generation strategies #2