dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Support exporting STJ serialization contracts to JSON schema #100159

Closed eiriktsarpalis closed 5 months ago

eiriktsarpalis commented 8 months ago

Background and motivation

The recent popularity of function calling capabilities in LLMs as well as the upcoming OpenAPI work in ASP.NET Core has highlighted the importance of a System.Text.Json component that is capable of exporting its own serialization contracts (JsonTypeInfo) to JSON schema documents. Such a component should ideally satisfy the following criteria:

  1. The resultant JSON schema should honor configuration as specified in JsonSerializerOptions and POCO attribute annotations (e.g. JsonNamingPolicy, JsonNumberHandling, JsonPropertyName, JsonIgnore, etc.)
  2. The resultant JSON schema should be consistent with STJ serialization semantics. That is to say that all JSON produced by the serializer should be valid under the schema, and all JSON that is valid under the schema should be accepted by the deserializer.
  3. The component should support .NET Standard 2.0 and .NET Framework.
  4. The component should support source generated contracts and be compatible with Native AOT.

I wrote a prototype that attempts to address the above design goals, and this was largely achieved by tapping into the metadata exposed by the STJ contract model. That being said, the existing contract APIs do not expose all metadata that is necessary to construct a schema, so in many cases the implementation had to resort to private reflection or outright replication of STJ internals. At the same time, the core mapping logic itself requires acute understanding of STJ esoterica, so it cannot be expected that such a component could be sustainably maintained by third-party authors.

I'm creating this issue to track .NET 9 work related to JSON schema extraction. The scope is related to and overlaps with https://github.com/dotnet/runtime/issues/29887 but doesn't necessarily coincide with it. At a high level, it is tracking the following goals (in order of importance):

  1. Expose new APIs in the contract model that make it possible to extract the schema without use of private reflection or duplication of internal STJ logic.
  2. Add a built-in component mapping JsonTypeInfo contracts to JSON schema documents. Most users should be able to use it directly, but it would also serve as a reference implementation for those who want to map to bespoke formats (e.g. OpenAPI YAML).
  3. Add a JsonSchema exchange type. This is a stretch goal for .NET 9 since it would likely necessitate implementing support for the full JSON schema specification (whereas a mapper need only target a subset of the spec).

Work Items

  1. Contract API extensions & enhancements
  2. Built-in schema mapper
  3. ~JSON schema exchange type.~ (Cut for .NET 9)

gregsdennis commented 7 months ago

Is the label-tagger bot broken?

eiriktsarpalis commented 7 months ago

It was having issues a few weeks back.

benlongo commented 7 months ago

How does this interact with discriminated unions (e.g. [JsonPolymorphic(...)])? I'm not very familiar with the intricacies of STJ contracts and JSON schema vs. OpenAPI, but I believe there are some difficulties lying around here. My primary concern is that we will end up with broken OpenAPI schemas due to the inability to properly express discriminated unions (see https://swagger.io/docs/specification/data-models/inheritance-and-polymorphism/).

eiriktsarpalis commented 7 months ago

@benlongo after a bit of experimentation I ended up using anyOf in the prototype.

benlongo commented 7 months ago

Hi @eiriktsarpalis, thanks for looking into this! As a preface, I'm not very familiar with the intricacies of JSON Schema, so take anything I say with a grain of salt :)

Regarding the in-progress OpenAPI work (on which I've left a related comment: https://github.com/dotnet/aspnetcore/issues/54598#issuecomment-2080215043), I'm concerned that the difference between JSON Schema and OpenAPI will cause paper cuts around discriminated unions. If the OpenAPI implementation naively delegates schema generation for discriminated unions, then things won't work properly. We use a lot of discriminated unions in our data model, so I'm very invested in this working properly.

I'll use the example objects (modified slightly) from the OpenAPI 3.1.0 spec (https://swagger.io/specification/#discriminator-object). This translates to the following STJ model.

using System.Text.Json.Serialization;

// The base type declares the discriminator property and its known derived types.
[JsonPolymorphic(TypeDiscriminatorPropertyName = "petType")]
[JsonDerivedType(typeof(Cat), Cat.PetType)]
[JsonDerivedType(typeof(Dog), Dog.PetType)]
[JsonDerivedType(typeof(Lizard), Lizard.PetType)]
public abstract record Animal;

public record Cat : Animal {
    public const string PetType = "cat";

    public required string Name { get; init; }
}

public record Dog : Animal {
    public const string PetType = "dog";

    public required string Bark { get; init; }
}

public record Lizard : Animal {
    public const string PetType = "lizard";

    public required bool LovesRocks { get; init; }
}

In JSON Schema world, I would expect this to get mapped to something very similar to what you have in your prototype: an anyOf or oneOf with constant string discriminators.

In OpenAPI world however, discriminated unions are handled differently. I would expect the following OpenAPI schema to be generated for Animal.

Animal:
  oneOf:
    - $ref: '#/components/schemas/Cat'
    - $ref: '#/components/schemas/Dog'
    - $ref: '#/components/schemas/Lizard'
  discriminator:
    propertyName: petType
    mapping:
      cat: '#/components/schemas/Cat'
      dog: '#/components/schemas/Dog'
      lizard: '#/components/schemas/Lizard'

I don't think that the STJ JSON Schema library should be aware of OpenAPI peculiarities, but I definitely think the proper escape hatches need to exist so that the OpenAPI implementation can generate the correct schema. I have no idea what those escape hatches look like, or if they already exist, but I can imagine how a simple implementation of OpenAPI would result in this being difficult or impossible to express. The OpenAPI implementation will have to be aware of the underlying contract somehow to bypass JSON Schema generation for certain cases like this.

As an aside, based on https://json-schema.org/understanding-json-schema/reference/combining it seems like it could make more sense to use oneOf instead of anyOf (at least when discriminators are involved). However, I guess if every value has a discriminator then the consumer of a payload is guaranteed that anyOf implies oneOf. I'm not sure what the material difference of this is in the real world, but I could see anyOf making sense due to the highlighted performance consideration from the linked docs:

> Careful consideration should be taken when using oneOf entries as the nature of it requires verification of every sub-schema which can lead to increased processing times. Prefer anyOf where possible.

One place I could see anyOf going wrong in the real world is a TypeScript generator not realizing it can create a union type for all the variants of an anyOf. With oneOf, a TypeScript generator would not have to do any hard work to know that translating to a union is valid.

eiriktsarpalis commented 7 months ago

> I don't think that the STJ JSON Schema library should be aware of OpenAPI peculiarities, but I definitely think the proper escape hatches need to exist so that the OpenAPI implementation can generate the following schema.

I agree with that sentiment; it's something we've been looking at solving with @captainsafia. The prototype uses a callback API that lets users append or modify JSON schema documents based on the presence of particular properties, although this particular use case makes things trickier.
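For illustration, a minimal sketch of what such a callback could look like. It assumes the exporter API that later shipped in .NET 9 (`JsonSchemaExporter.GetJsonSchemaAsNode` together with `JsonSchemaExporterOptions.TransformSchemaNode`), not the prototype discussed in this thread. The `discriminator` object it appends and the `#/components/schemas/{TypeName}` references are assumptions made for the example; the exporter does not emit them on its own.

```csharp
using System.Text.Json;
using System.Text.Json.Nodes;
using System.Text.Json.Schema;
using System.Text.Json.Serialization;
using System.Text.Json.Serialization.Metadata;

[JsonPolymorphic(TypeDiscriminatorPropertyName = "petType")]
[JsonDerivedType(typeof(Cat), "cat")]
[JsonDerivedType(typeof(Dog), "dog")]
public abstract record Animal;
public record Cat : Animal { public required string Name { get; init; } }
public record Dog : Animal { public required string Bark { get; init; } }

public static class SchemaCallbackDemo
{
    public static JsonNode ExportAnimalSchema()
    {
        var exporterOptions = new JsonSchemaExporterOptions
        {
            // Invoked for each generated schema node; the context exposes the STJ contract.
            TransformSchemaNode = (context, schema) =>
            {
                JsonPolymorphismOptions? poly = context.TypeInfo.PolymorphismOptions;
                if (poly is null || schema is not JsonObject obj)
                    return schema; // not a polymorphic type, leave the node untouched

                // Hypothetical mapping: assumes component names equal the CLR type names.
                var mapping = new JsonObject();
                foreach (JsonDerivedType derived in poly.DerivedTypes)
                {
                    if (derived.TypeDiscriminator is { } id)
                        mapping[id.ToString()!] = "#/components/schemas/" + derived.DerivedType.Name;
                }

                obj["discriminator"] = new JsonObject
                {
                    ["propertyName"] = poly.TypeDiscriminatorPropertyName,
                    ["mapping"] = mapping,
                };
                return obj;
            }
        };

        return JsonSerializerOptions.Default.GetJsonSchemaAsNode(typeof(Animal), exporterOptions);
    }
}
```

Calling `SchemaCallbackDemo.ExportAnimalSchema().ToJsonString()` should return the generated schema for `Animal` with the extra `discriminator` object attached to the polymorphic node. An OpenAPI generator would still need to rewrite the inlined variants into `$ref` entries, which this sketch does not attempt.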

> As an aside, based on https://json-schema.org/understanding-json-schema/reference/combining it seems like it could make more sense to use oneOf instead of anyOf

The problem with oneOf is that you could have two separate derived types whose schema matches a given JSON document (this is possible because type discriminators are optional in STJ).
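A small sketch of one way this overlap can arise, using hypothetical types: `[JsonDerivedType]` accepts a derived type without a discriminator value, in which case nothing in the serialized JSON identifies the runtime type.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical model: both derived types are registered WITHOUT a discriminator.
[JsonDerivedType(typeof(Celsius))]
[JsonDerivedType(typeof(Fahrenheit))]
public record Reading;
public record Celsius(int Value) : Reading;
public record Fahrenheit(int Value) : Reading;

public static class OptionalDiscriminatorDemo
{
    // Serializes through the polymorphic base type, as an API returning Reading would.
    public static string SerializeAsBase(Reading reading) =>
        JsonSerializer.Serialize<Reading>(reading);
}
```

Both `SerializeAsBase(new Celsius(21))` and `SerializeAsBase(new Fahrenheit(21))` produce `{"Value":21}`. A schema listing the two variants would therefore have two subschemas matching the same document, which oneOf rejects and anyOf accepts.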

gregsdennis commented 7 months ago

> I'm concerned that the difference between JSON Schema and OpenAPI...

For some additional context, OpenAPI 3.1 is built on JSON Schema 2020-12 by default. Even previous versions of OpenAPI use a modified JSON Schema draft 4.

Discriminated unions aren't a problem that JSON Schema has. They're a problem that C# has.

The discriminator keyword is an OpenAPI addition. JSON Schema evaluation will return the content of the keyword as an annotation, where OpenAPI will continue processing. The oneOf/anyOf performs the actual validation to ensure that the data is expected; OpenAPI uses discriminator combined with the evaluation results to determine which subschema was valid.

benlongo commented 7 months ago

> The problem with oneOf is that you could have two separate derived types whose schema matches a given JSON document (this is possible because type discriminators are optional in STJ).

For the fully general case, anyOf definitely makes sense. Perhaps oneOf should be reserved for cases where all variants have a defined discriminator?

I'm also not sure how one would define numeric discriminators in the OpenAPI mapping property.

> For some additional context, OpenAPI 3.1 is built on JSON Schema 2020-12 by default. Even previous versions of OpenAPI use a modified JSON Schema draft 4.

Thanks for the context. I took a quick skim through draft-05 through 2020-12 and didn't notice anything that should impact schema composition, but I also don't know how exactly the draft-04 version was modified in earlier versions of OpenAPI.

> Discriminated unions aren't a problem that JSON Schema has. They're a problem that C# has.

Just so I understand what you're getting at here: my understanding is that serializing JSON discriminated unions used to be a problem for C# (particularly STJ), but is no longer an issue with [JsonPolymorphic]. I agree they are definitely not an issue for JSON Schema to express, but it is problematic that OpenAPI has this distinct method for encoding them despite JSON Schema already being fully capable of expressing them.

> The discriminator keyword is an OpenAPI addition. JSON Schema evaluation will return the content of the keyword as an annotation, where OpenAPI will continue processing. The oneOf/anyOf performs the actual validation to ensure that the data is expected; OpenAPI uses discriminator combined with the evaluation results to determine which subschema was valid.

Thanks for explaining the annotation behavior - I was not aware of that. If I'm understanding this correctly, the OpenAPI additions (discriminator, etc.) are valid additions according to the JSON Schema spec as they are annotations. In this case, the escape hatches required to make discriminated unions work in OpenAPI world and JSON Schema world correctly may not have to be as extreme as I was imagining.

It sounds like you are describing one possible use of the OpenAPI document at runtime where there is a module validating based on JSON Schema, and then another module adding OpenAPI information onto these results (correct me if I'm wrong there). The only use case I have experience with is client code generation from the OpenAPI document. In this scenario, the JSON Schema may not be used for validation (directly at least), but rather for typing/parsing information. I don't believe that exact flow will always be taken, but the annotations being structurally valid JSON Schema is definitely relevant.

It seems like the addition of discriminator annotations for OpenAPI could potentially be done directly through the JSON Schema library as a post processing step. For example, it could pattern match on oneOf unions where all variants share a common property with constant value, but that seems like it could be brittle and expensive to compute. A cleaner alternative may be to hold onto the STJ contract alongside the generated JSON Schema as context so the OpenAPI generator can reach behind the curtain to figure out if it has a polymorphic type on its hands. Ultimately I think it comes down to whether the JSON Schema implementation wants to expose STJ contract or not.
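To make the pattern-matching idea concrete, here is a rough sketch over `System.Text.Json.Nodes`. The helper name `FindCommonConstProperty` is hypothetical, and the sketch assumes every variant is inlined with its own `properties` object. It gives up on `$ref` variants and does not check that the constant values are distinct, which hints at the brittleness mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;

public static class DiscriminatorInference
{
    // Returns the name of a property that has a "const" value in every variant
    // of the union, or null when no such property exists.
    public static string? FindCommonConstProperty(JsonObject unionSchema, string keyword = "oneOf")
    {
        if (unionSchema[keyword] is not JsonArray variants || variants.Count == 0)
            return null;

        HashSet<string>? candidates = null;
        foreach (JsonNode? variant in variants)
        {
            // Bail out on anything that is not an inlined object schema (e.g. a $ref).
            if (variant is not JsonObject v || v["properties"] is not JsonObject props)
                return null;

            // Property names in this variant whose schema pins a constant value.
            IEnumerable<string> constProps = props
                .Where(p => p.Value is JsonObject o && o.ContainsKey("const"))
                .Select(p => p.Key);

            if (candidates is null)
                candidates = new HashSet<string>(constProps);
            else
                candidates.IntersectWith(constProps);

            if (candidates.Count == 0)
                return null;
        }

        // Pick deterministically if several properties qualify.
        return candidates!.OrderBy(name => name, StringComparer.Ordinal).FirstOrDefault();
    }
}
```

Every variant has to be walked on every union, so the cost grows with the size of the document. Keeping the STJ contract available next to the schema avoids that search entirely.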

> The prototype uses a callback API that lets users append or modify JSON schema documents based on presence of particular properties.

Do these callbacks expose strictly JSON Schema information, or can STJ contract data be accessed through this interface?

Another possibly relevant issue (not really a bug per se) that I've run into with STJ is that sub-types of a [JsonPolymorphic] discriminated union do not have their discriminator serialized. If you have an endpoint that returns a specific sub-type in addition to the endpoint that returns the full union, you may run into issues if you expect that discriminator to be there.
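A minimal reproduction of that behavior, using a trimmed-down version of the model above: the discriminator is written only when the value is serialized through the polymorphic base type.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

[JsonPolymorphic(TypeDiscriminatorPropertyName = "petType")]
[JsonDerivedType(typeof(Cat), "cat")]
public abstract record Animal;

public record Cat : Animal
{
    public required string Name { get; init; }
}

public static class DiscriminatorDemo
{
    // Declared type is the polymorphic base: the discriminator is emitted.
    public static string AsBase(Cat cat) => JsonSerializer.Serialize<Animal>(cat);

    // Declared type is the derived type itself: no discriminator is emitted.
    public static string AsDerived(Cat cat) => JsonSerializer.Serialize(cat);
}
```

For `new Cat { Name = "Tom" }`, `AsBase` returns `{"petType":"cat","Name":"Tom"}` while `AsDerived` returns `{"Name":"Tom"}`. A schema for `Cat` generated on its own would therefore not list `petType`, whereas the `Cat` variant inside the `Animal` union would.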

gregsdennis commented 7 months ago

> I took a quick skim through draft-05 through 2020-12 and didn't notice anything that should impact schema composition, but I also don't know how exactly the draft-04 version was modified in earlier versions of OpenAPI.

Draft 5 is basically the OpenAPI-specific draft 4 variant. Draft 5 is basically never supported outside of OpenAPI.

> serializing JSON discriminated unions used to be a problem for C#

What I mean is that C# doesn't support unions, as such. Using discriminator as a mechanism to express polymorphism is one of the use cases, sure.

> In this scenario, the JSON Schema may not be used for validation (directly at least), but rather for typing/parsing information.

Importantly, JSON Schema isn't a typing system. It's a constraints system. Henry Andrews' excellent blog post explains this difference well.

And code generation (either direction) isn't defined by any specification (yet), so whoever implements it is free to do what they want.

eiriktsarpalis commented 7 months ago

> For the fully general case, anyOf definitely makes sense. Perhaps oneOf should be reserved for cases where all variants have a defined discriminator?

We could try to make the generator a bit more clever and emit oneOf where applicable, but I'm not sure what this would achieve from a validation perspective. Each element would be mutually exclusive in that case regardless, so being consistent with anyOf seems like a better trade-off.

gregsdennis commented 7 months ago

There is an important distinction between anyOf and oneOf: oneOf requires that an implementation evaluate all of the subschemas, whereas anyOf can be short-circuited. In general, anyOf is preferred, especially when it contains subschemas which are already mutually exclusive.
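The difference can be sketched with a toy evaluator. This is not a JSON Schema implementation: subschemas are modeled as plain predicates (an assumption made for brevity), and each method reports how many of them it had to evaluate.

```csharp
using System;
using System.Collections.Generic;

public static class CombinatorDemo
{
    // anyOf: valid as soon as one subschema matches, so evaluation can stop early.
    public static (bool Valid, int Evaluated) AnyOf<T>(T value, IReadOnlyList<Func<T, bool>> subschemas)
    {
        int evaluated = 0;
        foreach (Func<T, bool> subschema in subschemas)
        {
            evaluated++;
            if (subschema(value))
                return (true, evaluated);
        }
        return (false, evaluated);
    }

    // oneOf: valid only if exactly one subschema matches, so a first match
    // is not enough; the remaining subschemas still have to be checked.
    public static (bool Valid, int Evaluated) OneOf<T>(T value, IReadOnlyList<Func<T, bool>> subschemas)
    {
        int evaluated = 0;
        int matches = 0;
        foreach (Func<T, bool> subschema in subschemas)
        {
            evaluated++;
            if (subschema(value))
                matches++;
        }
        return (matches == 1, evaluated);
    }
}
```

With three mutually exclusive predicates testing for `"cat"`, `"dog"` and `"lizard"` in that order, validating `"cat"` gives `(true, 1)` for `AnyOf` and `(true, 3)` for `OneOf`: the same verdict, with more work for oneOf.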