GREsau / schemars

Generate JSON Schema documents from Rust code
https://graham.cool/schemars/
MIT License

#[schemars(required)] special case for Option<T> #337

Open functorism opened 1 week ago

functorism commented 1 week ago

#[schemars(required)]

When set on an Option<T> field, this will create a schema as though the field were a T.

Some consumers of JSON Schema, such as OpenAI Structured Output, require every field to be listed as required. They also require that "additionalProperties": false always be set (which #[serde(deny_unknown_fields)] ensures).

The special #[schemars(required)] case for Option<T> makes it complicated to correctly express a schema that includes optional values (in the sense that we want null as output). If this special case were opt-in, or if there were an opt-out, this friction would go away.

GREsau commented 1 week ago

Skimming through the OpenAI docs, it looks like this schema would define how a type is serialized rather than how it's deserialized, is that correct? If so, since the latest schemars alpha version (1.0.0-alpha.15), you can specify that you want the "serialize" schema by constructing a SchemaSettings and calling the for_serialize() method, e.g.:

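use schemars::generate::SchemaSettings;

// Generate the schema for the serialized form rather than the deserialized form.
let schema = SchemaSettings::default()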
    .for_serialize()
    .into_generator()
    .into_root_schema_for::<MyStruct>();

Then, all fields will be included in the "required" array, even if they're Options. Of course, this assumes that none of the fields have #[serde(skip_serializing_if = ...)]/#[schemars(skip_serializing_if = ...)], because in that case it's possible for the field to not be included in the serialized output.
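
The difference between the two contracts can be sketched with a toy model. This is a std-only illustration, not schemars internals: FieldInfo, Contract, required_fields and readme_like_fields are hypothetical names, and the model ignores other attributes (such as #[serde(default)]) that can also affect "required".

```rust
/// Toy description of a struct field; hypothetical, not a schemars type.
#[derive(Debug, Clone)]
pub struct FieldInfo {
    pub name: &'static str,
    /// True when the Rust type is `Option<T>`.
    pub is_option: bool,
    /// True when the field carries `skip_serializing_if = ...`.
    pub skip_serializing_if: bool,
}

/// Which side of serde the schema is meant to describe.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Contract {
    Deserialize,
    Serialize,
}

/// Names that would land in the "required" array under this toy model.
pub fn required_fields(fields: &[FieldInfo], contract: Contract) -> Vec<&'static str> {
    fields
        .iter()
        .filter(|f| match contract {
            // Input may omit an `Option<T>` field, so it is not required.
            Contract::Deserialize => !f.is_option,
            // Output always contains the key unless serialization may skip it.
            Contract::Serialize => !f.skip_serializing_if,
        })
        .map(|f| f.name)
        .collect()
}

/// Fields shaped like the readme's `MyStruct` (assumed here for illustration).
pub fn readme_like_fields() -> Vec<FieldInfo> {
    vec![
        FieldInfo { name: "my_int", is_option: false, skip_serializing_if: false },
        FieldInfo { name: "my_bool", is_option: false, skip_serializing_if: false },
        FieldInfo { name: "my_nullable_enum", is_option: true, skip_serializing_if: false },
    ]
}
```

Under this model the Option field is required only for the serialize contract, and it drops out again as soon as skip_serializing_if is set, which is the caveat described above.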

As for always setting "additionalProperties": false, you could either attach #[serde(deny_unknown_fields)] to all of your types, or alternatively you could write a custom Transform which forcibly sets that property on all schemas that define "properties". The easiest way to do this would be to define it as a fn(&mut Schema), and wrap it in RecursiveTransform to make it also apply to subschemas of the root schema e.g.:

RecursiveTransform(|schema: &mut Schema| {
    if schema.get("properties").is_some() {
        schema.insert("additionalProperties".to_owned(), false.into());
    }
})

I see there are some other restrictions on OpenAI structured output schemas, e.g. the docs don't mention the "const" or "oneOf" properties, so I assume those aren't supported either. Schemars has a built-in transform to replace "const" with a single-valued "enum" (ReplaceConstValue), and it's easy enough to amend our custom transform above to also replace "oneOf" with "anyOf", which should behave similarly enough for most cases.
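
To make the three rewrites concrete, here is a minimal std-only sketch on a hand-rolled JSON tree. Json, obj and make_strict are hypothetical names used only for illustration; this is not how schemars implements RecursiveTransform or ReplaceConstValue. One simplification to be aware of: the walk treats every nested object as a schema, so a struct field literally named "properties", "const" or "oneOf" would be rewritten by mistake.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value standing in for `serde_json::Value` in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Convenience constructor for JSON objects.
pub fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Obj(pairs.into_iter().map(|(k, v)| (k.to_owned(), v)).collect())
}

/// Rewrite one node, then recurse into everything nested inside it.
pub fn make_strict(node: &mut Json) {
    match node {
        Json::Obj(map) => {
            // Forbid unknown keys wherever "properties" is defined.
            if map.contains_key("properties") {
                map.insert("additionalProperties".to_owned(), Json::Bool(false));
            }
            // Move the "oneOf" alternatives under "anyOf".
            if let Some(one_of) = map.remove("oneOf") {
                map.insert("anyOf".to_owned(), one_of);
            }
            // Replace "const": v with the single-valued "enum": [v].
            if let Some(value) = map.remove("const") {
                map.insert("enum".to_owned(), Json::Arr(vec![value]));
            }
            for child in map.values_mut() {
                make_strict(child);
            }
        }
        Json::Arr(items) => items.iter_mut().for_each(make_strict),
        _ => {}
    }
}
```

Calling make_strict on a root schema mutates it in place, which mirrors the fn(&mut Schema) shape of the custom transform described above.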

None of the examples given in the OpenAI docs include a meta-schema ("$schema" property), so we can also clear the default meta_schema from our settings.

Putting it all together, we get:

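use schemars::{
    generate::SchemaSettings,
    transform::{RecursiveTransform, ReplaceConstValue},
    Schema,
};

// Serialize-contract schema with no "$schema" keyword, plus the OpenAI-oriented transforms.
let settings = SchemaSettings::default()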
    .for_serialize()
    .with(|s| s.meta_schema = None)
    .with_transform(ReplaceConstValue)
    .with_transform(RecursiveTransform(|schema: &mut Schema| {
        if schema.get("properties").is_some() {
            schema.insert("additionalProperties".to_owned(), false.into());
        }

        if let Some(one_of) = schema.remove("oneOf") {
            schema.insert("anyOf".to_owned(), one_of);
        }
    }));

let schema = settings.into_generator().into_root_schema_for::<MyStruct>();

println!("{}", serde_json::to_string_pretty(&schema).unwrap());

For the example MyStruct/MyEnum types in the readme, this outputs the schema:

{
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_bool": {
      "type": "boolean"
    },
    "my_int": {
      "type": "integer",
      "format": "int32"
    },
    "my_nullable_enum": {
      "anyOf": [
        {
          "$ref": "#/$defs/MyEnum"
        },
        {
          "type": "null"
        }
      ]
    }
  },
  "additionalProperties": false,
  "required": [
    "my_int",
    "my_bool",
    "my_nullable_enum"
  ],
  "$defs": {
    "MyEnum": {
      "anyOf": [
        {
          "type": "object",
          "properties": {
            "StringNewType": {
              "type": "string"
            }
          },
          "additionalProperties": false,
          "required": [
            "StringNewType"
          ]
        },
        {
          "type": "object",
          "properties": {
            "StructVariant": {
              "type": "object",
              "properties": {
                "floats": {
                  "type": "array",
                  "items": {
                    "type": "number",
                    "format": "float"
                  }
                }
              },
              "additionalProperties": false,
              "required": [
                "floats"
              ]
            }
          },
          "additionalProperties": false,
          "required": [
            "StructVariant"
          ]
        }
      ]
    }
  }
}

...which I think conforms to the OpenAI requirements.

functorism commented 1 week ago

First off, what an absolutely amazing and truly stellar reply. Much appreciated!

I will try this out as soon as possible.

e.g. the docs don't mention the "const" or "oneOf" properties,

At this moment the "strict" option for structured output does not support anyOf nor oneOf - but I think one can expect support to improve in the future. The relatively new "strict" output feature OpenAI has enabled has the potential to be very powerful, especially when coupled with the expressive but type-safe tooling provided by libraries such as schemars (thank you!).

At the moment OpenAI does not support examples, which is schemars' current strategy for the very powerful concept of providing schemas for term-level values as opposed to type-level contracts. In my opinion this feature could be exceptionally useful when working with the "strict" version of structured output: it allows one to create generative flows where, as decisions are concretized and information is retrieved, we have an ergonomic way of guaranteeing that inference takes them into account.

Since I believe it might take some time before OpenAI starts expanding the JSON Schema feature set for "strict" mode, I'm interested in supporting the basic premise of schema_for_value but in a way which does not rely on examples. The same goes for the other restrictions that have been covered, like rewriting to enum and figuring out ways to avoid modeling with anyOf/oneOf (the latter being the most difficult restriction, no doubt).

GREsau commented 1 week ago

At this moment the "strict" option for structured output does not support anyOf nor oneOf

Are you sure? I was going by https://platform.openai.com/docs/guides/structured-outputs/supported-schemas which says it does support anyOf, just not in the root object

functorism commented 1 week ago

Are you sure?

Ah, yes, you are right, which means your oneOf -> anyOf transform is a great solution.

Which obviously makes things infinitely easier.

In regard to examples - do you have some grasp on how you think it could make the most sense to approach this? I believe the current way schema_for_value works is good in general; so any solution I would explore would be to work around OpenAI limitations specifically.

GREsau commented 1 week ago

In regard to examples - do you have some grasp on how you think it could make the most sense to approach this? I believe the current way schema_for_value works is good in general; so any solution I would explore would be to work around OpenAI limitations specifically.

I don't fully understand what you're trying to do - is it just that you'd like to generate a schema for an example value, but without the examples property in the output? That's easy enough to achieve by just removing that property once you've got the schema, e.g.:

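use schemars::schema_for_value;
use serde_json::json;

// Infer a schema from the example value, then drop the "examples" keyword.
let mut schema = schema_for_value!(json!({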
    "i": 123,
    "s": "hello world",
    "o": {
        "a": [true, false],
    }
}));
schema.remove("examples");

This produces the schema:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "i": {
      "type": "integer"
    },
    "s": {
      "type": "string"
    },
    "o": {
      "type": "object",
      "properties": {
        "a": {
          "type": "array",
          "items": {
            "type": "boolean"
          }
        }
      }
    }
  }
}

functorism commented 1 week ago

It's more about producing a schema that expresses the exact values without relying on examples, meaning transforming everything to const in the schema.

GREsau commented 6 days ago

I'm still not sure I understand - you have a type that can only have one specific value? If so, there's not much point using schemars at all, you can "generate" a schema via something like serde_json::json!({ "const": my_exact_value })
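
For readers following the "everything as exact values" idea, one possible shape is sketched below on a hand-rolled JSON tree. It is a std-only illustration under stated assumptions: Json, obj and exact_value_schema are hypothetical names, not schemars or serde_json APIs. Objects become object schemas with every key required and "additionalProperties": false, and every other value is pinned with a single-valued "enum" (the "const" replacement discussed earlier in the thread). Whether a given consumer accepts this shape has not been verified here.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value standing in for `serde_json::Value` in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Convenience constructor for JSON objects.
pub fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Obj(pairs.into_iter().map(|(k, v)| (k.to_owned(), v)).collect())
}

/// Build a schema that only the given value satisfies (hypothetical helper).
pub fn exact_value_schema(value: &Json) -> Json {
    match value {
        Json::Obj(map) => {
            // Each member gets its own exact-value subschema.
            let properties: BTreeMap<String, Json> = map
                .iter()
                .map(|(key, member)| (key.clone(), exact_value_schema(member)))
                .collect();
            // Every key is required; BTreeMap yields them in sorted order.
            let required: Vec<Json> = map.keys().cloned().map(Json::Str).collect();
            obj(vec![
                ("type", Json::Str("object".to_owned())),
                ("properties", Json::Obj(properties)),
                ("required", Json::Arr(required)),
                ("additionalProperties", Json::Bool(false)),
            ])
        }
        // Scalars and arrays are pinned with a single-valued "enum".
        other => obj(vec![("enum", Json::Arr(vec![other.clone()]))]),
    }
}
```

If the whole value may simply be wrapped, the one-liner suggested above, { "const": my_exact_value }, is the shorter route; the per-member form is only of interest when the root has to remain an object schema.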