Effect-TS / effect

An ecosystem of tools to build robust applications in TypeScript
https://effect.website
MIT License

From Discord: JSON Schema on Transformations #3016

Open effect-bot opened 4 months ago

effect-bot commented 4 months ago

Summary

Key Takeaways

  1. Annotations on Transformations: There is a need to discuss how annotations like description, examples, etc., should be handled on transformation schemas.
  2. Custom Schemas: Users are encouraged to define their own schemas specific to their domain model for better control.
  3. JSONSchema as One Consumer: JSONSchema is just one possible consumer of a schema, and annotations should ideally be shared by all schema compilers.
  4. Proposal for Options: There is openness to adding options to JSONSchema.make to handle annotations on transformations, but it needs a detailed proposal.
  5. Typing Challenges: Providing examples for transformed schemas can lead to typing challenges, which need to be addressed.


Discord thread

https://discord.com/channels/795981131316985866/1251195898411876505

colinmorelli commented 4 months ago

Thanks for creating the issue! Following up from the conversation in Discord.

For context for others reading along, this discussion started due to my surprise at the following:

```typescript
JSONSchema.make(
  Schema.DateFromString.annotations({ examples: [new Date()], description: 'a date' })
);

/*
{
  type: 'string',
  description: 'a string' // no inclusion of my description or example
}
*/
```

I encourage others to read the thread for more context, but the "problem" here is that annotations attached to transform schemas are effectively ignored when using JSONSchema.make which results in an output that is potentially surprising. So, the main question here is how - or if - to propagate annotations from transform schemas when using JSONSchema.make. The secondary question, depending on the answer to the first, is how to handle that in the general case.

Options

As far as I see it, there are really two major options:

  1. (Current Behavior) Annotations attached to transform schemas are ignored. The argument here is that the annotations attached to the transform schema do not necessarily represent the input to the schema, which is what the JSONSchema is intended to document.
  2. Annotations attached to transform override annotations on the from side of the transform. The argument here is that the transform, by definition, encapsulates behavior of how to transition from the input type to the runtime type, and thus its annotations are relevant documentation to describe how the input will be interpreted.

There are certainly other options available, but I don't think any other credible ones have surfaced yet (please reply if you have an alternative!)

Note that in both of these situations, an assumption is that the jsonSchema annotation is always considered and overrides any other annotations in order to "fully declare" the JSONSchema for a given schema.
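To make option 2 concrete, here is a minimal, dependency-free sketch of the proposed merge rule. The types and the `makeJsonSchema` helper below are hypothetical illustrations of the idea, not the @effect/schema AST or API:

```typescript
// Hypothetical, simplified model of a transform's annotation data; this is
// NOT the @effect/schema AST, just an illustration of option 2's merge rule.
interface Annotations {
  description?: string
  identifier?: string
  examples?: ReadonlyArray<unknown>
}

interface TransformNode {
  from: { jsonSchema: Record<string, unknown>; annotations: Annotations }
  annotations: Annotations // annotations attached to the transform itself
}

// Option 2: start from the `from` side's JSON Schema and annotations,
// then let the transform's own annotations take precedence.
function makeJsonSchema(node: TransformNode): Record<string, unknown> {
  const merged: Annotations = { ...node.from.annotations, ...node.annotations }
  return { ...node.from.jsonSchema, ...merged }
}

// Example: a DateFromString-like transform annotated with a description.
const dateFromString: TransformNode = {
  from: {
    jsonSchema: { type: "string" },
    annotations: { description: "a string" }
  },
  annotations: { description: "a date", examples: ["2024-01-01T00:00:00.000Z"] }
}

// → { type: "string", description: "a date", examples: ["2024-01-01T00:00:00.000Z"] }
const result = makeJsonSchema(dateFromString)
```

The key design choice is spread order: the from side supplies the base JSON Schema, and the transform's own annotations win any conflicts.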

Proposal

I would propose changing the default behavior from option 1 to option 2 above. Effectively, description, identifier, and example properties declared on a transform schema itself would override the equivalent annotations on the from schema to which the transform applies.

My rationale is the following:

The immutability of schemas

Schemas have a built-in mechanism for adapting their annotations to specific use cases: their immutability. Using Schema.propertySignature() to override examples or descriptions at a specific struct field is valuable but, as mentioned above, doesn't apply at all call sites. It also seems unnecessary, because at any call site, and in any context, I can always take a schema I already have and re-decorate it with MySchema.annotations({ description: '' }) to override descriptions or examples that don't make contextual sense in the place where I'm using the schema.

Sensible defaults

In some of the most popular cases of transforms, such as DateFromString or NumberFromString, it would be reasonable to expect Schema.NumberFromString.annotations({ examples: [3] }) to produce a JSON Schema with the example present.

There are certainly cases where this might make less semantic sense, for example Schema.Lowercase.annotations({ description: 'a lowercase string' }). However, I'm skeptical this concern is worth designing around, because:

  1. In most of the cases where this applies, the description is not wrong, it just doesn't cover all possible values that could be provided to the schema.
  2. This is primarily only a concern where the "description" field is used as a direct translation of what the literal runtime type of the data is, which is probably better suited as a code comment than a schema description.

On the second point above, consider the split transform: Schema.String.pipe(Schema.split(' ')).annotations({ description: 'a list of names' }). The conceptual contextual type of a "list" here is useful, but a description of "an array of strings" wouldn't make much sense, because that doesn't actually describe what the schema is anyway. It tells me the output type, but not its contextual use.

To the extent the above doesn't apply - which I suspect is the minority of cases - users can always revert to simply overriding the annotation at their use site, thanks to the immutability of schemas noted above. So they can explicitly opt out of this behavior with: MyDateSchema.annotations({ description: undefined /* or another description */ })

Preserving type safety

Another reason here would be to preserve type safety on annotations where this is relevant (such as examples). The schema itself knows how to convert the input and output types, and thus it is in the best position to actually manage that conversion for you. If we don't propagate examples through, the only way to provide examples at the use site of a transform schema would be to pass them through the jsonSchema annotation, which is untyped. This means we can't validate that an example actually conforms to the schema at compile time which creates an opportunity for examples to drift.
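The type-safety point can be illustrated with a small sketch. The `MiniSchema` type below is a hypothetical stand-in, not the library's schema type: when `examples` is typed against the schema's decoded type `A`, the compiler catches drift; a raw `jsonSchema` annotation object cannot.

```typescript
// Hypothetical minimal schema type, parameterized by its decoded type A.
interface MiniSchema<A> {
  annotations: {
    examples?: ReadonlyArray<A>
    jsonSchema?: Record<string, unknown>
  }
}

// Typed: the compiler checks the examples against A (here, Date).
const typedDate: MiniSchema<Date> = {
  annotations: { examples: [new Date(0)] } // OK: Date[]
  // annotations: { examples: ["2024-01-01"] } // would be a compile error
}

// Untyped escape hatch: nothing stops a stale or malformed example here.
const untypedDate: MiniSchema<Date> = {
  annotations: { jsonSchema: { type: "string", examples: [42] } } // compiles fine
}
```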

Preserving locality

The final reason for this proposal is to preserve the locality of descriptions and examples. Today, in order to provide a type safe example for a string-to-Date schema (without redeclaring the entirety of the jsonSchema), I would have to do so at schema creation time: const MyDateSchema = Schema.transform(Schema.String.annotations({ description: 'An ISO Date String' }), Schema.Date, ...) and I won't be able to easily modify that description at the point of use of this schema.

But this erodes the locality of my descriptions. Descriptions and examples are inherently contextual. At the time of creating that transform schema, all I know is that it represents some string that can become a date. But at a particular use site, I might have a better description, such as "the date at which this user was created." I could do that with Schema.propertySignature(MyDateSchema).annotations({ ... }) but, as mentioned above, this doesn't work in all sites where I might want to use the schema (it only works on struct fields) and doesn't feel any more descriptive than just doing MyDateSchema.annotations({ }) anyway.

Concerns

The above proposal is not without its concerns. Specifically:

Transforming examples

In order to allow type-safe examples to be declared on transform schemas, we'd need to allow the user to provide the example in the runtime type of the schema. This means we'd have to be able to convert that to the input type for the same schema. The good news is that the encode function exists on the transform schema, so we know how to make that conversion already. The bad news is that encode function could be an effect with dependencies, or could have implicit dependencies that may need to be initialized before it would run correctly.

For this, I would propose two options: 1) JSONSchema.make is aware of the context of the schema and, assuming there is a required context, expects you to provide it in order to create the schema (so that it can safely call your encode functions), or 2) we make this an option, and require explicit opt-in with a context provided where necessary.
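The encode-based conversion this relies on can be sketched like so. `MiniTransform` and `encodedExamples` are simplified, dependency-free stand-ins for a transform's encode step (real encode functions in the library may be effectful, which is exactly the complication discussed above):

```typescript
// A minimal transform description: decoded type A, encoded type I.
interface MiniTransform<A, I> {
  encode: (a: A) => I
  examples?: ReadonlyArray<A> // declared in the runtime (decoded) type
}

// Convert runtime-typed examples to the input type for the JSON Schema.
function encodedExamples<A, I>(t: MiniTransform<A, I>): ReadonlyArray<I> {
  return (t.examples ?? []).map((a) => t.encode(a))
}

// A DateFromString-like transform: Date on the decoded side, ISO string encoded.
const dateTransform: MiniTransform<Date, string> = {
  encode: (d) => d.toISOString(),
  examples: [new Date("2024-01-02T03:04:05.000Z")]
}

// → ["2024-01-02T03:04:05.000Z"]
const jsonSchemaExamples = encodedExamples(dateTransform)
```

In the pure, dependency-free case this is all that is needed; the open question is what to do when encode is an Effect with a required context.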

Which annotations to apply

Much of this discussion has focused on examples with a touch on description in a few places, but it's not clear the same rules should apply to all scenarios. My instinct is that all annotations should get the same treatment, but that's another point of contention.

Note

The above also applies to refinements, which raise an error if there is no jsonSchema defined on them. I would propose the same behavior applies to them. Such that:

Today: JSONSchema.make(Schema.String.pipe(Schema.filter((input) => input.startsWith('X')))) // throws error

After Change: JSONSchema.make(Schema.String.pipe(Schema.filter((input) => input.startsWith('X')))) // { type: 'string' }
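The proposed fallback for refinements can be sketched with plain objects. `RefinementNode` and `refinementToJsonSchema` are hypothetical shapes for illustration, not the library's AST: if a refinement carries no jsonSchema annotation, emit its base schema instead of throwing.

```typescript
// Hypothetical model of a refinement for illustration purposes only.
interface RefinementNode {
  base: Record<string, unknown> // JSON Schema of the schema being refined
  jsonSchema?: Record<string, unknown> // explicit override, if provided
}

// Current behavior would throw when no jsonSchema annotation exists;
// the proposed behavior falls back to the base schema instead.
function refinementToJsonSchema(node: RefinementNode): Record<string, unknown> {
  return node.jsonSchema ?? node.base
}

// A filter like startsWith('X') has no JSON Schema equivalent declared:
const startsWithX: RefinementNode = { base: { type: "string" } }

// → { type: "string" } instead of an error
const fallback = refinementToJsonSchema(startsWithX)
```

The trade-off, raised later in the thread, is that the fallback silently produces a schema that is less precise than the refinement it came from.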

gcanti commented 4 months ago

I've thought about it based on some principles and my opinion is that the current behavior is correct:

Principles

1) Transformations do not have any equivalent in JSON Schema, so it doesn't make sense to derive a JSON Schema from a schema that is a transformation. Instead, it should be derived from its from part (or its to part, although this is generally less useful).

2) Annotations attached to a schema should describe the characteristics or configurations of that schema, not how it is used in context. For contextual information, there are Schema.propertySignature (for structs) and Schema.element (for tuples).

```typescript
import { Schema } from "@effect/schema"

const Millis = Schema.Int.annotations({ identifier: "Millis" })

const badPractice = Schema.Struct({
  createdAt: Millis.annotations({ description: "the date at which this object was created" }),
  updatedAt: Millis.annotations({ description: "the date at which this object was last updated" })
})

const goodPractice = Schema.Struct({
  createdAt: Schema.propertySignature(Millis).annotations({
    description: "the date at which this object was created"
  }),
  updatedAt: Schema.propertySignature(Millis).annotations({
    description: "the date at which this object was last updated"
  })
})
```

Consequently:

I can always take a schema I already have, and re-decorate it MySchema.annotations({ description: '' }) to override descriptions or examples that don't make contextual sense in the place I'm using the schema now

This should be avoided (principle 2)

it would be reasonable to expect Schema.NumberFromString.annotations({ examples: [3] }) to produce a JSON Schema with the example present

I don't think that's a realistic expectation. The produced JSON Schema will pertain to a string, while the example is numeric.

Producing an example for the JSON Schema starting from an example set at the end of a transformation chain and using the encoding functions would be theoretically possible, but the production process would need to account for the fact that the encoding functions may fail, or may be effectful and require dependencies.

It needs to be evaluated whether, given these considerations, it is worth doing it. Moreover, it is not certain that a user would appreciate this automatic behavior, therefore, if implemented, it should probably be an opt-in behavior.

Type safety is not a concern if the example annotations are added where it is already natural to do so, namely, to the relevant schema.

In most of the cases where this applies, the description is not wrong, it just doesn't cover all possible values that could be provided to the schema.

Actually, it is wrong. If we produced a JSON Schema from Schema.Lowercase.annotations({ description: "a lowercase string" }) like this:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "string",
  "description": "a lowercase string",
  "title": "string"
}
```

it would be an error.

I could do that with Schema.propertySignature(MyDateSchema).annotations({ ... }) but, as mentioned above, this doesn't work in all sites where I might want to use the schema (it only works on struct fields)

What other use cases exist for contextual information that are not within a struct or a tuple? If there are valid cases, the problem should be addressed in that context.

The above also applies to refinements, which raise an error if there is no jsonSchema defined on them. I would propose the same behavior applies to them

I think it's a good thing that it results in an error, as it gives the user the opportunity to correct the issue and make the output more precise.

colinmorelli commented 4 months ago

I don't think this behavior is "incorrect" (more on this at the end). But from a practical standpoint, I think the current approach will lead users of the library to do more work rather than having a sensible default that reflects real-world usage.

Transformations do not have any equivalent in JSON Schema, so it doesn't make sense to derive a JSON Schema from a schema that is a transformation. Instead, it should be derived from its from part (or its to part, although this is generally less useful).

Perhaps I already "disagree" with this principle. You're right, JSON Schema does not have a notion of transformations. But:

  1. One reason one would use a library like Schema, and its transforms, is because JSON itself does not allow one to encode complex type information. Presumably, if they're using JSONSchema.make then they may fall into this bucket. Which leads to:
  2. Transformations inherently have some type constraints attached to them, meaning the very act of transforming from one input to another necessarily expects input of a certain shape.

I do understand that you can apply all of those constraints on the from schema in most cases, but not all. There are cases in which the validation of the source data can't be done until the process of transforming. And, in those cases, it's completely reasonable to want to document, describe, or provide examples. One such example that recently came up in Discord:

```typescript
const UserFromId = Schema.transform(
  Schema.UUID,
  UserSchema,
  {
    decode: (id) => db.lookup(id),
    encode: (entity) => entity.id
  }
);
```

In the above schema, one could annotate the Schema.UUID with a description that says "a user identifier" but per your second principle, we shouldn't do this (that's a contextual use of the schema, not a property of a UUID itself), and so it'd be better to wrap in propertySignature, but I can't because this is a transform.

It also wouldn't make sense to create a "derived" UserUUID schema because there may not be any additional validation I can perform besides confirming it's a UUID. The next validation would simply be looking it up to see if it's valid.

In this case, it's not obvious to me where I should annotate the description that this represents a user identifier, other than redeclaring the jsonSchema on the transform, which would require me to redeclare most of the jsonSchema of UUID.

Annotations attached to a schema should describe the characteristics or configurations of that schema, not how it is used in context. For contextual information, there are Schema.propertySignature (for structs) and Schema.element (for tuples).

This seems an important one that I did not understand was a design goal of the library, but candidly to me doesn't make a ton of sense. I know you mention below in your response that one shouldn't create a "copy" of a schema by calling .annotations() and overriding a description but I suppose my question is: why? In the good practice vs bad practice example you mentioned, I'd truthfully much rather do the bad practice. It's both easier to read and doesn't give up any specificity. It's still quite clear a derived schema is being created with a new description.

To me, providing characteristics or configurations of the schema feels like a perfect use case for a code comment, which already exists. The descriptions feel vastly more useful for the consumers of such a schema, who would want access to that description at runtime. Not someone with source-level access to the schema.

However, I recognize the above is an opinion and I don't have an argument besides "this makes more sense to me" and "it certainly feels a lot easier to use"


This should be avoided (principle 2)

See point above, but I recognize this is an opinion. This is your library, and if you feel this principle is important, I'll let this rest.

I don't think that's a realistic expectation. The produced JSON Schema will pertain to a string, while the example is numeric.

I mentioned later in my proposal that we can use the Transform's encode method to transform the example to a string. I've already had to tell the schema how to perform that transformation. So for me to have to perform it separately when providing an example feels duplicative.

Now, you rightfully point out the process could fail. That's fair. I'd wager a guess that we're optimizing for the vanishingly small number of scenarios in which 1) JSONSchema is being used, and 2) the encode function is not a simple property access or dependency free transformation of the runtime value. However, even to the extent it is the case, it doesn't have to be used. If it doesn't work for your use case, redeclare the jsonSchema annotation manually.

The current approach, though, doesn't actually allow me to declare an example at all without eschewing type safety or redeclaring the entire jsonSchema. That feels like an unintended consequence.

What other use cases exist for contextual information that are not within a struct or a tuple? If there are valid cases, the problem should be addressed in that context.

I wasn't aware of Schema.element, which does help. But in general my answer would be: there are transforms one could apply after a contextually described schema that don't change its contextual description. A simple example would be trimming a string: MyContextualSchema.pipe(Schema.trim()) often (if not always) doesn't change the description I'd want to appear, but I couldn't apply this to a schema that has already been wrapped in propertySignature or element


tl;dr It seems like we philosophically disagree mostly on the goal of "correctness." From my perspective, "correct" is a very difficult thing to define without introducing opinions anyway, and even if one can define it, it is not always practically useful. Personally, I don't agree with how you defined correctness above, but I understand how you got there and think it's also a reasonable definition of correctness in this context.

I would personally be in favor of at least making the behavior I described above optional. Without it, personally, I'd rather maintain a separate JSONSchema conversion function that follows (roughly) the above than have to leverage propertySignature and element to provide contextual descriptions, and/or to redeclare jsonSchemas in totality.

gcanti commented 4 months ago

Maybe we should first agree on what we call "contextual".

In the above schema, one could annotate the Schema.UUID with a description that says "a user identifier" but per your second principle, we shouldn't do this (that's a contextual use of the schema, not a property of a UUID itself).

You can do it because it's not "contextual" information. The fact that the schema represents "a user identifier" in your domain model is a fact that doesn't depend on the context where that schema is used. While a description such as "the date at which this object was created" doesn't make sense on a schema itself, it does make sense in the context of a property signature. Therefore, such a description should be placed on the property signature.

To be clear, by "contextual" I mean schemas within schemas, particularly fields within structs and elements within tuples.

It also wouldn't make sense to create a "derived" UserUUID schema because there may not be any additional validation I can perform besides confirming it's a UUID. The next validation would simply be looking it up to see if it's valid.

I think it would still make sense, even without validations, because you could design your domain model to have two schemas representing a user ID: one unvalidated (a simple UUID) and one validated (a branded UUID).
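The two-schema design described here can be sketched with a plain TypeScript brand (a simplified illustration; @effect/schema has its own Schema.brand combinator for this, and the validateUserId lookup below is a hypothetical stand-in for the thread's db.lookup):

```typescript
// Unvalidated user ID: just a UUID-shaped string.
type Uuid = string

// Validated user ID: branded at the type level only (no runtime cost).
type UserId = Uuid & { readonly __brand: "UserId" }

// Hypothetical lookup: only IDs present in the store earn the branded type.
function validateUserId(id: Uuid, store: ReadonlySet<string>): UserId | null {
  return store.has(id) ? (id as UserId) : null
}

const knownUsers: ReadonlySet<string> = new Set([
  "123e4567-e89b-12d3-a456-426614174000"
])

const valid = validateUserId("123e4567-e89b-12d3-a456-426614174000", knownUsers)
const invalid = validateUserId("00000000-0000-0000-0000-000000000000", knownUsers)
```

Downstream code can then require UserId in its signatures, so an unvalidated Uuid cannot be passed in by accident even though both are strings at runtime.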

I know you mention below in your response that one shouldn't create a "copy" of a schema by calling .annotations() and overriding a description but I suppose my question is: why?

What I mean is that you should avoid doing that if you are putting a description that doesn't override the meaning of that schema but instead refers to another one (specifically its from part if the schema is a transformation, as per your examples).

In the good practice vs bad practice example you mentioned, I'd truthfully much rather do the bad practice.

Of course, you can place annotations as you like and where you like. However, the point is that there is a contract between annotations and compilers like Pretty, Arbitrary, JSON Schema, etc. Compilers need to know where to find the relevant annotations. In this sense, the basic principle that annotations should concern the characteristics and configurations of the same schema they are attached to becomes important as a starting point.

To me, providing characteristics or configurations of the schema feels like a perfect use case for a code comment, which already exists. The descriptions feel vastly more useful for the consumers of such a schema who would want access to that description at runtime. Not someone with source-level access to the schema.

I’m not sure I understand your point. What do code comments have to do with this?

The current approach, though, doesn't actually allow me to declare an example at all without eschewing type safety or redeclaring the entire JSON Schema. That feels like an unintended consequence.

In my opinion, since we have generally expressed our views, it would be much better and faster at this point to consider a wide range of concrete cases with realistic descriptions and annotations to examine them.

Without it, personally, I'd rather maintain a separate JSONSchema conversion function.

That might not be necessary, but this is totally fine if we don't end up with a shared solution. One of the main goals of /schema, due to its introspectability, is to empower users to create their own schema consumers. The JSON Schema module is just one of them, aiming to be as useful as possible in the majority of cases. However, a custom compiler can sometimes be an excellent solution.