colinhacks / zod

TypeScript-first schema validation with static type inference
https://zod.dev
MIT License
33.9k stars 1.18k forks source link

RFC: Faster unions (vs `z.switch()`) #3407

Open colinhacks opened 6 months ago

colinhacks commented 6 months ago

This is a followup discussion to https://github.com/colinhacks/zod/issues/2106

Okay, I'm glad I didn't rush a z.switch() implementation, because I now think there's a better way forward. This is definitely my own brilliant idea and not something @gcanti suggested in a Twitter DM.

It's pretty simple, we just...make z.union() better. Here's the case for sticking with plain z.union():

How? The idea is for Zod to do some "pre-computation" at the schema creation time to make parsing faster.

The API could also accept a "discriminator hint" to point the parser in the right direction.

z.union([ ... ], {
  discriminator: "someKey"
})

In retrospect, this is what the API always should have been. Discriminated unions are not a type unto themselves, just an optimization over unions, and the API should have reflected that.

igalklebanov commented 6 months ago

LGTM! πŸ’ͺ

All Zod schemas will implement a .getLiterals(): Set method that returns the set of known literal values that will pass validation. For something like z.string(), this will be undefined (there are infinitely many valid inputs). For something like a ZodEnum or ZodLiteral, this will be a finite set of values.

Wondering if .getLiterals() should be optional (it's only relevant for types that should return something meaningful and not undefined), or inherited from base (returning undefined) and only overriden when it matters.

Am I missing something?

P.S. abort early! πŸŽ‰

ssalbdivad commented 6 months ago

I think this makes a lot of sense! ArkType goes pretty far down the "let's create a runtime type system and handle discrimination robustly" rabbit hole, but I think using heuristics like checking only literals at the root of the object can get you most of the benefit and save a ton of complexity.

The other category of disjoints besides literals that might be worth considering is typeof, e.g. if a string, number and boolean are at the same path. This often occurs incidentally, so it's a nice way to discriminate more unions without users having to think about it.

If it's of any use, this is the algorithm AT uses to determine the "best" discriminator (or sequence of discriminators, if it requires multiple steps) for a union:

https://github.com/arktypeio/arktype/blob/d000f90e8950917db19f2af437eafb22e390485b/ark/type/types/discriminate.ts#L64

Happy to discuss in more detail any time if it would be useful! Definitely one of my favorite subjects if you don't mind hearing me nerd out a bit over it πŸ˜„

jrysana commented 6 months ago

Am I understanding this right? Before:

z.discriminatedUnion("kind", [
  z.object({ kind: z.literal("number"), amount: z.number() }),
  z.object({ kind: z.literal("boolean"), value: z.boolean() })
])

After:

z.union([
  z.object({ kind: z.literal("number"), amount: z.number() }),
  z.object({ kind: z.literal("boolean"), value: z.boolean() })
])

Optional:

z.union([
  z.object({ kind: z.literal("number"), amount: z.number() }),
  z.object({ kind: z.literal("boolean"), value: z.boolean() })
], {
  discriminator: "kind"
})

The "after" schema will initialize a bit slower(?) than it would today, but parse much faster than it would today and up to around as fast as the "before" schema does today? If so, I like it a lot, discriminatedUnion always felt a bit redundant to write out by comparison and the parse speed boost for union schemas that happen to be discriminated would be great.

colinhacks commented 6 months ago

Wondering if .getLiterals() should be optional

Yep this would be defined on ZodType with a default implementation, as you say. It would probably also be protected...not sure this is an internal we should expose. It's true that this method would apply to very few schema types.

If it's of any use, this is the algorithm AT uses to determine the "best" discriminator (or sequence of discriminators, if it requires multiple steps) for a union:

Thanks David I'll look into this! I haven't thought through how discriminator selection would work, so this is super useful.

The other category of disjoints besides literals that might be worth considering is typeof

I was thinking about this. Maybe a special symbol (z.STRING) to indicate that any string would pass.

The "after" schema will initialize a bit slower(?) than it would today, but parse much faster than it would today and up to around as fast as the "before" schema does today?

Precisely πŸ’―

rexfordessilfie commented 6 months ago

LGTM and thanks for the breakdown of this neat pre-computation strategy πŸš€ !

I have a few clarifications for better understanding:

When you create a z.union(), Zod uses getLiterals() to extract a set of "fast checks" for each union element in the form {[k: string]: Set

How would the fast-checks generalize for unions where the "discriminator" is a non-primitive e.g an object?

Here's an example building on top of @jrysana example but nesting kind one level deeper in an info object:

z.union([
  z.object({ info: z.object({ kind: z.literal("number") }), amount: z.number() }),
  z.object({ info: z.object({ kind: z.literal("boolean") }), value: z.boolean() })
])

This question is based on my understanding of k in {[k: string]: Set<Primitive>} as referring to a field of an object in the discriminated union - I may be misunderstanding.

Side note: this schema may not be the best, but it seems like something that could exist out in the wild.

Lastly if this has already been thought through or it'd be figured out at a later time, I'm happy to wait and see what it would look like! Maybe this is what you are referring to as discriminator selection by:

Thanks David I'll look into this! I haven't thought through how discriminator selection would work, so this is super useful.


The other category of disjoints besides literals that might be worth considering is typeof

I was thinking about this. Maybe a special symbol (z.STRING) to indicate that any string would pass.

This seems like a nice idea to use special symbols (like z.STRING), in order to avoid conflicts with (string) literals as typeof would evaluate to.


Also, PS. I am glad that z.union is going to be preferred over z.switch as code generation is much easier with the former!

s-cerevisiae commented 6 months ago

Edit: Ok now I see it will solve my problem automatically. Thanks again for your great work, looking forward to the implementation.


Nice to see this is going to be addressed. Thank you for your hard work and ideas.

One thing that I'd like to see being introduced to union is precise error messages. Currently when using union for discriminated unions (as it's required for recursively defined data), the parser tries each branch thus gives a big error message:

const u = z.union([
  z.object({ type: z.literal("a"), value: z.number() }),
  z.object({ type: z.literal("b"), value: z.string() }),
]);

console.log(JSON.stringify(u.safeParse({ type: "a", value: "b" })));
long error message given by z.union ```js { "success": false, "error": { "issues": [{ "code": "invalid_union", "unionErrors": [{ "issues": [{ "code": "invalid_type", "expected": "number", "received": "string", "path": ["value"], "message": "Expected number, received string", }], "name": "ZodError", }, { "issues": [{ "received": "a", "code": "invalid_literal", "expected": "b", "path": ["type"], "message": 'Invalid literal value, expected "b"', }], "name": "ZodError", }], "path": [], "message": "Invalid input", }], "name": "ZodError", }, } ```

which is hard to digress and turns into wall of text when shown to the users.

OTOH since discriminatedUnion has an explicit discriminator assigned, it can precisely give the actual branch where the error happens:

const du = z.discriminatedUnion("type", [
  z.object({ type: z.literal("a"), value: z.number() }),
  z.object({ type: z.literal("b"), value: z.string() }),
]);

console.log(JSON.stringify(du.safeParse({ type: "a", value: "b" })));
short and precise error given by z.discriminatedUnion ```js { "success": false, "error": { "issues": [{ "code": "invalid_type", "expected": "number", "received": "string", "path": ["value"], "message": "Expected number, received string", }], "name": "ZodError", }, } ```

Will the "discriminator hint" solve this problem? Thank you.

lo1tuma commented 6 months ago

One thing that I'd like to see being introduced to union is precise error messages.

I second that. I use discriminatedUnion quite extensively and thatβ€˜s not because of its performance advantages, the most important reason is the precise error messages.

So far I really like the proposal, but I hope it addresses also the error messages, otherwise it will be unusable for me.

wKovacs64 commented 6 months ago

Forgive my ignorance, but would this allow for nested discriminated unions unlike z.discriminatedUnion?

arthurvanl commented 6 months ago

I am using this discriminatedUnion but I require that all objects have multiple shared keys but by only 1 shared key you can know the difference between the unions.

example:


const myUnion = z.discriminatedUnion("status", [
  z.object({ status: z.literal("success"), data: z.string(), timestamp: z.date() }),
  z.object({ status: z.literal("failed"), error: z.instanceof(Error), timestamp: z.date() }),
]);

myUnion.parse({ status: "success", data: "yippie ki yay" });

This is maybe a dumb example with the timestamp but that's what I would need to do if I want multiple shared keys. I thought I could maybe use a merge or extend, except discriminated unions do not allow this.

EDIT

This can actually be done with .and()

arthurvanl commented 6 months ago

New problem:

The status value must be dynamically allowed to. This throws an error now because the value is of course not found in the union (other than success or failed). I would like that it will not throw an error and is parsed with the default properties that are added in the schema

qraynaud commented 6 months ago

Why not instead of getLiterals() just make a more simple fastCheck(val): true | false | "unavailable" (or an enum) method instead? By default it would return "unavailable" but for literal you could return val === this.value. And it would allow an easy implementation on objects too with something like:

fastCheck(val) {
  let res: true | false | "unavailable" = "unavailable";
  for (const attr of this.attributes) {
     const fastCheck = attr.fastCheck(val[attr]);
     if (fastCheck === "unavailable") continue;
     else if (!fastCheck) return false;
     res = true;
  }
  return res;
}

Then to collect all possibilities in an union you could just do something like:

const possibleSchemas = allSchemas.filter(s => s.fastCheck(val) !== false);

Maybe there is not much value to "unavailable" and it could even be merged with true.

arthurvanl commented 6 months ago

I made my schema like this now:

export const store_order_item_schema = z.object({
    session_token: z.string().length(128),
    product_type: z.string(),
    product_name: z.string().max(200),
    value: z.string().max(200),
    period: z.number().positive().default(1),
    quantity: z.number().positive().default(1),
}).and(z.discriminatedUnion("product_type", [
    z.object({
        product_type: z.literal('domain_extension'),
        extension: z.string().refine((v) => !v.startsWith('.'), 'Domain extension must not start with a dot'),
        status: z.enum(['free', 'active']),
        transfer_code: z.string().optional(),
        //! you can not order multiple domains with the same name
        quantity: z.literal(1),
        period: z.literal(12).or(z.literal(24)).or(z.literal(36)).or(z.literal(48)).or(z.literal(60))
    }),
    z.object({
        product_type: z.literal('pax8_license'),
        term: z.enum(['Annual', 'Monthly']),
        license_id: z.string().uuid(),
        commitment_term_id: z.string().uuid(),
    }),
    z.object({
        product_type: z.literal('hosting'),
        hostname: z.string()
    }),
    z.object({product_type: z.literal('other')})
]))

The only thing what is bothering me is that I cannot use other values in this schema I would like to have the props session_token, product_type, product_name, value, period, quantity defaulty be added to any product_type I set.

AoiYamada commented 5 months ago

How about multiple discriminators that cause nesting issue like https://github.com/colinhacks/zod/issues/1884?

Would it be possible to accept an array of discriminators? for example:

z.union([ ... ], {
  discriminator: ["someKey1", "someKey2"]
})
gtflip commented 4 months ago

Could getLiterals() also be used to fix intersection parsing as well? E.g., in #2195

I started working on a function that would split the input object into two objects: one to pass to the ZodObject parser with just the keys that are in that object and one with the rest of the keys to pass to the ZodRecord parser. I got pulled off on another project, but there are also issues like dealing with strict(), passthrough(), etc.

jedwards1211 commented 4 months ago

z.union should automatically precompute the discriminator property for common cases, just like TypeScript does under the hood. I did this for unions in my old typed-validators library. discriminatorHint isn't even necessary unless the user wants to speed up that precomputation a bit.

The logic is basically, for all of the object options, does there exist some property such that each object has a different literal value for that property.

If you want to get more powerful, the precomputation could even check each object has a type for that property that doesn't overlap with the others (e.g. type A = { type: 1, value: string } | { type: 2 | 3, value: number } | { type: string, value: boolean }). This could be done with an _overlaps(type): boolean | undefined method on ZodType, where z.string()._overlaps(z.literal('foo')) is true, z.string()._overlaps(z.number()) is false, z.literal('foo')._overlaps(z.union(['foo', 'bar'])) is true, etc. _overlaps would be allowed to bail out and return undefined for hard cases, in which discriminator determination would assume there might be overlap.

z.union could precompute a discriminator for the object options even if some options aren't objects. Then when parsing, if the input is an object, you get the option for the discriminator value (if any), otherwise you fall back to other strategies to decide between the non-object options.

kpervin commented 3 months ago

What about situations where you need something like this:

import { z } from "zod";

const fooSchema = z
  .object({
    type: z.literal("foo"),
  })
  .and(
    z.discriminatedUnion("reqString", [
      z.object({
        reqString: z.literal(true),
        someString: z.string().min(1).max(10),
      }),
      z.object({
        reqString: z.literal(false),
        someString: z.string().optional(),
      }),
    ]),
  );

const barSchema = z.object({
  type: z.literal("bar"),
  barString: z.string(),
});

const bazSchema = z.object({
  type: z.literal("baz"),
});

const schema = z.discriminatedUnion("type", [fooSchema, barSchema, bazSchema]);

type Schema = z.infer<typeof schema>;

Currently it yields the following error:

TS2740: Type
ZodIntersection<ZodObject<{   type: ZodLiteral<"foo">; }, "strip", ZodTypeAny, {   type: "foo"; }, {   type: "foo"; }>, ZodDiscriminatedUnion<"reqString", [ZodObject<{   reqString: ZodLiteral<true>;   someString: ZodString; }, "strip", ZodTypeAny, {   reqString: true;   someString: string; }, {   reqString: true;   someString: string; }>, ZodObject<{   reqString: ZodLiteral<false>;   someString: ZodOptional<ZodString>; }, "strip", ZodTypeAny, {   reqString: false;   someString?: string | undefined; }, {   reqString: false;   someString?: string | undefined; }>]>>

is missing the following properties from type

ZodObject<{   type: ZodTypeAny; } & ZodRawShape, UnknownKeysParam, ZodTypeAny, {   [x: string]: any;   type?: unknown; }, {   [x: string]: any;   type?: unknown; }>
: _cached, _getCached, shape, strict, and 14 more.
qw-in commented 3 months ago

$0.02: name the option discriminatorHint (or maybe hint: { discriminator }) assuming it is only for perf and doesn't change the behaviour

// With `discriminator` I'd expect this to warn me I've done something dumb somehow or be a bug
// With `discriminatorHint` I understand maybe it's a perf boost to look at the `type` key first, even if it only gets close
z.union([
  z.object({ type: z.literal(1) }),
  z.object({ type: z.literal(2) }),
  z.object({ type: z.literal(2), data: z.string() }), // huh
  // ... probably more idk
], { discriminator: 'type' })
AlbertMarashi commented 1 month ago

@colinhacks is there any news on this? I can't continue dealing with the absolutely huge errors that arise from using unions. What was wrong with .switch?

gshpychka commented 1 month ago

What was wrong with .switch?

The issue description has some answers

toteto commented 1 week ago

Glad that the unions are getting an improvements. Got into this rabbit hole and ended in this issue after trying to optimize my complex schema.

I have complex schema that can have the "discriminator" be string literal or a (first) item in an array. The current approach with discriminatedUnion doesn't work, because it only allows static literal values for discriminator.

I want to propose that the discriminator property of the new union implementation be literal OR schema OR custom function.