Skip to the bottom for the recommendation.
Use a TypeScript-to-JSON-Schema codegen to create API contracts, then use a JSON Schema validator at run-time.
In practice, this would involve using typescript-json-schema to convert the TS definitions to JSON schemas. From there, ajv can be used to create run-time validators.
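As a rough sketch of that workflow, assuming typescript-json-schema's programmatic API (the file path and the PromptSpec type are placeholders, not Orbit's actual definitions):

```ts
import { resolve } from "path";
import * as TJS from "typescript-json-schema";

// Build a TS program over the file(s) containing the contract types.
const program = TJS.getProgramFromFiles([resolve("src/orbitAPI.ts")], {
  strictNullChecks: true,
});

// Emit a JSON schema for one exported type; `required: true` marks
// non-optional properties as required in the generated schema.
const schema = TJS.generateSchema(program, "PromptSpec", { required: true });

console.log(JSON.stringify(schema, null, 2));
```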
Pros
Cons
Convert all the TS definitions to JSON schema files, and use that as the source of truth instead. These files will be used to generate both validators and TS types.
Pros
- ajv, which autogenerates validators.

Cons
Why JSON-schema over something like OpenAPI/Swagger?
From what I found, there seems to be stronger tooling around JSON schema. We could maybe use express-openapi-validator to validate the request and response bodies and openapi-typescript-codegen to generate TS definitions. However, there seem to be few other alternatives, and it is unclear to me right now if we could "hack" express-openapi-validator into working with cloud functions (would not handle client-side validation either).
I would recommend Solution 3. The existing TS definitions would be converted to JSON schemas. Then ajv would be used to load them and produce run-time validators, and json-schema-to-typescript would be used to generate TS types. In theory, it looks like ajv could be used to infer TS types directly; however, I could not get the code sample to run, and there seem to be some limitations of that approach (e.g. Boolean is not currently supported).
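As a minimal sketch of what Solution 3 might look like in practice (the schema path and output file are hypothetical):

```ts
import { readFileSync, writeFileSync } from "fs";
import Ajv from "ajv";
import { compileFromFile } from "json-schema-to-typescript";

// Hypothetical schema file acting as the source of truth.
const schemaPath = "schemas/prompt.schema.json";

// Codegen step: emit TS types from the schema.
compileFromFile(schemaPath).then((ts) => writeFileSync("src/prompt.d.ts", ts));

// Run-time step: compile a validator from the same schema.
const validate = new Ajv().compile(JSON.parse(readFileSync(schemaPath, "utf8")));

if (!validate({ question: "2+2?" })) {
  console.error(validate.errors);
}
```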
Let me know what you think, and if you believe there are other directions I should consider.
Thanks for this thoughtful write-up, @kirkbyo!
One key thing I notice reading this is that we really have two layers of desired types and validation. We have some core data types—things like ActionLog and Prompt, which we want to validate both in the context of API requests/responses, and also when working with local serializations. And then we also have types representing specific APIs' request/response data formats, which we'd want to make use of those "core" data types.
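To make the two layers concrete, here's a rough sketch of the distinction; these shapes are hypothetical stand-ins rather than Orbit's actual definitions:

```ts
// Core data types: validated both for API traffic and for local serializations.
interface Prompt {
  question: string;
  answer: string;
}

interface ActionLog {
  timestampMillis: number;
  taskID: string;
}

// API-layer types: describe one endpoint's request/response shape and are
// composed out of the core types above.
interface StorePromptsRequest {
  prompts: Prompt[];
}

interface StorePromptsResponse {
  status: "success" | "error";
  storedActionLogs: ActionLog[];
}
```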
This observation helps me understand why centering on solutions like OpenAPI is probably not quite right: that ecosystem is really focused on the request/response validation context, rather than validating arbitrary data types.
Then I considered whether it made more sense to generate JSON from TS or TS from JSON. I had been biased towards the former, but I notice now that these types don't change that frequently. In fact, the base types often can't change without breaking our hashing scheme.
Given that, and given the flexibility of having some universally-parsable format as the source of truth, I'm biased in that direction. We don't have that many of these types; creating schemas from them shouldn't be that bad.
I like that JTD (JSON Type Definition) has a formal spec, though it concerns me a bit that it's quite new and not yet "ratified" in any meaningful way. But the fact that we don't have that many of these types makes me less concerned. If we need to move to some other inert schema format later, it won't be the end of the world.
Orbit's data format involves a hash graph, which depends on stable hashing semantics. This can be tricky in JavaScript, since object types are "open". Any random function can just decide to add arbitrary keys to a prompt object, and then, when we hash it, we end up including the unintentional extra keys and generating a different hash from the one which would be generated without the interference. That's no good!
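A toy illustration of that failure mode (this uses JSON.stringify plus SHA-256, not Orbit's actual hashing scheme):

```ts
import { createHash } from "crypto";

const hashJSON = (value: unknown): string =>
  createHash("sha256").update(JSON.stringify(value)).digest("hex");

const prompt = { question: "2+2?", answer: "4" };

// Some unrelated code decorates the object with an extra key…
const decorated = { ...prompt, _lastRenderedAt: 1234567890 };

// …and the hash silently changes, even though the "real" content didn't.
console.log(hashJSON(prompt) === hashJSON(decorated)); // false
```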
So core includes some "canonicalization" routines, which are quite related to validators. Given an object, they return a new object which includes only the "officially endorsed" keys (recursively).
We'll want our schema/validation strategy to accommodate this, too, to avoid duplicating the specification. I know that io-ts can do this. It looks like we can also do this with ajv by validating with the removeAdditional option enabled.
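For example, something along these lines; the schema is a stand-in rather than Orbit's real Prompt definition, and note that ajv strips the extra keys by mutating the object in place rather than returning a new one:

```ts
import Ajv from "ajv";

// "all" removes any property not listed under `properties`, which is
// essentially the canonicalization behavior described above.
const ajv = new Ajv({ removeAdditional: "all" });

const promptSchema = {
  type: "object",
  properties: {
    question: { type: "string" },
    answer: { type: "string" },
  },
  required: ["question", "answer"],
};

const canonicalizePrompt = ajv.compile(promptSchema);

const candidate = { question: "2+2?", answer: "4", _scratch: true };
canonicalizePrompt(candidate); // mutates `candidate`, dropping `_scratch`
console.log(candidate); // { question: "2+2?", answer: "4" }
```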
If we use ajv's compiler (rather than loading schemas at runtime), these solutions should have essentially the same performance characteristics.
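If by "compiler" we mean ajv's standalone code generation (available in ajv v7+), a build-time sketch might look like this, with a hypothetical inline schema and output path:

```ts
import { writeFileSync } from "fs";
import Ajv from "ajv";
import standaloneCode from "ajv/dist/standalone";

// `code.source` keeps the generated validator source around for export.
const ajv = new Ajv({ code: { source: true } });

const validate = ajv.compile({
  type: "object",
  properties: { taskID: { type: "string" } },
  required: ["taskID"],
});

// Emit a plain JS module at build time, so no schema compilation happens at runtime.
writeFileSync("build/validateActionLog.js", standaloneCode(ajv, validate));
```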
I'm in favor of going ahead with the third solution! How are you thinking about breaking down the work?
Gotcha, thanks for shining some light on the additional scope of this problem. I was overlooking the canonicalization routines. In that case, I am currently thinking of breaking down the tasks into the following PRs:

1. The api types consume a lot of the core types, so this will be a little tricky. For this first PR, the API surface will be fully defined using JSON schemas; however, the definitions of the core types will be duplicated. I won't convert the core types to JSON schemas just yet, to limit the blast radius. From here, I will set up ajv and json-schema-to-typescript and add a GitHub Action to ensure that the codegen has run before any merges. I hope this will expose any initial "gotchas" and allow us to pivot before investing too much.
2. Convert the core types to JSON schemas instead and replace the canonicalization with ajv. I will remove the duplication introduced in the first PR, and instead reference the core types from the API schemas.

Let me know if you have any other ideas or reservations about this approach!
Got it, sounds reasonable!
Regarding the local serialization: I mean loading from a local LevelDB database (a few implementations in app). But actually, it probably makes sense to hold off on that part, since I'm going to be substantially rearchitecting it in the coming weeks (or so I keep saying!).
The last PR mentioned above marks the completion of this task. I will go ahead and close the issue for now, and we can either reopen it or create a new one once the data migration is completed.
In case anyone stumbles upon this issue in the future, I will leave a few notes on the implementation since we ran into a few hiccups while implementing Solution 3 outlined above.
Problems with Solution 3
When using json-schema-to-typescript, there seems to be no way to avoid code duplication across packages. For example, say the core package defined its primitive types as a JSON schema and then generated TS code from it. The problem arises when the api package references these schemas: when the TS codegen goes to resolve the reference, instead of importing the type from the other package, it'll just generate the code again. This leaves the types defined in both places.
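As a concrete illustration (the paths are made up for this example):

```ts
import { compileFromFile } from "json-schema-to-typescript";

// Suppose packages/core/schemas/prompt.schema.json defines Prompt, and the api
// package's schema references it via
//   { "$ref": "../../core/schemas/prompt.schema.json" }.
compileFromFile("packages/api/schemas/storePrompts.schema.json").then((output) => {
  // The generated TypeScript re-declares the Prompt interface inline here,
  // rather than importing Prompt from the core package, so the same type ends
  // up defined in both packages.
  console.log(output);
});
```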
Protobufs as an alternative
Instead, we looked into using Protocol Buffers, since the ecosystem seemed to be very mature and there is a good amount of support in the JS community. However, the proto language introduces two main limitations that make things less than ideal for our usage: lack of required fields and no support for union types (with good reason). This makes Protobufs ill-suited for supporting our existing API surface without refactoring. That being said, I really enjoyed working with the proto files compared to the JSON schemas.
Solution 2 instead
Ultimately, Solution 2 was implemented. typescript-json-schema is used to generate JSON schemas from the spec declared within orbitAPI.ts, and ajv is used to validate the contracts at run time. Having typescript-json-schema generate the schemas works great in terms of developer experience, and using its annotation syntax we can specify additional validation criteria. To prevent the JSON schema from falling out of sync with any code changes, a stale-api-schema job was added. This ensures that the schema has no pending changes before merging.
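For reference, run-time validation against the generated schema looks roughly like the following; the schema path and definition name are placeholders rather than the actual layout:

```ts
import { readFileSync } from "fs";
import Ajv from "ajv";

// Placeholder path for the schema generated by typescript-json-schema.
const apiSchema = JSON.parse(readFileSync("orbitAPI.schema.json", "utf8"));

const ajv = new Ajv({ allowUnionTypes: true });
ajv.addSchema(apiSchema, "orbitAPI");

export function validateRequestBody(definitionName: string, body: unknown): void {
  // Look up a named definition inside the generated schema, e.g. "#/definitions/SomeRequest".
  const validate = ajv.getSchema(`orbitAPI#/definitions/${definitionName}`);
  if (!validate) {
    throw new Error(`No schema definition named ${definitionName}`);
  }
  if (!validate(body)) {
    throw new Error(ajv.errorsText(validate.errors));
  }
}
```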
I've played fast and loose with validation in Orbit so far, but particularly as we use on-device storage and open up a public API, we need to get serious about this. We need validators for all our core data types. When both client and server receive data from the network, they must validate that request/response. When reading data from disk, we should validate that data.
Ideally, our validation strategy is fairly automatic: either we're automatically generating code based on our TS types, or we're using some strategy to infer a validator at runtime (e.g. io-ts).
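For comparison, the io-ts style of inferring a validator at runtime looks roughly like this (the Prompt shape is illustrative only):

```ts
import * as t from "io-ts";
import { isRight } from "fp-ts/Either";

// Runtime codec; the static type is derived from it, so the two can't drift apart.
const Prompt = t.type({
  question: t.string,
  answer: t.string,
});
type Prompt = t.TypeOf<typeof Prompt>;

const decoded = Prompt.decode(JSON.parse('{"question": "2+2?", "answer": "4"}'));
if (isRight(decoded)) {
  const prompt: Prompt = decoded.right; // validated and correctly typed
  console.log(prompt.question);
} else {
  console.error("Validation failed", decoded.left);
}
```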
I'd love help evaluating the approaches we might take to this, making a call, implementing the infrastructure, and adopting it appropriately throughout the system.