andymatuschak / orbit

Experimental spaced repetition platform for exploring ideas in memory augmentation and programmable attention
https://withorbit.com

Project: add validation for core data types (API, storage) #184

Closed andymatuschak closed 3 years ago

andymatuschak commented 3 years ago

I've played fast and loose with validation in Orbit so far, but particularly as we adopt on-device storage and open up a public API, we need to get serious about this. We need validators for all our core data types. When the client or server receives data from the network, it must validate that request/response. When reading data from disk, we should validate that data.

Ideally, our validation strategy is fairly automatic: either we're automatically generating code based on our TS types, or we're using some strategy to infer a validator at runtime (e.g. io-ts).
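For a sense of the second approach, here's a minimal sketch of the io-ts style; the Prompt shape is illustrative, not Orbit's actual type:

```ts
// Sketch of the io-ts style: the validator is the source of truth, and the
// static TS type is inferred from it. The Prompt shape is illustrative.
import * as t from "io-ts";
import { isRight } from "fp-ts/Either";

const Prompt = t.type({
  content: t.string,
});

// Compile-time type inferred from the runtime validator:
type Prompt = t.TypeOf<typeof Prompt>;

// decode returns an Either: Right(value) on success, Left(errors) on failure.
const result = Prompt.decode(JSON.parse('{"content": "2 + 2 = ?"}'));
console.log(isRight(result) ? "valid" : "invalid");
```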

I'd love help evaluating the approaches we might take to this, making a call, implementing the infrastructure, and adopting it appropriately throughout the system.

kirkbyo commented 3 years ago

Skip to the bottom for the recommendation.

Solution 1

Solution 2

Keep the TS definitions as the source of truth and generate JSON schemas from them (e.g. with typescript-json-schema).

Solution 3

Convert all the TS definitions to JSON schema files, and use that as the source of truth instead. These files will be used to generate both validators and TS types.


Why JSON Schema over something like OpenAPI/Swagger? From what I found, there seems to be stronger tooling around JSON Schema. We could maybe use express-openapi-validator to validate the request and response bodies and openapi-typescript-codegen to generate TS definitions. However, there seem to be few other alternatives, and it is unclear to me right now whether we could "hack" express-openapi-validator into working with cloud functions (and it would not handle client-side validation either).

Recommendation

I would recommend Solution 3. The existing TS definitions would be converted to JSON schemas. Then ajv would be used to load them and produce run-time validators, and json-schema-to-typescript would be used to generate TS types. In theory, it looks like ajv could be used to infer TS types directly. However, I could not get the code sample to run, and there seem to be some limitations of this system (e.g. Boolean is not supported currently).
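A rough sketch of the runtime half of this pipeline, with an illustrative schema rather than Orbit's real definitions:

```ts
// Sketch: compiling a JSON schema into a runtime validator with ajv.
// The schema here is hand-written inline for illustration; under Solution 3
// it would live in a .json file and also feed json-schema-to-typescript.
import Ajv, { JSONSchemaType } from "ajv";

type Prompt = {
  content: string;
};

const promptSchema: JSONSchemaType<Prompt> = {
  type: "object",
  properties: {
    content: { type: "string" },
  },
  required: ["content"],
  additionalProperties: false,
};

const ajv = new Ajv();
const validatePrompt = ajv.compile(promptSchema);

const data: unknown = JSON.parse('{"content": "2 + 2 = ?"}');
if (validatePrompt(data)) {
  // ajv's type guard narrows data to Prompt here
  console.log(data.content);
} else {
  console.error(validatePrompt.errors);
}
```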

Let me know what you think, and if you believe there are other directions I should consider.

andymatuschak commented 3 years ago

Thanks for this thoughtful write-up, @kirkbyo!

We'd like validation at a lower layer than the HTTP API layer

One key thing I notice reading this is that we really have two layers of desired types and validation. We have some core data types—things like ActionLog and Prompt—which we want to validate both in the context of API requests/responses and when working with local serializations. And then we also have types representing specific APIs' request/response data formats, which should build on those "core" data types.

This observation helps me understand why centering on solutions like OpenAPI is probably not quite right: that ecosystem is really focused on the request/response validation context, rather than validating arbitrary data types.
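To make that layering concrete, here's a sketch of how API-level schemas could reference the core schemas instead of restating them; the schema IDs, paths, and field names are hypothetical:

```ts
// Sketch: an API-layer schema reusing a core type via $ref, so the
// request/response layer builds on the core definitions rather than
// duplicating them.
import Ajv from "ajv";

const coreSchema = {
  $id: "core.json",
  definitions: {
    Prompt: {
      type: "object",
      properties: { content: { type: "string" } },
      required: ["content"],
      additionalProperties: false,
    },
  },
};

const ingestRequestSchema = {
  $id: "ingest-request.json",
  type: "object",
  properties: {
    prompt: { $ref: "core.json#/definitions/Prompt" },
    timestampMillis: { type: "number" },
  },
  required: ["prompt", "timestampMillis"],
  additionalProperties: false,
};

const ajv = new Ajv();
ajv.addSchema(coreSchema); // registered once, referenced from the API layer
const validateIngestRequest = ajv.compile(ingestRequestSchema);
```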

These types don't change that frequently

Then I considered whether it made more sense to generate JSON from TS or TS from JSON. I had been biased towards the former, but I notice now that these types don't change that frequently. In fact, the base types often can't change without breaking our hashing scheme.

Given that, and given the flexibility of having some universally-parsable format as the source of truth, I'm biased in that direction. We don't have that many of these types; creating schemas from them shouldn't be that bad.

I like that JTD (JSON Type Definition) has a formal spec, though it concerns me a bit that it's quite new and not yet "ratified" in any meaningful way. But the fact that we don't have that many of these types makes me less concerned. If we need to move to some other inert schema format later, it won't be the end of the world.

Besides validation, we need canonicalization

Orbit's data format involves a hash graph, which depends on stable hashing semantics. This can be tricky in JavaScript, since object types are "open": any random function can decide to add arbitrary keys to a prompt object, and when we hash it, those unintentional extra keys produce a different hash from the one we'd get without the interference. That's no good!
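A minimal illustration of the failure mode, with stringify-then-hash standing in for the real hashing scheme (which isn't shown here):

```ts
// Sketch: an extra key silently changes the hash of the "same" prompt.
import { createHash } from "crypto";

const hash = (value: unknown) =>
  createHash("sha256").update(JSON.stringify(value)).digest("hex");

const prompt = { content: "2 + 2 = ?" };
const tampered = { ...prompt, _uiCache: true }; // some function added a stray key

console.log(hash(prompt) === hash(tampered)); // false
```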

So core includes some "canonicalization" routines, which are quite related to validators. Given an object, they return a new object which includes only the "officially endorsed" keys (recursively).

We'll want our schema/validation strategy to accommodate this, too, to avoid duplicating the specification. I know that io-ts can do this. It looks like we can also do this with ajv by validating with the removeAdditional option enabled.
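A sketch of that ajv approach, with an illustrative schema:

```ts
// Sketch: ajv with removeAdditional strips unknown keys during validation,
// which doubles as a canonicalization pass before hashing.
import Ajv from "ajv";

const ajv = new Ajv({ removeAdditional: "all" });

const validatePrompt = ajv.compile({
  type: "object",
  properties: {
    content: { type: "string" },
  },
  required: ["content"],
});

const prompt: Record<string, unknown> = {
  content: "2 + 2 = ?",
  _uiCache: true,
};
validatePrompt(prompt); // mutates prompt in place, removing _uiCache
console.log(prompt); // { content: "2 + 2 = ?" }
```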

Performance considerations

If we use ajv's compiler (rather than loading schemas at runtime), these solutions should have essentially the same performance characteristics.


I'm in favor of going ahead with the third solution! How are you thinking about breaking down the work?

kirkbyo commented 3 years ago

Gotcha, thanks for shining some light on the additional scope of this problem. I was overlooking the canonicalization routines. In that case, I am currently thinking of breaking the work down into the following PRs:

  1. Set up validation at the HTTP layer on both the client and the backend. Since the API types consume a lot of the core types, this will be a little tricky. For this first PR, the API surface will be fully defined using JSON schemas; however, the definitions of the core types will be duplicated. I won't convert the core types to JSON schemas just yet, to limit the blast radius. From here, I will set up ajv and json-schema-to-typescript (see the sketch after this list) and add a GitHub action to ensure that the codegen has run before any merges. I hope this will expose any initial "gotchas" and allow us to pivot before investing too much.
  2. Convert the core types to JSON schemas instead and replace the existing canonicalization routines with ajv. I will remove the duplication introduced in the first PR, and instead reference the core types from the API schemas.
  3. Validate local serialization? I'll admit, I am not quite sure what you mean by "local serializations". Regardless, we can loop back at this point to see where else validation is needed.
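
A sketch of what that codegen step could look like (the file paths are hypothetical):

```ts
// Sketch: a codegen script that turns a JSON schema into a TS definition file
// using json-schema-to-typescript. A CI job can then verify its output is
// committed and up to date before merging.
import { compileFromFile } from "json-schema-to-typescript";
import { promises as fs } from "fs";

async function codegen() {
  const ts = await compileFromFile("schemas/prompt.json");
  await fs.writeFile("src/generated/prompt.ts", ts);
}

codegen().catch((error) => {
  console.error(error);
  process.exit(1);
});
```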

Let me know if you have any other ideas or reservations about this approach!

andymatuschak commented 3 years ago

Got it, sounds reasonable!

Regarding the local serialization: I mean loading from a local LevelDB database (a few implementations in app). But actually, it probably makes sense to hold off on that part, since I'm going to be substantially rearchitecting it in the coming weeks (or so I keep saying!).

kirkbyo commented 3 years ago

The last PR mentioned above marks the completion of this task. I will go ahead and close the issue for now, and we can either reopen it or create a new one once the data migration is completed.


In case anyone stumbles upon this issue in the future, I will leave a few notes on the implementation since we ran into a few hiccups while implementing Solution 3 outlined above.

Problems with Solution 3

When using json-schema-to-typescript, there seems to be no way to avoid code duplication across packages. For example, say the core package defined its primitive types as a JSON schema and then generated TS code from it. The problem arises when the api package references these schemas: when the TS codegen goes to resolve the reference, instead of importing the type from the other package, it just generates the code again. This leaves the types defined in both places.

Protobufs as an alternative

Instead, we looked into using Protocol Buffers, since the ecosystem seemed very mature and there is a good amount of support in the JS community. However, the proto language introduces two main limitations that make things less than ideal for our usage: the lack of required fields and no support for union types (with good reason). This makes protobufs ill-suited for supporting our existing API surface without refactoring. That being said, I really enjoyed working with the proto files compared to the JSON schemas.

Solution 2 instead

Ultimately, Solution 2 was implemented. typescript-json-schema is used to generate JSON schemas from the types declared within orbitAPI.ts, and then ajv is used to validate the contracts at run time. Having typescript-json-schema generate the schemas works great in terms of developer experience, and using its annotation syntax we can specify additional validation criteria. To prevent the JSON schema from falling out of sync with any code changes, a stale-api-schema job was added. This ensures that the schema has no pending changes before merging.
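
As an illustration of that annotation syntax (the request type and fields here are hypothetical, not Orbit's actual API surface):

```ts
// Sketch: typescript-json-schema reads JSDoc annotations on the TS source of
// truth and emits the corresponding JSON-schema validation keywords, which
// ajv then enforces at run time.
export interface IngestTaskRequest {
  /**
   * @minLength 1
   */
  taskID: string;

  /**
   * @minimum 0
   */
  timestampMillis: number;
}
```

Running something like `typescript-json-schema tsconfig.json IngestTaskRequest --required` would then emit a schema where both fields are required and carry the annotated constraints.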