andymatuschak / orbit

Experimental spaced repetition platform for exploring ideas in memory augmentation and programmable attention
https://withorbit.com

Project: add validation for core data types (API, storage) #184

Closed andymatuschak closed 3 years ago

andymatuschak commented 3 years ago

I've played fast and loose with validation in Orbit so far, but particularly as we adopt on-device storage and open up a public API, we need to get serious about this. We need validators for all our core data types. When the client or server receives data from the network, it must validate that request/response. When reading data from disk, we should validate that data.

Ideally, our validation strategy is fairly automatic: either we're automatically generating code based on our TS types, or we're using some strategy to infer a validator at runtime (e.g. io-ts).
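For a sense of the second approach, here's a minimal sketch of the io-ts style; the Prompt shape is illustrative, not Orbit's actual type:

```ts
// Sketch of the io-ts style: the validator is the source of truth, and the
// static TS type is inferred from it. The Prompt shape is illustrative.
import * as t from "io-ts";
import { isRight } from "fp-ts/Either";

const Prompt = t.type({
  content: t.string,
});

// Compile-time type inferred from the runtime validator:
type Prompt = t.TypeOf<typeof Prompt>;

// decode returns an Either: Right(value) on success, Left(errors) on failure.
const result = Prompt.decode(JSON.parse('{"content": "2 + 2 = ?"}'));
console.log(isRight(result) ? "valid" : "invalid");
```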

I'd love help evaluating the approaches we might take to this, making a call, implementing the infrastructure, and adopting it appropriately throughout the system.

kirkbyo commented 3 years ago

Skip to the bottom for the recommendation.

Solution 1

Solution 2

Keep the TS definitions as the source of truth and generate JSON schemas from them (e.g. with typescript-json-schema).

Solution 3

Convert all the TS definitions to JSON schema files, and use that as the source of truth instead. These files will be used to generate both validators and TS types.


Why JSON Schema over something like OpenAPI/Swagger? From what I found, there seems to be stronger tooling around JSON Schema. We could maybe use express-openapi-validator to validate the request and response bodies and openapi-typescript-codegen to generate TS definitions. However, there seem to be few other alternatives, and it is unclear to me right now whether we could "hack" express-openapi-validator into working with cloud functions (and it would not handle client-side validation either).

Recommendation

I would recommend Solution 3. The existing TS definitions would be converted to JSON schemas. Then ajv would be used to load them and produce run-time validators, and json-schema-to-typescript would be used to generate TS types. In theory, it looks like ajv could be used to infer TS types directly. However, I could not get the code sample to run, and there seem to be some limitations of this system (e.g. Boolean is not supported currently).
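A rough sketch of the runtime half of this pipeline, with an illustrative schema rather than Orbit's real definitions:

```ts
// Sketch: compiling a JSON schema into a runtime validator with ajv.
// The schema here is hand-written inline for illustration; under Solution 3
// it would live in a .json file and also feed json-schema-to-typescript.
import Ajv, { JSONSchemaType } from "ajv";

type Prompt = {
  content: string;
};

const promptSchema: JSONSchemaType<Prompt> = {
  type: "object",
  properties: {
    content: { type: "string" },
  },
  required: ["content"],
  additionalProperties: false,
};

const ajv = new Ajv();
const validatePrompt = ajv.compile(promptSchema);

const data: unknown = JSON.parse('{"content": "2 + 2 = ?"}');
if (validatePrompt(data)) {
  // ajv's type guard narrows data to Prompt here
  console.log(data.content);
} else {
  console.error(validatePrompt.errors);
}
```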

Let me know what you think, and if you believe there are other directions I should consider.

andymatuschak commented 3 years ago

Thanks for this thoughtful write-up, @kirkbyo!

We'd like validation at a lower layer than the HTTP API layer

One key thing I notice reading this is that we really have two layers of desired types and validation. We have some core data types—things like ActionLog and Prompt—which we want to validate both in the context of API requests/responses and when working with local serializations. And then we also have types representing specific APIs' request/response data formats, which should build on those "core" data types.

This observation helps me understand why centering on solutions like OpenAPI is probably not quite right: that ecosystem is really focused on the request/response validation context, rather than validating arbitrary data types.
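To make that layering concrete, here's a sketch of how API-level schemas could reference the core schemas instead of restating them; the schema IDs, paths, and field names are hypothetical:

```ts
// Sketch: an API-layer schema reusing a core type via $ref, so the
// request/response layer builds on the core definitions rather than
// duplicating them.
import Ajv from "ajv";

const coreSchema = {
  $id: "core.json",
  definitions: {
    Prompt: {
      type: "object",
      properties: { content: { type: "string" } },
      required: ["content"],
      additionalProperties: false,
    },
  },
};

const ingestRequestSchema = {
  $id: "ingest-request.json",
  type: "object",
  properties: {
    prompt: { $ref: "core.json#/definitions/Prompt" },
    timestampMillis: { type: "number" },
  },
  required: ["prompt", "timestampMillis"],
  additionalProperties: false,
};

const ajv = new Ajv();
ajv.addSchema(coreSchema); // registered once, referenced from the API layer
const validateIngestRequest = ajv.compile(ingestRequestSchema);
```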

These types don't change that frequently

Then I considered whether it made more sense to generate JSON from TS or TS from JSON. I had been biased towards the former, but I notice now that these types don't change that frequently. In fact, the base types often can't change without breaking our hashing scheme.

Given that, and given the flexibility of having some universally-parsable format as the source of truth, I'm biased in that direction. We don't have that many of these types; creating schemas from them shouldn't be that bad.

I like that JTD (JSON Type Definition) has a formal spec, though it concerns me a bit that it's quite new and not yet "ratified" in any meaningful way. But the fact that we don't have that many of these types makes me less concerned. If we need to move to some other inert schema format later, it won't be the end of the world.

Besides validation, we need canonicalization

Orbit's data format involves a hash graph, which depends on stable hashing semantics. This can be tricky in JavaScript, since object types are "open": any random function can decide to add arbitrary keys to a prompt object, and when we hash it, those unintentional extra keys produce a different hash from the one we'd get without the interference. That's no good!
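A minimal illustration of the failure mode, with stringify-then-hash standing in for the real hashing scheme (which isn't shown here):

```ts
// Sketch: an extra key silently changes the hash of the "same" prompt.
import { createHash } from "crypto";

const hash = (value: unknown) =>
  createHash("sha256").update(JSON.stringify(value)).digest("hex");

const prompt = { content: "2 + 2 = ?" };
const tampered = { ...prompt, _uiCache: true }; // some function added a stray key

console.log(hash(prompt) === hash(tampered)); // false
```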

So core includes some "canonicalization" routines, which are quite related to validators. Given an object, they return a new object which includes only the "officially endorsed" keys (recursively).

We'll want our schema/validation strategy to accommodate this, too, to avoid duplicating the specification. I know that io-ts can do this. It looks like we can also do this with ajv by validating with the removeAdditional option enabled.
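A sketch of that ajv approach, with an illustrative schema:

```ts
// Sketch: ajv with removeAdditional strips unknown keys during validation,
// which doubles as a canonicalization pass before hashing.
import Ajv from "ajv";

const ajv = new Ajv({ removeAdditional: "all" });

const validatePrompt = ajv.compile({
  type: "object",
  properties: {
    content: { type: "string" },
  },
  required: ["content"],
});

const prompt: Record<string, unknown> = {
  content: "2 + 2 = ?",
  _uiCache: true,
};
validatePrompt(prompt); // mutates prompt in place, removing _uiCache
console.log(prompt); // { content: "2 + 2 = ?" }
```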

Performance considerations

If we use ajv's compiler (rather than loading schemas at runtime), these solutions should have essentially the same performance characteristics.


I'm in favor of going ahead with the third solution! How are you thinking about breaking down the work?

kirkbyo commented 3 years ago

Gotcha, thanks for shining some light on the additional scope of this problem. I was overlooking the canonicalization routines. In that case, I am currently thinking of breaking the work down into the following PRs:

  1. Set up validation at the HTTP layer on both the client and the backend. Since the API types consume a lot of the core types, this will be a little tricky. For this first PR, the API surface will be fully defined using JSON schemas; however, the definitions of the core types will be duplicated. I won't convert the core types to JSON schemas just yet, to limit the blast radius. From here, I will set up ajv and json-schema-to-typescript (see the sketch after this list) and add a GitHub action to ensure that the codegen has run before any merges. I hope this will expose any initial "gotchas" and allow us to pivot before investing too much.
  2. Convert the core types to JSON schemas instead and replace the existing canonicalization routines with ajv. I will remove the duplication introduced in the first PR, and instead reference the core types from the API schemas.
  3. Validate local serialization? I'll admit, I am not quite sure what you mean by "local serializations". Regardless, we can loop back at this point to see where else validation is needed.
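
A sketch of what that codegen step could look like (the file paths are hypothetical):

```ts
// Sketch: a codegen script that turns a JSON schema into a TS definition file
// using json-schema-to-typescript. A CI job can then verify its output is
// committed and up to date before merging.
import { compileFromFile } from "json-schema-to-typescript";
import { promises as fs } from "fs";

async function codegen() {
  const ts = await compileFromFile("schemas/prompt.json");
  await fs.writeFile("src/generated/prompt.ts", ts);
}

codegen().catch((error) => {
  console.error(error);
  process.exit(1);
});
```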

Let me know if you have any other ideas or reservations about this approach!

andymatuschak commented 3 years ago

Got it, sounds reasonable!

Regarding the local serialization: I mean loading from a local LevelDB database (a few implementations in app). But actually, it probably makes sense to hold off on that part, since I'm going to be substantially rearchitecting it in the coming weeks (or so I keep saying!).

kirkbyo commented 3 years ago

The last PR mentioned above marks the completion of this task. I will go ahead and close the issue for now, and we can either reopen it or create a new one once the data migration is completed.


In case anyone stumbles upon this issue in the future, I will leave a few notes on the implementation since we ran into a few hiccups while implementing Solution 3 outlined above.

Problems with Solution 3

When using json-schema-to-typescript, there seems to be no way to avoid code duplication across packages. For example, say the core package defined its primitive types as a JSON schema and then generated TS code from it. The problem arises when the api package references these schemas: when the TS codegen goes to resolve the reference, instead of importing the type from the other package, it just generates the code again. This leaves the types defined in both places.

Protobufs as an alternative

Instead, we looked into using Protocol Buffers, since the ecosystem seemed very mature and there is a good amount of support in the JS community. However, the proto language introduces two main limitations that make things less than ideal for our usage: the lack of required fields and no support for union types (with good reason). This makes protobufs ill-suited for supporting our existing API surface without refactoring. That being said, I really enjoyed working with the proto files compared to the JSON schemas.

Solution 2 instead

Ultimately, Solution 2 was implemented. typescript-json-schema is used to generate JSON schemas from the types declared within orbitAPI.ts, and then ajv is used to validate the contracts at run time. Having typescript-json-schema generate the schemas works great in terms of developer experience, and using its annotation syntax we can specify additional validation criteria. To prevent the JSON schema from falling out of sync with any code changes, a stale-api-schema job was added. This ensures that the schema has no pending changes before merging.
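
As an illustration of that annotation syntax (the request type and fields here are hypothetical, not Orbit's actual API surface):

```ts
// Sketch: typescript-json-schema reads JSDoc annotations on the TS source of
// truth and emits the corresponding JSON-schema validation keywords, which
// ajv then enforces at run time.
export interface IngestTaskRequest {
  /**
   * @minLength 1
   */
  taskID: string;

  /**
   * @minimum 0
   */
  timestampMillis: number;
}
```

Running something like `typescript-json-schema tsconfig.json IngestTaskRequest --required` would then emit a schema where both fields are required and carry the annotated constraints.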