Stranger6667 / jsonschema-experimental

MIT License
2 stars 1 forks source link

Roadmap #1

Open Stranger6667 opened 6 months ago

Stranger6667 commented 6 months ago

This is a somewhat detailed roadmap for the development of this crate and its predecessor jsonschema-rs. I haven't been very active with jsonschema-rs recently, but I have a desire to work on it together with the community towards 1.0.

Here I describe the vision for 1.0, how I see the path there, and how to make it actually happen.

This project has been a lot of fun for me over the years, I learned a lot building it and it was useful for me in other projects as well. I hope that it could serve the same purpose for others too.

Therefore I invite YOU to participate, especially if you have the use case for a fast JSON Schema validator and want to shape the development process. Feel free to open new issues/discussions and tag me there or ask any questions here (I want to make a discord chat eventually) or invite people who might be interested. A contributing guide will be added soon, but briefly, the contribution should be aligned with the goals described below.

I also plan to record the progress in a series of blog posts (monthly or so) in detail and share it here (and in the possible Discord channel / Twitter).

Please, let me know what you think!

:heart:

Goals

I am dissatisfied with the number of decisions I made in the original implementation, and now I want to properly address them during the design phase.

Non-goals

Core ideas behind the vision

Vec-backed tree

The first idea is to store all keyword implementations in a single Vec, and work with them as an ID-tree or more generally.

This should reduce the memory fragmentation (I tried this approach in css-inline and it worked great) and therefore, improve performance.

Iterating over such a structure also helps to make the iteration over errors actually lazy. The current implementation collects validation results from subtrees into heap-allocated memory.

Any other thoughts on using state machines are welcome, not sure if this approach would be the most appropriate here

Independent storage

Then, store all the keyword data in a format that is independent of the input, i.e. avoid the dependency on serde_json.

This will remove overhead in Python bindings (which is often >70%), as there will be no need for direct serialization/deserialization (though there could be some intermediate costs anyway). Users will be able to implement it for their types, i.e. for some FFI wrappers, and use this library with C/C++ codebases.

Resolve external references upfront

Resolving all reachable external references upfront makes it possible to build a state machine that will properly address recursive references. Currently, resolving happens in runtime and there is RWLock which adds a sensible overhead.

This also isolates resolving as a separate step, making it easier to provide blocking/non-blocking interface to it.

Plan

On the high level, I think that the progress towards 1.0 could be split into a few phases:

  1. Updating the current repo
  2. Refine the API design
  3. Reference resolving
  4. State machine builder
  5. State machine transitions
  6. Implementing vocabularies (validation, applicators, etc)
  7. Extensions
  8. Bindings
  9. Migration

There is a description for each of them below. Over time I am going to expand each phase into a separate issue with more details.

Update jsonschema-rs repo

Fixing issues that need immediate attention + updating the README to reflect the plans. There are a few issues I want to address in the main repo before diving into this one, specifically:

API design

Here I think there are still some unclear corners around custom validators, format checkers, and the validation output. I.e. I want to be sure that the API design is ergonomic, and feasible to implement.

Reference resolving

This phase will require implementing a more or less draft-agnostic reference resolving (in a separate crate), which will be pluggable into the main crate.

I took a lot of inspiration from Python's referencing and implemented some similar logic, but it is incomplete. I am also not sure if it should be that generic as referencing, because the scope here is only JSON Schema.

The implementation should be built around a generic JSON access interface, which is more or less done in the jsonlike crate.

It is great that there is a referencing test suite, so it can be verified :)

State machine compiler & execution

This part takes the input document, and all resolved resources and builds a state machine where each transition leads to another validation state and the next step performs some action. I.e. validation keywords can transition either to an "error" state or "pass" state, applicators are more complicated but I believe feasible to implement too.

Probably worth taking some inspiration here - however, I am leaning more towards DFA

Vocabularies

Each keyword implements some aspect of JSON Schema validation and should be reflected in the state machine execution.

Extensions

At this point, the overall structure should be clear and it is time to implement extensions like custom keywords, format checkers, etc. I am thinking about the approach taken in minijinja, i.e. it is a boxed function (Arc<dyn Fn>)

Bindings

After the API of the main crate is ready, the Python bindings will be updated. Generally, I don't see many issues here, but need to ensure that all types in the main crate are compatible with making the Python bindings flexible enough.

Migration

Replace all content in the jsonschema-rs repo with the new one, write the migration guide + rename the repo to jsonschema avoiding the -rs suffix.

TieWay59 commented 4 months ago

https://github.com/Stranger6667/jsonschema/blob/d6b19381a7f38ac560d842ece5aad6f4892a4320/src/compiler/mod.rs#L5-L11

May I ask here, is this Graph struct part of the plan? I focus on the iterability of the JSON result (if you remember me.). I saw this Graph is still used as some vector, maybe it's wip. I'm curious to ask.

Still, glad to see you still working on this project. I'll be following this repo if I can help in some part.

Stranger6667 commented 4 months ago

Yes, Graph is a part of the plan and it will be a vector akin to this one in css-inline. After some experiments, I found that this kind of layout with a vector and generational IDs is more performant than e.g. Rc-based or Box<dyn T>-based ones, assumingly because of the better locality.

It is still WIP though. At this point, I think I am mostly done with the public API, tested some ideas in jsonschema-rs (custom keyword API is there and works nicely as far as I can tell).

Now I am finishing an article about some jsonschema-rs details (re optimizations, will be out this week) and will work on:

On the iteration side of things, my current idea is that there will be a trait that represents the way the graph is traversed. With this approach is_valid, validate & evaluate will have different traversal algorithms which will allow them to skip/fast-forward certain nodes for performance reasons. So, implementors of such a trait will receive this graph and will iterate nodes their own way + maybe it could be a customization point for some more exotic use cases like tracking statistics during iteration or something like this (I actually need it to produce coverage maps of a schema)

Hopefully, my answer sheds some light on the current progress and the next steps. I appreciate your questions and feedback, so, if you'd like to be involved we can proceed in discussing any topics you'd like to be more involved to, so we can find a way to collaborate.

P.S. Yes, I remember you :)

TieWay59 commented 4 months ago

I'll be free after 10th May, looking forward to read your article then~🎉