Roadmap - Githubissues

Stranger6667 commented 6 months ago

This is a somewhat detailed roadmap for the development of this crate and its predecessor jsonschema-rs. I haven't been very active with jsonschema-rs recently, but I have a desire to work on it together with the community towards 1.0.

Here I describe the vision for 1.0, how I see the path there, and how to make it actually happen.

This project has been a lot of fun for me over the years, I learned a lot building it and it was useful for me in other projects as well. I hope that it could serve the same purpose for others too.

Therefore I invite YOU to participate, especially if you have the use case for a fast JSON Schema validator and want to shape the development process. Feel free to open new issues/discussions and tag me there or ask any questions here (I want to make a discord chat eventually) or invite people who might be interested. A contributing guide will be added soon, but briefly, the contribution should be aligned with the goals described below.

I also plan to record the progress in a series of blog posts (monthly or so) in detail and share it here (and in the possible Discord channel / Twitter).

Please, let me know what you think!

:heart:

Goals

I am dissatisfied with the number of decisions I made in the original implementation, and now I want to properly address them during the design phase.

Support all recent JSON Schema drafts (starting from Draft 4, including all output formats)
Better performance (less memory fragmentation, no intermediate allocations, real lazy validation)
Async ref resolving
Easier customization (custom keywords, format checks for Python)
More flexibility (in API, in the error type, generic JSON input, arbitrary number precision)
Simpler implementation (current boxing-everything and "combined" keywords drive me nuts)
Reusable components (I want to have a crate as a building block for JSON Schema canonicalization)
Feature parity in Python bindings. TBH, I'd like to have it as a drop-in replacement for jsonschema (or, at least, have near-feature parity)
Better project structure (workspace, extracting CLI and some other components into sub-crates),

Non-goals

Backward compatibility. Reaching the goals WILL break the compatibility, especially around errors (it should not be bound to validation input) and FFI (e.g format check functions aren't usable outside Rust)
Macro-based validator compilation. For a long time, I wanted to try code generation like Python's fastjsonschema does, so this crate can just build a big function and avoid all the state machine transitions, but I'd put it for later.
WASM. Great to have, and keep the new version compatible, but don't do anything specific.
Bindings for other languages. While desirable, I'd put it for later.
nostd. I think it could be possible to store validator internals in the provided block of memory, but it is complex and definitely could be done later on.

Core ideas behind the vision

Vec-backed tree

The first idea is to store all keyword implementations in a single Vec, and work with them as an ID-tree or more generally.

This should reduce the memory fragmentation (I tried this approach in css-inline and it worked great) and therefore, improve performance.

Iterating over such a structure also helps to make the iteration over errors actually lazy. The current implementation collects validation results from subtrees into heap-allocated memory.

Any other thoughts on using state machines are welcome, not sure if this approach would be the most appropriate here

Independent storage

Then, store all the keyword data in a format that is independent of the input, i.e. avoid the dependency on serde_json.

This will remove overhead in Python bindings (which is often >70%), as there will be no need for direct serialization/deserialization (though there could be some intermediate costs anyway). Users will be able to implement it for their types, i.e. for some FFI wrappers, and use this library with C/C++ codebases.

Resolve external references upfront

Resolving all reachable external references upfront makes it possible to build a state machine that will properly address recursive references. Currently, resolving happens in runtime and there is RWLock which adds a sensible overhead.

This also isolates resolving as a separate step, making it easier to provide blocking/non-blocking interface to it.

Plan

On the high level, I think that the progress towards 1.0 could be split into a few phases:

Updating the current repo
Refine the API design
Reference resolving
State machine builder
State machine transitions
Implementing vocabularies (validation, applicators, etc)
Extensions
Bindings
Migration

There is a description for each of them below. Over time I am going to expand each phase into a separate issue with more details.

Update `jsonschema-rs` repo

Fixing issues that need immediate attention + updating the README to reflect the plans. There are a few issues I want to address in the main repo before diving into this one, specifically:

Make the tests work
- https://github.com/Stranger6667/jsonschema-rs/issues/443
- https://github.com/Stranger6667/jsonschema-rs/issues/450
Add Python 3.12 support
- https://github.com/Stranger6667/jsonschema-rs/issues/439

API design

Here I think there are still some unclear corners around custom validators, format checkers, and the validation output. I.e. I want to be sure that the API design is ergonomic, and feasible to implement.

Reference resolving

This phase will require implementing a more or less draft-agnostic reference resolving (in a separate crate), which will be pluggable into the main crate.

I took a lot of inspiration from Python's referencing and implemented some similar logic, but it is incomplete. I am also not sure if it should be that generic as referencing, because the scope here is only JSON Schema.

The implementation should be built around a generic JSON access interface, which is more or less done in the jsonlike crate.

It is great that there is a referencing test suite, so it can be verified :)

State machine compiler & execution

This part takes the input document, and all resolved resources and builds a state machine where each transition leads to another validation state and the next step performs some action. I.e. validation keywords can transition either to an "error" state or "pass" state, applicators are more complicated but I believe feasible to implement too.

Probably worth taking some inspiration here - however, I am leaning more towards DFA

Vocabularies

Each keyword implements some aspect of JSON Schema validation and should be reflected in the state machine execution.

Extensions

At this point, the overall structure should be clear and it is time to implement extensions like custom keywords, format checkers, etc. I am thinking about the approach taken in minijinja, i.e. it is a boxed function (Arc<dyn Fn>)

Bindings

After the API of the main crate is ready, the Python bindings will be updated. Generally, I don't see many issues here, but need to ensure that all types in the main crate are compatible with making the Python bindings flexible enough.

Migration

Replace all content in the jsonschema-rs repo with the new one, write the migration guide + rename the repo to jsonschema avoiding the -rs suffix.

TieWay59 commented 4 months ago

https://github.com/Stranger6667/jsonschema/blob/d6b19381a7f38ac560d842ece5aad6f4892a4320/src/compiler/mod.rs#L5-L11

May I ask here, is this Graph struct part of the plan? I focus on the iterability of the JSON result (if you remember me.). I saw this Graph is still used as some vector, maybe it's wip. I'm curious to ask.

Still, glad to see you still working on this project. I'll be following this repo if I can help in some part.

Stranger6667 commented 4 months ago

Yes, Graph is a part of the plan and it will be a vector akin to this one in css-inline. After some experiments, I found that this kind of layout with a vector and generational IDs is more performant than e.g. Rc-based or Box<dyn T>-based ones, assumingly because of the better locality.

It is still WIP though. At this point, I think I am mostly done with the public API, tested some ideas in jsonschema-rs (custom keyword API is there and works nicely as far as I can tell).

Now I am finishing an article about some jsonschema-rs details (re optimizations, will be out this week) and will work on:

Adjusting jsonschema-rs for its future API. I think migration could be smoother than I originally thought.
Will release a new version that includes some perf optimizations covered in the article + custom keyword implementation.
Collect some more feedback on the API and possible use cases, specifically around reference resolving like dynamicAnchor which I think requires more work. Some of the ref resolving will happen in runtime which may affect the graph structure, but not severely.
Implement spec-agnostic resolving before building the graph.

On the iteration side of things, my current idea is that there will be a trait that represents the way the graph is traversed. With this approach is_valid, validate & evaluate will have different traversal algorithms which will allow them to skip/fast-forward certain nodes for performance reasons. So, implementors of such a trait will receive this graph and will iterate nodes their own way + maybe it could be a customization point for some more exotic use cases like tracking statistics during iteration or something like this (I actually need it to produce coverage maps of a schema)

Hopefully, my answer sheds some light on the current progress and the next steps. I appreciate your questions and feedback, so, if you'd like to be involved we can proceed in discussing any topics you'd like to be more involved to, so we can find a way to collaborate.

P.S. Yes, I remember you :)

TieWay59 commented 4 months ago

I'll be free after 10th May, looking forward to read your article then~🎉

Stranger6667 / jsonschema-experimental

Roadmap #1

Goals

Non-goals

Core ideas behind the vision

Vec-backed tree

Independent storage

Resolve external references upfront

Plan

Update `jsonschema-rs` repo

API design

Reference resolving

State machine compiler & execution

Vocabularies

Extensions

Bindings

Migration

Stranger6667 / jsonschema-experimental

Roadmap #1

Goals

Non-goals

Core ideas behind the vision

Vec-backed tree

Independent storage

Resolve external references upfront

Plan

Update jsonschema-rs repo

API design

Reference resolving

State machine compiler & execution

Vocabularies

Extensions

Bindings

Migration

Update `jsonschema-rs` repo