Open Stranger6667 opened 6 months ago
May I ask here, is this Graph struct part of the plan? I focus on the iterability of the JSON result (if you remember me.). I saw this Graph is still used as some vector, maybe it's wip. I'm curious to ask.
Still, glad to see you still working on this project. I'll be following this repo if I can help in some part.
Yes, Graph
is a part of the plan and it will be a vector akin to this one in css-inline
. After some experiments, I found that this kind of layout with a vector and generational IDs is more performant than e.g. Rc
-based or Box<dyn T>
-based ones, assumingly because of the better locality.
It is still WIP though. At this point, I think I am mostly done with the public API, tested some ideas in jsonschema-rs
(custom keyword API is there and works nicely as far as I can tell).
Now I am finishing an article about some jsonschema-rs
details (re optimizations, will be out this week) and will work on:
jsonschema-rs
for its future API. I think migration could be smoother than I originally thought.dynamicAnchor
which I think requires more work. Some of the ref resolving will happen in runtime which may affect the graph structure, but not severely.On the iteration side of things, my current idea is that there will be a trait that represents the way the graph is traversed. With this approach is_valid
, validate
& evaluate
will have different traversal algorithms which will allow them to skip/fast-forward certain nodes for performance reasons. So, implementors of such a trait will receive this graph and will iterate nodes their own way + maybe it could be a customization point for some more exotic use cases like tracking statistics during iteration or something like this (I actually need it to produce coverage maps of a schema)
Hopefully, my answer sheds some light on the current progress and the next steps. I appreciate your questions and feedback, so, if you'd like to be involved we can proceed in discussing any topics you'd like to be more involved to, so we can find a way to collaborate.
P.S. Yes, I remember you :)
I'll be free after 10th May, looking forward to read your article then~🎉
This is a somewhat detailed roadmap for the development of this crate and its predecessor jsonschema-rs. I haven't been very active with
jsonschema-rs
recently, but I have a desire to work on it together with the community towards 1.0.Here I describe the vision for 1.0, how I see the path there, and how to make it actually happen.
This project has been a lot of fun for me over the years, I learned a lot building it and it was useful for me in other projects as well. I hope that it could serve the same purpose for others too.
Therefore I invite YOU to participate, especially if you have the use case for a fast JSON Schema validator and want to shape the development process. Feel free to open new issues/discussions and tag me there or ask any questions here (I want to make a discord chat eventually) or invite people who might be interested. A contributing guide will be added soon, but briefly, the contribution should be aligned with the goals described below.
I also plan to record the progress in a series of blog posts (monthly or so) in detail and share it here (and in the possible Discord channel / Twitter).
Please, let me know what you think!
:heart:
Goals
I am dissatisfied with the number of decisions I made in the original implementation, and now I want to properly address them during the design phase.
jsonschema
(or, at least, have near-feature parity)Non-goals
fastjsonschema
does, so this crate can just build a big function and avoid all the state machine transitions, but I'd put it for later.nostd
. I think it could be possible to store validator internals in the provided block of memory, but it is complex and definitely could be done later on.Core ideas behind the vision
Vec-backed tree
The first idea is to store all keyword implementations in a single
Vec
, and work with them as an ID-tree or more generally.This should reduce the memory fragmentation (I tried this approach in css-inline and it worked great) and therefore, improve performance.
Iterating over such a structure also helps to make the iteration over errors actually lazy. The current implementation collects validation results from subtrees into heap-allocated memory.
Independent storage
Then, store all the keyword data in a format that is independent of the input, i.e. avoid the dependency on
serde_json
.This will remove overhead in Python bindings (which is often >70%), as there will be no need for direct serialization/deserialization (though there could be some intermediate costs anyway). Users will be able to implement it for their types, i.e. for some FFI wrappers, and use this library with C/C++ codebases.
Resolve external references upfront
Resolving all reachable external references upfront makes it possible to build a state machine that will properly address recursive references. Currently, resolving happens in runtime and there is
RWLock
which adds a sensible overhead.This also isolates resolving as a separate step, making it easier to provide blocking/non-blocking interface to it.
Plan
On the high level, I think that the progress towards 1.0 could be split into a few phases:
There is a description for each of them below. Over time I am going to expand each phase into a separate issue with more details.
Update
jsonschema-rs
repoFixing issues that need immediate attention + updating the README to reflect the plans. There are a few issues I want to address in the main repo before diving into this one, specifically:
API design
Here I think there are still some unclear corners around custom validators, format checkers, and the validation output. I.e. I want to be sure that the API design is ergonomic, and feasible to implement.
Reference resolving
This phase will require implementing a more or less draft-agnostic reference resolving (in a separate crate), which will be pluggable into the main crate.
I took a lot of inspiration from Python's referencing and implemented some similar logic, but it is incomplete. I am also not sure if it should be that generic as
referencing
, because the scope here is only JSON Schema.The implementation should be built around a generic JSON access interface, which is more or less done in the
jsonlike
crate.It is great that there is a referencing test suite, so it can be verified :)
State machine compiler & execution
This part takes the input document, and all resolved resources and builds a state machine where each transition leads to another validation state and the next step performs some action. I.e. validation keywords can transition either to an "error" state or "pass" state, applicators are more complicated but I believe feasible to implement too.
Probably worth taking some inspiration here - however, I am leaning more towards DFA
Vocabularies
Each keyword implements some aspect of JSON Schema validation and should be reflected in the state machine execution.
Extensions
At this point, the overall structure should be clear and it is time to implement extensions like custom keywords, format checkers, etc. I am thinking about the approach taken in minijinja, i.e. it is a boxed function (
Arc<dyn Fn>
)Bindings
After the API of the main crate is ready, the Python bindings will be updated. Generally, I don't see many issues here, but need to ensure that all types in the main crate are compatible with making the Python bindings flexible enough.
Migration
Replace all content in the
jsonschema-rs
repo with the new one, write the migration guide + rename the repo tojsonschema
avoiding the-rs
suffix.