bowtie-json-schema / bowtie

JSON Schema in every programming language
https://bowtie.report/

Evaluate the JSON Toolkit Benchmarks #1543

Open · Julian opened 4 days ago

Julian commented 4 days ago

@jviotti points out that he put together some nice benchmarks as part of his JSON Toolkit validator.

They live here.

We should include any and all of them that aren't already "well covered" by an existing benchmark (which may not cover any of them; I haven't looked at the suite yet).

jviotti commented 3 days ago

Hey @Julian, the link you provided points to localised Google Benchmark cases at the repo level. The "real" benchmark across implementations is taking place here: https://github.com/sourcemeta-research/jsonschema-benchmark. So far, we are just publishing results through GitHub Actions. For example, see this run: https://github.com/sourcemeta-research/jsonschema-benchmark/actions/runs/11041938995.

We'll eventually be working on adding more implementations and extending it to other dialects (we are starting with Draft 4 for the first paper and then taking it from there).

If there is any way this can eventually become part of Bowtie, that'd be amazing.
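
For a rough sense of what a per-implementation timing like this measures, here is a minimal sketch using Python's `jsonschema` package with a made-up Draft 4 schema and instance; it is not the harness from the linked repository, just an illustration of the compile-once, validate-repeatedly pattern such benchmarks typically time.

```python
# Illustrative sketch only: times repeated Draft 4 validations with the
# Python `jsonschema` package. The schema and instance are placeholders,
# not taken from the jsonschema-benchmark corpus.
import timeit

from jsonschema import Draft4Validator

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "port": {"type": "integer"}},
    "required": ["name"],
}
instance = {"name": "bowtie", "port": 8080}

validator = Draft4Validator(schema)  # compile once, outside the timed loop

elapsed = timeit.timeit(lambda: validator.is_valid(instance), number=10_000)
print(f"10,000 validations: {elapsed:.3f}s")
```

A real cross-implementation comparison would drive each validator through its own native compile/validate entry points over the same schema/instance corpus, keeping the timing methodology consistent across languages.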

Julian commented 3 days ago

Ah great! Fixed the link, thanks! I haven't looked at them yet, but it sounds like they definitely can/should be run as part of Bowtie's benchmark suite! We probably need some docs on how to add benchmarks to it; I'll come back with some details. But yeah, I'm all for it, this seems like great work.

jviotti commented 3 days ago

Awesome. Let's collaborate on it. I'm all in for somehow making it part of Bowtie.

jdesrosiers commented 3 days ago

Sorry for always being the anti-benchmark guy, but I don't think this suite of benchmarks as a whole is great. It's almost all the same thing: medium-sized, very simple schemas. The vast majority of the suite is just testing the speed of type evaluation. It's fine to have that benchmark, as medium-sized, very simple schemas are probably the most common schemas out there, but there's little point in having a suite made up almost entirely of the same kind of schema.

However, there were two schemas that stood out as useful benchmarks to me. One was the krakend schema. It's interesting because it's a very large yet fairly simple schema, but also because it makes use of a lot of references, which could be slow in some implementations. The other interesting schema is yamllint because it describes a structure with a lot of non-trivial alternatives (using anyOf/oneOf). This might be a good candidate for a medium-complexity/medium-size schema benchmark.
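
For a rough sense of the two shapes being described, here are hypothetical minimal schemas (not the actual krakend or yamllint schemas), written as Python dicts so they can be checked with a validator such as the `jsonschema` package:

```python
# Hypothetical minimal examples of the two schema shapes discussed above;
# these are not the real krakend or yamllint schemas.
from jsonschema import Draft4Validator

# Shape 1: lots of references. Repeated `$ref` resolution is the part that
# can be slow in some implementations.
ref_heavy = {
    "definitions": {
        "endpoint": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
        }
    },
    "type": "object",
    "properties": {
        "backend": {"$ref": "#/definitions/endpoint"},
        "frontend": {"$ref": "#/definitions/endpoint"},
    },
}

# Shape 2: non-trivial alternatives via anyOf/oneOf, so a validator may have
# to attempt several branches per instance before one succeeds.
alternatives = {
    "type": "object",
    "properties": {
        "rule": {
            "anyOf": [
                {"type": "string", "enum": ["enable", "disable"]},
                {
                    "type": "object",
                    "properties": {"level": {"type": "string"}},
                    "required": ["level"],
                },
            ]
        }
    },
}

print(Draft4Validator(ref_heavy).is_valid({"backend": {"url": "https://x"}}))  # True
print(Draft4Validator(alternatives).is_valid({"rule": {"level": "warning"}}))  # True
```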

jviotti commented 3 days ago

It depends on the goal of the benchmark. Most JSON Schema users will indeed write "medium-sized very simple schemas" without using many complex keywords (if any), so those schemas (taken from SchemaStore) are very representative (and hence useful to us).

Another goal (which you are pointing out) is to stress test a validator with complex schemas that might highlight interesting findings, but might be a lot less representative of what real users will run in production.

If we want good long-term coverage, we should do both: lots of "simple" representative schemas + a set of complex schemas that stress test corner cases. At least for my immediate use case, I care a lot more about speeding up the kind of simple schemas that users are writing right now.

jviotti commented 3 days ago

Also keep in mind that the repository we have is still very much a work in progress, as we are just building the benchmark, adding implementations, etc. We want to add a lot more cases before publishing the paper. If you have any specific interesting ones to include, definitely let us know!

michaelmior commented 3 days ago

@jdesrosiers Note that these are actual schemas and documents that real users are using. The schemas weren't selected because they are simple, but because they are representative. That said, I would certainly agree that it would be nice to have more complex schemas as well to stress test implementations. But I think there is some value to having examples that correspond with real-world usage.

jdesrosiers commented 2 days ago

Let me take a step back. My problem with benchmarks is that every use-case is different and has different needs, but benchmarks don't tend to do a good job representing a variety of dimensions that people can match up to their use-case. The point I'm trying to make is that you don't have 20 benchmarks, you have the same benchmark 20 times. What I think would be more useful is if you had benchmarks representing small schemas, large schemas, simple schemas, complex schemas, other dimensions, and combinations of these dimensions. As a user, I know the typical size/complexity/whatever of my schemas and can match that with the appropriate benchmarks to decide what implementation to choose.
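
As a sketch of how that matching-by-dimension could be expressed (the case names and tags below are hypothetical and don't come from any existing suite):

```python
# Hypothetical sketch: tag benchmark cases by size/complexity so a user can
# pick the subset that matches their typical schemas. Names and tags are
# made up for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    name: str
    size: str        # "small", "medium", "large"
    complexity: str  # "simple", "complex"


CASES = [
    BenchmarkCase("flat-config", size="medium", complexity="simple"),
    BenchmarkCase("krakend-like", size="large", complexity="simple"),
    BenchmarkCase("yamllint-like", size="medium", complexity="complex"),
]


def select(size: str, complexity: str) -> list[BenchmarkCase]:
    """Return the cases matching a user's typical schema profile."""
    return [c for c in CASES if c.size == size and c.complexity == complexity]


print(select("medium", "complex"))  # -> the "yamllint-like" case
```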

It's not really related to the point I was trying to make, but I'd like to push back on the claim that schemas from SchemaStore are representative. They are real schemas and that's great, but we have to consider what SchemaStore is. SchemaStore is used by IDEs and editors to provide completions and diagnostics for config files. This is not the primary use of JSON Schema, which means these schemas are full of workarounds to get the behaviors the schema author expects from the IDE/editor, forcing a particular style that might not be found in more typical uses of JSON Schema such as API request validation. These aren't schemas users consume directly and use with their chosen implementation; they are used by tooling to provide non-standard functionality. So, you're assuming that schemas made for IDEs/editors are representative of schemas made for generic validators, and it's generic validators, not IDEs/editors, that you're testing.

michaelmior commented 2 days ago

@jdesrosiers I agree with a lot of your points, although I'll note that not all of the schemas used come from JSON Schema Store. I don't think it's true that it's the same benchmark 20 times. I say that mostly because we did encounter very different performance issues with several of the schemas. I'm not saying we can't get better at representing even more diverse schemas, though. We certainly can 😃