Open MatthewPinheiro opened 2 years ago
There is no way to fail fast with contains
keyword, as it has to validate items until it finds a valid one, and if none are valid it would validate all of them - there is no way around it, and it's independent of allErrors option. Without allErrors it might fail faster on each item - that is, it would not have to validate all properties of each item, but would fail on the first wrong one.
If performance is critical, you could tweak schema by using allOf keyword to control validation order and by putting cheaper and more likely to fail things in front, but it raises the question whether JSON Schema is a viable option for performance critical validation or you should simply write code instead of using JSON Schema as an imperative language...
How the errors are reported is a separate question, but whether it reports a set of errors per item, or one error per array - it has to validate all items.
Possibly I misunderstood and you wanted to short-circuit items inside array? Why can't you simply avoid using allErrors option? In general, allErrors option is useful for debugging/testing and helpful to generate errors in UI with ajv-errors package, but it should not be used in production servers, as it is very to DDoS them in case you are not short-circuiting on all malformed data. See this: https://ajv.js.org/security.html#security-risks-of-trusted-schemas
Possibly I misunderstood and you wanted to short-circuit items inside array? Why can't you simply avoid using allErrors option? In general, allErrors option is useful for debugging/testing and helpful to generate errors in UI with ajv-errors package, but it should not be used in production servers, as it is very to DDoS them in case you are not short-circuiting on all malformed data. See this: https://ajv.js.org/security.html#security-risks-of-trusted-schemas
You're correct, this is the functionality I'm asking for, and really to short-circuit validation at arbitrary levels in general. I'm well aware of the security risks of allErrors
, but I also have a strong need for it, one that I imagine others also have. I'm also working in an environment in which I can largely mitigate (and even where I can't, can still totally accept) those risks.
Like I'm sure many do, I'm using Ajv to ensure the data in our systems follow policy. Users regularly interact with these systems and may modify them in ways that are undesirable (e.g. they misspell an important field, forget a critical piece of info, accidentally turn off a certain flag, etc.). Because we don't control their frontends and often can't define policies within the system that are specific enough for our business needs, the best we can do is lock down access and separately report on if/when these systems are out of spec. At the same time, depending on what's being done and why, a single object as represented by an API response could have multiple things wrong with it.
In my original example, if I set allErrors
to false, the only error would be the lack of a user-42
object. This would be conveyed to the relevant people and presumably fixed, but the issue with the array length would go unnoticed until the next time my application runs. Even if this is done immediately after the previous fix, depending on how many errors the data has, it could be unreasonably long to constantly run, evaluate, fix, run again, etc. Worse, depending on how interdependent any failed subschemas are, it could be entirely confusing as to why the data is wrong. The latter is also partially why I'm using ajv-errors to write more human-readable errors.
Not that it necessarily matters either, but I'd also like to point out that schema-utils, part of and written by webpack, uses Ajv with allErrors: true
as a prod dependency, and I would assume other apps do too. IMO, most applications that evaluate unpredictable data/user-input need to be well aware of what that means regardless of the library they're using, and I think as the documentation alludes to, allErrors: true
as a DoS vector is often a magnification of an existing problem in your schema or code. If anything, the ability to arbitrarily short circuit would actually lessen that magnification in instances like the above.
to point out that schema-utils, part of and written by webpack, uses Ajv with allErrors: true as a prod dependency
webpack itself rarely used on production API servers, it is used as part of build process, with different risk/security profiles, so not sure this argument applies.
I think a particular problem here is that contains
validation cares about array level and not about item level, and therefore item level validation can be short-circuited even in case all-errors is used. This reasoning makes sense, and there are some other places where this logic already applies (I think).
I definitely do not want to introduce a generalisation for the sake of one scenario, without listing other scenarios where this generalisation could be helpful.
It's quite straightforward to make this change, but it's a breaking one so it should be in the next major version if it's a problem for a large number of users. Util then it's best to make it in a fork.
Let's keep this open to see if it has interest, and maybe you could think what other keywords this would be helpful for and how this generalisation would work...
Hello! Sorry for the delay on this, and thanks for keeping it open. I've been experimenting and thinking about this a lot, and I think I've gotten somewhere useful.
[...] and therefore item level validation can be short-circuited even in case all-errors is used. This reasoning makes sense, and there are some other places where this logic already applies (I think).
You're right, I found that keywords like 'if' already short-circuit themselves even when the Ajv instance is using allErrors
. So I created a couple custom keywords and found that arbitrary short-circuiting, as it turns out, was not actually that useful. However, I also created a short-circuiting version of contains (which I'll refer to as scContains
from here out) and that actually is useful. But while that solved my original example, it doesn't solve another closely related issue that I didn't realize I was also attempting to solve.
Let's say I want to use the array-checker
schema from my original example to validate some array of length n
, where every element of this array is already guaranteed to be an object with a unique id
property (a really, really common scenario for REST APIs, which tend to use databases with primary keys for storage). Let's also say that at index m
of this array is the object { "id": 42, "name": "Gary" }
. This is an extremely close match for our schema, but the name just barely misses, so validation will fail. But because we know that the id
property is unique, we don't even need to check to know that the remaining elements will also fail. As it is now though, contains
will check the remaining n - m
elements anyway, including making redundant checks on "name" even after having failed on "id". On the other hand, while scContains
would be an improvement in that it would prevent the redundant checks on "name", it would also waste cycles by continuing to search after index m
.
In other words, there's no good way to search for an array element based on a unique identifier, and then only worry about the correctness of that element once we know it's there.
There are some existing keyword combinations with contains/scContains
that partially get around this, such as with if/then/else
, items
, or discriminator
, but long story short, none of them are ideal in terms of accuracy, performance, verbosity of the schema, or error reporting. If interested in why, I can explain further, but I'll just skip to what did work.
What truly solved the issue for me was a second keyword which works in combination with scContains
, and which I think would be extremely useful for JSON validation in general. This keyword, which I'll refer to as whenContains
, accepts a JSON Schema, and only runs after successful validation of scContains
, similar to how if/then/else
operates. If scContains
is valid, the matching item (or items, if minContains
> 1) is run against the schema from whenContains
. This has been hugely advantageous to me, and here's what it looks like when I apply it to the original example (collapsed for brevity):
While contains
could be updated in a new version of Ajv to short-circuit as we discussed above, obviously whenContains
is non-standard. I do think it'd be a great addition to ajv-keywords, though, but there is a slight problem. In my own implementation of these keywords, I gave scContains
total control of when whenContains
should execute (i.e. the code generator for scContains
calls the whenContains
subschema if its keyword is present, just like how the code generator for if
is in charge of calling then
and else
). However, since it's ideally meant to replace contains
, obviously it shouldn't reference a non-standard keyword.
I did this because I couldn't figure out how to get contains/scContains
to expose the index it matched on, and because I couldn't reliably ensure it would run before whenContains
(without using the "before" property anyway, which would defeat the point of keeping it non-referential). I believe the former could be resolved pretty easy by adding a new property to the schema context or something, but I'm not sure how to go about the latter.
Alternatively though, both of the aforementioned custom keywords could just be added to ajv-keywords instead, perhaps with some better naming? Either way, if you agree that there's value in this, I would definitely be willing to contribute with a PR.
What you want reminds me discriminator
with the only difference that it discriminates among array items that satisfy some condition...
I honestly don't think these should be two separate keywords - they are coupled, it could be clearer to just have one keyword (although I would simply do it in code).
Maybe something like:
{
item: {
indexKey: "property_name",
indexValue: {schema}
value: {schema},
},
// uniqueItemIndex: ["property_name"] // defined in ajv-keywords
}
But it more and more feels like general purpose language in JSON (and one can go quite far with it) - so just code could be better.
First I have to say, regardless of whether or not I make a good case here, I want to give a huge thank you for your continued replies to my post and your ongoing work with this immensely useful library. It must be a ton of effort just keeping up with GitHub posts alone (especially when people like me write an essay every comment) but it's been really motivating to see and has really inspired me to contribute if/where I can! 👍
But it more and more feels like general purpose language in JSON (and one can go quite far with it) - so just code could be better.
While a code solution would indeed be faster, neither that nor a full-blown JSON scripting language would be better for me. I'm mostly mentioning performance as it highlights the intent of the keyword better, but speed isn't my main issue. I definitely understand the hesitance on anything that may blur the lines between JSON Schema and a general purpose scripting language, but you also pointed out that this reminds you of discriminator
already, so I think this is much closer to a logical extension of JSON Schema. However, this also accomplishes something that can't be with discriminator
or really any existing flow-control keyword combination.
discriminator
is useful for applying conditional validation based on known, static variations of a single property, but that's not what I'm doing here. In fact, I can't actually convert my above schema to use discriminator
because it doesn't support $ref
or numbers for the discriminated property. But let's assume I inline the subschemas and change my unique "id" to the string "42"
instead. Presumably we want to give the corresponding oneOf
two possibilities: Either an element's "id" is "42"
in which case it should look one way, or the element's "id" is not "42"
in which case it can look like anything. This is immediately a problem because the discriminated property needs to use either const
or enum
with static, fixed strings, so there's no way to define this second possibility. Even if this was possible, discriminator
being inside contains
would mean that the first element that didn't have an "id" of "42"
would also result in a match, which would be inaccurate. We could get around this by swapping out contains
for items
which would prevent the "42"
element from escaping validation, but this breaks the schema as it would pass an array without a "42"
element.
In addition, discriminator
doesn't support keying off of compound properties. For example, if there were two properties which only together made something unique, (e.g. "id" and "serverId" or something) there'd be no way to capture that with discriminator
. Using whenContains
, because I can supply any JSON Schema object to contains
, doing so is totally possible. Same goes for any dynamic condition, e.g. pattern
, not
, $data
, etc.
As for being a single keyword or not, I think either is fine. However, the dependency is one way as contains/scContains
can exist with or without whenContains
, similar to other existing keyword combinations.
EDIT: I think a single keyword is okay, especially if it's not possible to using it with vanilla contains
, but I think keeping them separate highlights intent better. Regardless of dependency, contains
and whenContains
are each responsible for a totally different validation, each with their own errors, and I think that's more inline with all the existing keywords (e.g. items
and prefixItems
): Even where there are dependencies and the existence of one will affect if/how another executes, each keyword is still responsible for its own set of errors and does not imply failure or success of the other. Similarly, contains
is for validating the array itself, and whenContains
is for validating an element of the array. That's fine if it has to be done in one keyword, but at this point I'm just really trying to highlight what space this fills.
What version of Ajv you are you using?
8.6.3
What problem do you want to solve?
This is a feature/change request, but any advice on how to better handle this problem with what's already there would be greatly appreciated as well. Either way, say I have the following schemas and data:
schemas
invalid data
When I run
array-checker
against the above data with theallErrors
option, I get the following error messages:I'd argue that in terms of message accuracy, this doesn't capture my intent with using
contains
, and I'd imagine it doesn't for most use cases of it either. I'm trying to check two things about this array: That it has at least 10 items, and that at least one of those items looks a certain way, hence the need forallErrors
. However, this labels the array elements as incorrect, similar to howitems
would, even though that's not really the case. None of these elements are incorrect, it's the array that's incorrect.Additionally, if
user-42
was a substantially larger schema, this would burn a lot more CPU cycles. So far, I've tried getting around this through a combination ofif/then/else
and the ajv-errors package, but these both have their issues.if/then/else
makes the schema a little harder to understand and adds its own redundant error messages, and ajv-errors has its own limitations. I've also tried manually omitting errors underneath contains based on theschemaPath
, but I think as mentioned in https://github.com/ajv-validator/ajv/issues/1662, due to the use of$ref
the full path isn't provided.What do you think is the correct solution to problem?
The first thing I can think of is allowing for another value in the
allErrors
option. This value could be a string such as"contains"
, which would be equivalent toallErrors: true
in all cases except when using thecontains
keyword, at which point any JSON Schema underneath would be equivalent toallErrors: false
. Presumably this also wouldn't be a a breaking change as it's a new option value.While I think this solves the issue in a lot of cases (and in cases where it doesn't, it would still simplify certain workarounds), I think it's just symptomatic of a larger issue, which is the inflexibility of allErrors. Sometimes I want fail-fast, and sometimes I don't. In theory, flexibility could be achieved with a new keyword, potentially one part of ajv-keywords, and would allow
allErrors
to be overridden for any schemas it's placed in. However, I'm not sure theaddKeyword
method gives enough control to allow this.Or, and perhaps this is just wishful thinking, would there be value in petitioning the JSON Schema spec to reserve a meta keyword just for implementation-specific options? e.g.
$implementation
could be an object of whatever key-value pairs you want. It would then be up to the implementation to choose how this object is used, such as allowingallErrors
to be overridden at different levels. Would probably be do a lot for flexibility, if feasible.Will you be able to implement it?
Not likely, but would be open to discuss