adobe / json-formula

Query language for JSON documents
http://opensource.adobe.com/json-formula/
Apache License 2.0

Unexpected type coercion in `value` #146

Closed Eswcvlad closed 4 months ago

Eswcvlad commented 5 months ago

From reading the spec, I assumed the following for value.

This is different from what I see in the implementation.


  1. value(`false`, 0)
  2. value(42, 0)
  3. value("abc", 0)

These cases throw a TypeError, even though the first argument can be converted to an array.

The behavior is the same if both arguments need conversion, as evaluation already fails on the first one.


  1. value([1, 2, 3], `false`)
  2. value([1, 2, 3], `true`)
  3. value([1, 2], "sdfgesrg")

In these cases null is returned instead of an element, and there are "Failed to convert X to number" messages in the debug log, even though the second argument has a valid integer conversion.

JohnBrinkman commented 5 months ago

Hmm. I think you're sending us down a slippery slope here... I'm not sure what the right answer is, but some observations:

- When a parameter can be one of multiple types, you've inferred a rule that's not stated -- that we'd attempt to coerce to each type. Not an unreasonable assumption -- but also not explicitly stated.
- There's another inferred rule -- that the coercion of a second parameter is influenced by the resolved type of the first parameter. Again, not unreasonable -- but also not explicitly stated.
- We need to look at other cases where parameters can have multiple types. e.g. the first parameter to hasProperty() can be one of: string | array | object. What happens when we receive a number? It can be coerced to either a string or an array. We'd need a rule for that case...
- I don't think that expressions like value(`false`, 0) or value([1, 2, 3], `false`) make sense -- even with coercion. So I don't really mind if they fail.

Some options to consider:

  1. Add a rule that says: When a function parameter can have multiple types, we do not attempt any type coercion. The value must be an exact match to one of the types.
  2. Stop coercing values to arrays. I think this rule causes many of the issues you've identified.
  3. Add specific rules about how we'd coerce function parameters when there are multiple options.

Are there other alternatives? I think I'd be happiest with option 1.

Eswcvlad commented 5 months ago

> When a parameter can be one of multiple types, you've inferred a rule that's not stated -- that we'd attempt to coerce to each type. Not an unreasonable assumption -- but also not explicitly stated.

Yes, that's where a lot of the issues I had came from. It is pretty reasonable to assume that allowing more types for an argument won't disable implicit type coercion. But there is no explicit ruling on how to handle it...

I've tried coming up with a scheme such as "try type coercion in the order of the types in the sum-type definition and use the first compatible one", which usually came to the same result as the reference implementation, but it is rather insane, to be fair.
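That ordered-coercion scheme could be sketched roughly like this in Python (the helper names and coercion rules here are hypothetical, not the reference implementation's):

```python
# Hypothetical sketch of "try coercion in declared-type order, use the first
# that succeeds". The coercion rules below are illustrative, not normative.

def to_number(v):
    if isinstance(v, bool):
        return 1 if v else 0
    if isinstance(v, (int, float)):
        return v
    if isinstance(v, str):
        return float(v)  # raises ValueError for non-numeric strings
    raise TypeError(f"cannot coerce {v!r} to number")

def to_array(v):
    return v if isinstance(v, list) else [v]  # wrap scalars in a one-element array

COERCERS = {"number": to_number, "array": to_array}

def coerce_to_first(value, declared_types):
    """Try each declared type in order; return the first successful coercion."""
    for t in declared_types:
        try:
            return COERCERS[t](value)
        except (TypeError, ValueError):
            continue
    raise TypeError(f"none of {declared_types} accepts {value!r}")
```

With this rule, `coerce_to_first("abc", ["number", "array"])` skips the failed number coercion and yields `["abc"]`, which illustrates why the results tend to track the reference implementation while still being hard to specify cleanly.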

> There's another inferred rule -- that the coercion of a second parameter is influenced by the resolved type of the first parameter. Again, not unreasonable -- but also not explicitly stated.

I tried to avoid that, as it often opens an unpleasant can of worms, but in this case I assumed the spec's "a named child for an object or an integer offset for an array" was an explicit description of exactly this: if we have an array, use an integer; if an object, use a string. And in this case it still works reasonably well, as those types can be coerced into each other rather effortlessly.
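A minimal Python model of that reading (hypothetical, not the actual implementation):

```python
# Hypothetical model: the resolved type of the first argument decides how the
# second is coerced -- integer offset for arrays, string key for objects.

def value(subject, index):
    if isinstance(subject, list):
        # Booleans and numeric strings coerce to an integer offset. Note that
        # int() raises here, where the current implementation logs a message
        # and returns null instead.
        i = int(index)
        return subject[i] if 0 <= i < len(subject) else None
    if isinstance(subject, dict):
        return subject.get(str(index))  # non-string keys coerce to a string key
    raise TypeError("value() expects an array or object as the first argument")
```

Under this model `value([1, 2, 3], False)` resolves to offset 0 rather than failing.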

> We need to look at other cases where parameters can have multiple types. e.g. the first parameter to hasProperty() can be one of: string | array | object. What happens when we receive a number? It can be coerced to either a string or array. We'd need a rule for that case...

Yes, I have all those cases written down and wanted to bring them up. In some cases a value can be converted to multiple types, which causes issues. I'll open a separate issue with all of them, so that we can look at them as a whole.

> I don't think that expressions like value(`false`, 0) or value([1, 2, 3], `false`) make sense -- even with coercion. So I don't really mind if they fail.

I agree that this looks absurd, since it is :) . But those cases will occur "naturally" when the arguments are not literals, but complex expressions or loosely-typed data. That in turn could cause those annoying bug reports where you know the expression doesn't make sense, but it works in some library, so you might as well behave the same (cue unpleasant memories of working on .zip support :) ).

I would even argue that a case like value(x, `false`) is not that absurd. It can appear when you effectively have an if-expression stored as a two-element array and use a boolean expression to index it, e.g. choosing which data to send based on a check-box state.
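For concreteness, a Python stand-in for that scenario (the data and names are made up):

```python
# Hypothetical example: a two-element array acting as an if-expression,
# indexed by a boolean coerced to 0 or 1 (e.g. a check-box state).

payloads = ["data-when-unchecked", "data-when-checked"]  # made-up values

def pick(options, checked):
    return options[int(checked)]  # False -> index 0, True -> index 1

assert pick(payloads, False) == "data-when-unchecked"
assert pick(payloads, True) == "data-when-checked"
```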

> Add a rule that says: When a function parameter can have multiple types, we do not attempt any type coercion. The value must be an exact match to one of the types.

Seems like a fine solution. Though, imo, it is a bit unexpected to have implicit type coercion turned off like that. Also, this might cause backwards-compatibility concerns if we were at some point to "upgrade" a single-type argument to a sum type, which would cause corner cases to pop back up, and they might be rather odd to write up in the spec.

> Stop coercing values to arrays. I think this rule causes many of the issues you've identified.

I don't think it will solve all of them, but it would remove a lot of the somewhat pointless cases I've encountered. It is very easy to "cast" to an array, and there is always toArray. This would also fix the inconsistency with bracket expressions, which already strictly expect an array. In some ways the rule is not consistent with itself, as object -> array conversion is not allowed (though I think I saw a function that did it anyway), while it is not that different from the other cases...

Also, off the top of my head, I cannot think of an existing language that does such conversions implicitly, except for the "string is an array of code points" case.

Overall, this would be a 👍 from me.

> Add specific rules about how we'd coerce function parameters when there are multiple options.
>
> Are there other alternatives?

I was trying to figure out such a rule, with exceptions written in the function descriptions themselves, but the more I thought about it, the more convoluted it became to handle everything...

In the end, I think the easiest way would be to not think about it as type coercion anymore, but to interpret all those cases as functions accepting any, where the specified types are what the function will try to convert to internally, which can throw a TypeError. I.e. make the callee responsible for type coercion instead of the caller.

My rationale is that in the majority of cases where this pops up, the functions would often be separate, but are hiding under one name for convenience. You could say that we are, practically speaking, calling a type-erasing proxy function, which forwards the call to internal functions based on the types provided, as documented in the function description.

E.g. contains might have been String::contains and Array::contains, which don't really overlap in their logic per se. We also have a case like deepScan, where the second argument's type depends on what it is currently iterating over, which also fits this model better. Or value, where there are (object, string) and (array, integer) functions hiding inside, but as it stands you can interpret it as having all 4 variations.
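The type-erasing-proxy idea could look roughly like this in Python (the internal function names are hypothetical; this is not how the JS implementation is structured):

```python
# Sketch of the proxy idea: a public contains() that dispatches on the
# subject's type to internal, separately-coercing functions. The internal
# names are hypothetical.

def _string_contains(subject, search):
    return str(search) in subject  # string variant coerces search to string

def _array_contains(subject, search):
    return search in subject  # array variant compares elements as-is

def contains(subject, search):
    if isinstance(subject, str):
        return _string_contains(subject, search)
    if isinstance(subject, list):
        return _array_contains(subject, search)
    raise TypeError("contains() expects a string or array subject")
```

The proxy itself never coerces; each internal variant applies only the coercions that make sense for its own subject type.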

Also, if we were to remove the array coercion rule, some of this would become easier as well (like the contains case).

What do you think?

Btw, sorry for spamming issues this much. Writing in C++ again brought the "language lawyer" thinking back with it.

JohnBrinkman commented 5 months ago

> Stop coercing values to arrays

Thinking on this some more, I'm worried that we haven't thought through all the cases where it's useful. e.g.:

- The min/max functions accept array parameters. e.g. max([1,2,3], [2,3,4]). But with implicit coercion we can also call: max(1,2,3,2,3,4).
- The union operator. Currently we can say: [1,2,3] ~ 4 and it will implicitly turn that into a union of two arrays: [1,2,3] ~ [4]

And, while stopping array coercion fixes many problems, it doesn't fix them all. So by itself it's not a solution; it would have to be combined with some other rule change.

> When a function parameter can have multiple types, we do not attempt any type coercion

> it is a bit unexpected to have implicit type coercion be turned off like that.

Perhaps. But I'd argue that an unpredictable coercion is worse.

> Also this might cause backwards-compatibility concerns, if we were at some point "upgrade" a single type argument to a sum-type, which will cause corner cases to pop back up

But this is true in general -- we can't change any parameter types in a backward-compatible way. If we really wanted to change parameter types we'd have to introduce new functions instead.

> I think the easiest way would be ... make the callee responsible for type coercion instead of the caller.

Well, I wouldn't call this the easiest solution. It would be a large change to the JS implementation and spec. IMO having type coercion centralized is more consistent than having function-specific coercion rules.

But I think you are correct in that we can't create rules that allow us to do all coercions before the function call.

> Or value(), where there are (object, string) and (array, integer) ... how it is now you can interpret as it having all 4 variations.

Well, actually... the way it is now there are many more than 4 variations -- given the various coercions of different types to arrays, strings and integers. But if we stopped coercing in cases where there are multiple parameter types, then we would narrow it down to 4 variations. And in those cases, I'm ok with the idea that we do not try to resolve this before the function call, but let the individual functions define the rules on how to handle the permutations -- and apply coercions internally.

I guess there is yet another option: stop allowing multiple parameter types. In each case, create multiple functions. e.g. Instead of value() we could have objectValue() and arrayValue(). (I don't mention this because I like the option -- I just bring it up for completeness).

I think my current leaning is:

- Leave array coercion alone.
- Stop coercing when there are multiple parameter types.
- When there are multiple parameters and resulting permutations, have individual functions define (and implement) the rules for the combinations.

Eswcvlad commented 5 months ago

> Thinking on this some more, I'm worried that we haven't thought through all the cases where it's useful. e.g.:
>
> The min/max functions accept array parameters. e.g. max([1,2,3], [2,3,4]). But with implicit coercion we can also call: max(1,2,3,2,3,4).

I wanted to bring this up, but for safety you would still want to write max([a,b,c]) instead of max(a, b, c), as the latter will skip null, which may lead to an error. I also personally like max(a, b, c) more, but null throws a curveball.
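A Python stand-in for that safety concern (assuming a varargs form that skips null, as described above -- not necessarily the exact json-formula semantics):

```python
# Illustration: a varargs max() that silently skips null can mask missing
# data, while max([a, b, c]) over an explicit array keeps the null visible.

def max_varargs(*args):
    vals = [a for a in args if a is not None]  # nulls silently dropped
    return max(vals) if vals else None

a, b, c = 3, None, 7
assert max_varargs(a, b, c) == 7  # the null vanishes with no signal
```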

> The union operator. Currently we can say: [1,2,3] ~ 4 and it will implicitly turn that into a union of two arrays: [1,2,3] ~ [4]

I was also thinking of this... But my thinking was that adding two brackets doesn't make it much messier, and also that in Python, where something like [1,2,3] + [4] is allowed, [1,2,3] + 4 is still a type error.

But I get your point, yes.
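The Python comparison, for concreteness:

```python
# Python's behavior for reference: list + list concatenates, but
# list + scalar is a TypeError -- no implicit wrap of 4 into [4].

assert [1, 2, 3] + [4] == [1, 2, 3, 4]

try:
    [1, 2, 3] + 4  # deliberate type error
    raise AssertionError("expected a TypeError")
except TypeError:
    pass
```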

> Perhaps. But I'd argue that an unpredictable coercion is worse.

True

> But this is true in general -- we can't change any parameter types in a backward-compatible way. If we really wanted to change parameter types we'd have to introduce new functions instead.

I was thinking of cases where it would have been a TypeError before. For example, if power working on arrays were reintroduced. Though in such a case you would still need to explicitly document null, bool and string either way because of the ambiguity, so it might not be important anyway...

> Well, I wouldn't call this the easiest solution. It would be a large change to the JS implementation and spec. IMO having type coercion centralized is more consistent than having function-specific coercion rules.

I assumed in the code it would mostly boil down to changing the types to ANY in the signatures internally and making sure casts are present in the bodies. But I haven't looked through the code enough to know, so I will trust you on that. :)

> I guess there is yet another option: stop allowing multiple parameter types. In each case, create multiple functions. e.g. Instead of value() we could have objectValue() and arrayValue(). (I don't mention this because I like the option -- I just bring it up for completeness).

Yeah, this would look pretty out-of-place and unnecessarily verbose for such a language.

> I think my current leaning is: leave array coercion alone; stop coercing when there are multiple parameter types. When there are multiple parameters and resulting permutations, have individual functions define (and implement) the rules for the combinations.

To be fair, we always have !!x, 0+x (unary plus would be useful here), ""&x, `[]`~x to force type coercion if really necessary. So this might be the easier way to go.

I think we would also need to add null explicitly in such cases, as this one would be pretty common.

JohnBrinkman commented 5 months ago

Summary:

  1. Some functions allow parameters with multiple types (e.g. left() allows a string or array). In these cases coercion can be ambiguous. e.g. a number can be coerced to either a string or array.

  2. Some functions allow combinations of parameters with multiple types, and the interpretation of the second parameter is often dependent on the type of the first. e.g. contains() allows {array|string} subject (the element to be searched) and {string|boolean|number|null} search (the element to find). If the first parameter is a string, the second must also be a string. If the first is an array, the second can be any of: string|boolean|number|null.
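The dependent-type rule could be expressed as a small validation table. Here is a Python sketch of the summary above (the helper names are hypothetical; the type sets follow the contains() description):

```python
# Sketch of the dependent-type rule: the allowed types of the second
# parameter depend on the resolved type of the first.

ALLOWED_SEARCH_TYPES = {
    str: (str,),                                 # string subject -> string search only
    list: (str, bool, int, float, type(None)),   # array subject -> wider set
}

def check_contains_args(subject, search):
    """Validate the (subject, search) combination per the table above."""
    for subject_type, allowed in ALLOWED_SEARCH_TYPES.items():
        if isinstance(subject, subject_type):
            if not isinstance(search, allowed):
                raise TypeError("search type not allowed for this subject type")
            return True
    raise TypeError("subject must be a string or array")
```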

Recommendation

JohnBrinkman commented 4 months ago

agreed

Eswcvlad commented 4 months ago

@JohnBrinkman Is there a mention of "Stop coercing when there are multiple parameter types. The provided parameter must be an exact match to one of the allowed types" in the spec now? I cannot find it...

JohnBrinkman commented 4 months ago

You're right. That part is missing.