CESQL v1 review - Githubissues

Cali0707 commented 6 months ago

We are hoping to move the CESQL spec to v1 in the next few weeks. The spec can be found here. This issue is a tracker issue for all discussions related to any remaining changes that need to be made before we can move this spec to v1.

If you have any concerns/questions about the spec please leave them below and we can discuss them!

Once you have finished reviewing the spec, if you think it is ready to go please react to this issue with a :+1:

jskeet commented 6 months ago

I'll add these in one comment, but we can separate them out if necessary. This is from a fairly brief skim - I thought this would be a suitable first set to get more of a feeling for what's useful.

2.1: (Bug?) expression includes boolean-literal but not number-literal or string-literal. Perhaps it should just include literal to cover all three?

2.2: (Suggestion) Given that the CE type is called "Integer" rather than "Number", perhaps number-literal should actually be integer-literal? That would allow for later expansion to floating point numbers

2.2: (Bug?) number-literal appears to preclude negative numbers. That might be deliberate, expecting the use of unary-operator to handle it, but I suspect it would be cleaner to include the sign within the literal itself. Apart from anything else, there might be issues otherwise where -2147483648 has to be parsed as an invalid positive integer literal (it's out of range) combined with the unary - operator

2.2: (Validation?) 2.2 specifies a maximum length for value identifiers, but doesn't impose any restrictions on number-literal - which means I don't know whether the value of -2147483648 above is actually a bug or not. Is there an expectation that integer sub-expressions can be out of range, so long as a the result of evaluation is in range? As another example, should 20000000000 / 10 be evaluated without error because the result is within the range of Integer, even though 20000000000 itself is not? What about 2000000000 * 10? (Does that raise an error? Some math operations are defined in terms of returning a value and raising an error, but I don't see anything about bounds checks.)

2.3: (Could be added later) the lack of a unary + operator isn't a huge issue, but it might be considered for symmetry. (There are times where it could lead to most consistent formatting, for example.)

3.5: Should an engine allow pre-defined functions to be overloaded by user-defined ones? (Could I add my own ABS(x, y): String, String -> String function for example?)

3.5.2: How should ABS(-2147483648) behave?

3.6: (Clarity) "Operators MUST be evaluated in order" - what order? Left to right? The continuation of "where the parenthesized expressions have the highest priority over all the other operators" suggests that "1 + (5 + 2)" should possibly be evaluated as "5 + 2" followed by "1 + 7", whereas I'd expect left-to-right evaluation order still (so 1, then 5+2, then 1+7).

Cali0707 commented 6 months ago

@jskeet thanks so much for pointing these out! My thoughts on your points (in order) are:

2.1(Bug): This seems to be a bug in the spec - the actual Parser file has all three 2.2(Suggestion) This seems to be a reasonable change to me, let's do it! 2.2(Bug) Yeah I see your point, let me try and update the number-literal definition to allow either + or - in the front (and rename it to integer-literal) 2.2(Validation): I believe in the SDKs we do validate that the literals do not overflow the 32 bit allowed integer values. We should add some clarification to the spec that integer literals MUST be a valid 32 bit integer value, otherwise there will be a parse error 2.3 My vote is to try and keep the number of changes to a minimum for now, so since this isn't a breaking change we should leave it out for now 3.5 IMO, we shouldn't allow overloading of pre-defined functions (at least not for the same arity as the pre-defined function). We should add text indicating that would be an error. 3.5.2: I'm not sure what would make the most sense here, but my initial guess would be to return a math error? WDYT? 3.6: Yes you are correct this should be clarified. I agree that left to right with parenthesized expressions taking priority when encountered makes sense. So "1 + (5 + 2)" would be evaluated as 1, 5+2, 1+7.

I'll work on opening a PR for these changes by tomorrow. Thanks again for taking a look, and let me know if you find anything else!!

jskeet commented 6 months ago

On 2.2, I should probably have split the comment into two parts - you've answered about literals, but we probably need something for all math operations.

3.5.2: I think it should raise an error, but return the max value.

(Overall, the whole "return a value but raise an error" is a bit confusing as an approach - I can see the reasoning, but it feels odd to me.)

duglin commented 6 months ago

Comments:

Section 1

let's turn SQL standard (ISO 9075) into a link
"CESQL is a Total pure functional programming language", lowercase "T" on "Total"... I think
The output of a CESQL expression evaluation is always a boolean, an integer or a string, and it might include an error.. It's not 100% clear if the intent is to say "It might ALSO include an error with along with one of the 3 types" or whether an error is a 4th type of return value. Perhaps: "... is always a boolean, an integer or a string. As a secondary return value, the express MAY also include a set of errors". Or something like that. It could just be my reading of it though.
When used as a filter predicate, the expression output value is always cast to a boolean value. let's make this normative via ...output value MUST always be case to...
What other uses of this spec do we expect other than "filter predicate" ? That might help with my final concern at the end of this comment.

Section 2.2

MUST NOT be greater than 20 characters in length, this should probably be SHOULD NOT instead since that's what's in the CE spec
See https://github.com/cloudevents/spec/issues/1285 - this could influence the CESQL spec. Perhaps for now word it to say that identifiers MUST follow the same naming convention as CE and then if CE changes this spec will automatically change
minor thing, but since you use + for number-literal, you could use it on value-identifier ::= ( lowercase-char | digit ) ( lowercase-char | digit )* too, meaning value-identifier ::= ( lowercase-char | digit )+
Because string literals can be either '' or "" delimited, in the former case, the ' has to be escaped, while in the latter the " has to be escaped. - perhaps add something like to be used within the string-literal at the end. Just to make it clear why we're mentioning this.

Section 3.4.1

what would -STRING mean? Probably an error. I wonder if we should be explicit about that? Same for other tables too. For example what does "boolean >= boolean" map to?

Section 3.4.2

sometimes we say things like Returns true if x is strictly less than y - and use "strictly" but other times we don't. I'm not sure what "strictly" means in these cases. Can we be consistent on its usage, and if we keep it, define it? I'm not sure the word is needed.
is the word "truncated" clear to everyone?
The modulo and divisions MUST follow... - use of the word "modulo" is new in this sentence. I might be good to use it in the table on the % line to make sure there's a clear linkage. Or change this to The modulo(%) and divisions(/) operations MUST follow...

Section 3.5

and defines some built-in that every engine MUST implement., feels like it's missing a word, perhaps and defines some built-in ONES that every engine MUST implement.. ??
An engine SHOULD also allow users to define their custom functions. Perhaps: An engine SHOULD also allow users to define custom functions, however, the mechanism by which this is done is out of scope of this specification.. Removed the word "their" and added stuff at the end
@jskeet asked if people can override spec defined functions... I don't see why not as long as they don't break the rules defined (meaning ...MUST be able to distinguish between two functions defined with the same name but different arity.). I'm not a fan of spec authoring have special rights that extension authors do not - even though there are times it's necessary.
the example in here of valid and invalid overloads feels like it's missing a "baseline" ABC function to know what's already in the system to determine what's a valid overload or not. No?
When a function invocation cannot be dispatched, the return value is undefined.. Should it return false+error?

Section 3.5.1

on CONCAT_WS can we be more explicit and say that the delimiter is only used between each string, never before or after the leading/trailing strings?
Does TRIM remove control characters or just space/tab/lf/cr ? We may want to be a bit more precise here
on LEFT/RIGHT, when "y" is too small/large, I'm a bit surprised the return value is 'x' instead of an empty string. Why is that? SUBSTRING seems to do this.
pos > LENGTH(x) OR pos < -LENGTH(x) for SUBSTRING, I don't get the -LENGTH(x). Shouldn't it be pos < 0 ?

Section 3.7

This section defines an ambiguous operator/function as an operator/function that is overloaded with another operator/function definition with same symbol/name and arity but different parameter types. This seems odd to me since doesn't the spec ban this situation? If so, it seems we don't need to state whether things are ambiguous or not in each item in the list. It's an error case and we can just say in one spot what happens... it's an error. No?
if we do ^^^ then do we need that ordered list of steps? Just say "cast each parameter to the specified type defined by the function". No?
I think we need more details of how to map all combinations. E.g. bool->int, bool->string, string->int, string->bool, ....

I think each case where we say the return value is undefined, we should try really hard to pick a null/zero/false type of value to help interop - even if an error is generated.

The entire idea of returning an error feels a bit under-specified because nothing is said about what people can do with that info, or how they get/see it. I can't help but wonder if that's a carry-over from the days when this was a more generic processor and not just a CE filter expression evaluator. If this is really just for CE filters, then perhaps we should change things so that we can say that certain cases are errors but as of now the error is dropped (or perhaps logged), and the result will be xxx. That then leaves the door open for a future version of the spec to say what should happen with the errors. But to say that the error is "returned" along with the result feels like we're teasing the user/reader by implying they should have access to it (since it's "returned"), but they don't - at least not per the spec.

Cali0707 commented 6 months ago

@jskeet @duglin I've opened a PR to address these issues in the spec, if you could TAL when you have some time that would be great!

@duglin I haven't addressed your feedback here in that PR yet, I will try and push those changes soon...

Cali0707 commented 6 months ago

What other uses of this spec do we expect other than "filter predicate" ? That might help with my final concern at the end of this comment.

@duglin one other use case I have been exploring in Knative is allowing users to use CESQL to add new attributes to events based on other attributes as they are processed (for example based on an original event when sending a reply event to a trigger).

The entire idea of returning an error feels a bit under-specified because nothing is said about what people can do with that info, or how they get/see it. I can't help but wonder if that's a carry-over from the days when this was a more generic processor and not just a CE filter expression evaluator. If this is really just for CE filters, then perhaps we should change things so that we can say that certain cases are errors but as of now the error is dropped (or perhaps logged), and the result will be xxx. That then leaves the door open for a future version of the spec to say what should happen with the errors. But to say that the error is "returned" along with the result feels like we're teasing the user/reader by implying they should have access to it (since it's "returned"), but they don't - at least not per the spec.

One use case for the errors in the filtering context is that it makes for nice metrics reporting in terms of what actually went wrong. I agree we should specify what errors are possible - maybe we can leverage some of the info from the tck tests for that? In those, we have a clearly defined set of possible errors. I think we should just explain this more in the spec. WDYT?

duglin commented 6 months ago

Let's also make sure the spec is clear that string compares are done "case sensitively" .

Cali0707 commented 6 months ago

sometimes we say things like Returns true if x is strictly less than y - and use "strictly" but other times we don't. I'm not sure what "strictly" means in these cases. Can we be consistent on its usage, and if we keep it, define it? I'm not sure the word is needed.

Having looked in more detail, this different usages seem correct to me. Often times in a more mathematical sense, "x is strictly less than y" less than is used to imply x < y, wile "x is less than y" implies x <= y. In other words, the usage of "strictly" means that the value most be less than, and cannot be equal. But, I do agree that this can seem confusing/unclear without the necessary context. Maybe we should define it's usage or rephrase these?

is the word "truncated" clear to everyone?

Personally, it wasn't when I initially read the table of operations, but there is a link to it's definition right after the table. Maybe we should also include this link where we mention it in the table?

duglin commented 6 months ago

"x is less than y" implies x <= y - hmmm that's a bit hard swallow, especially when the "x <= y" line says "less than or equal to". Let's just drop the word "strict" and say "x is less than y" and "x is less than or equal to y", and avoid any unnecessary confusion.

re: "truncated" - often it's less confusing and easier to just inline references so people don't need to go looking for clarification. So, my suggestion is to just modify the cell that defines "%" and "/" to include the appropriate text all in one spot. E.g. "%" would be something like: "The remainder of x divided by y, where the result has the same sign as x. If y is zero, then the result is zero and raises an error". If we do that we can drop the 3 lines after the table all together.

jskeet commented 6 months ago

I agree with Doug here. I suspect it would be surprising to 99% of developers if you said "Yes, 5 is both less than 5 and greater than 5". If you get people to read out loud x < y I suspect most people would say "x less than y" rather than "x strictly less than y", and if you get people to read out loud x <= y I suspect most people would say "x less than or equal to y".

Cali0707 commented 6 months ago

on LEFT/RIGHT, when "y" is too small/large, I'm a bit surprised the return value is 'x' instead of an empty string. Why is that? SUBSTRING seems to do this.

This is to make the functions as compatible as possible with their normal behaviour in SQL. Normally (as far as I can tell), LEFT and RIGHT will behave this way, probably so that you can use them to put some upper bound on the length of a string, selecting characters starting at one end or the other

pos > LENGTH(x) OR pos < -LENGTH(x) for SUBSTRING, I don't get the -LENGTH(x). Shouldn't it be pos < 0

This should be clarified, the idea is that if the position is negative the beginning of the substring is the last character of the string

Cali0707 commented 6 months ago

This section defines an ambiguous operator/function as an operator/function that is overloaded with another operator/function definition with same symbol/name and arity but different parameter types. This seems odd to me since doesn't the spec ban this situation? If so, it seems we don't need to state whether things are ambiguous or not in each item in the list. It's an error case and we can just say in one spot what happens... it's an error. No?

I think that this is not possible for functions (due to the restrictions in the function overload section). However, it is possible for operators to be ambiguous so I think we should just narrow this section to only allow ambiguous operators.

markpeek commented 5 months ago

I looked over the current CESQL spec and it looks good to me. A minor nit is some uses of “that is” do not have a comma after it.

Example: A function can be variadic, that is the arity is not fixed. should it be: A function can be variadic, that is, the arity is not fixed.

Another case: CESQL allows overloading, that is the engine MUST be able...

But some uses aren’t using “that is” for “in other words” This section defines an ambiguous operator as an operator that is overloaded…

I'd suggest auditing "that is" in the spec to verify how it is used and correct if you agree with both leading and trailing commas.

duglin commented 5 months ago

@Cali0707 do we need this issue any more?

Cali0707 commented 5 months ago

No, I think we can close this

cloudevents / spec

CESQL v1 review #1282