brimdata / super

An analytics database that puts JSON and relational tables on equal footing
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 66 forks source link

document coercion/overflow in language.md #3706

Open mccanne opened 2 years ago

mccanne commented 2 years ago

I wrote this up in the draft zng-beta spec knowing it went elsewhere, so I'm now taking it out of the spec and putting it here for someone to find a home in the zql docs for this. We should also discuss and make sure these are the semantics we want.

## Appendix B. Recommended Type Coercion rules

> TBD: it might be better to put this in the ZQL docs

While outside the scope of the ZNG format specification, we include here
some suggested rules for converting types when mixed-type operations occur,
e.g., adding an uint32 field to an int32 fields or aggregating a stream of
mixed-type numeric values in a sum operator.

The age old question is "does unsigned(-1) equal unsigned(maxint) and thus
that signed -1 is larger than unsigned 1 in a coerced comparison?"
The standard SQL specification goes so far as to avoid unsigned types altogether
to avoid this confusion.  However, since unsigned types are prevalent
in the real world, and we want ZNG to be a reliable and complete language-independent
model for communicating structured data, ZNG embraces the unsigned type.

Given the dynamic typing nature of ZNG streams (e.g., x in one record might
be a uint8, in another an int64, and in still another, a string), type coercion
is important for ergonomic use, and implementations are thus encouraged
to handle mixed-type operations robustly.

For systems that perform analytics directly on ZNG, the following coercion
patterns are recommend for logical comparisons of numbers, arithmetic operations,
or streaming aggregations over numbers:
* For float32 and float64, the float32 is converted to float64.
* For float32 and any integer type, the integer is converted to float32
and any loss of precision causes no error.
* For float64 and any integer type, the integer is converted to float64
and any loss of precision causes no error.
* For integers of same signed-ness but different widths, the smaller width
type is converted to the wider type.
* For any signed and unsigned integers smaller than 64 bits, the unsigned value
is converted to the corresponding signed type if possible, and otherwise,
both are converted to the widest signed type that will allow conversion of
the unsigned value unless the unsigned value cannot be converted,
in which case an overflow error occurs.   e.g., uint8(255) and int8(-1), are
converted to int16(255) and int16(-1), but uint64(2^32) and any signed value
will result in overflow.
* For a time or duration with a number, automatic coercion is not performed
and casts or conversion functions should be used.
* For a string with number, automatic coercion is not performed
and casts or conversion functions should be used.

Also,
* numeric constants should be int64 or float64 unless cast, which means
comparisons with constants will generally be coerced to these types and results
of mathematical operations with constants will be promoted as well;
* times and durations may be added, resulting in a time;
* times may be subtracted, resulting in a duration; and,
* a "plus" operator applied to two strings, implies concatenation,
but a "plus" applied to a string and is a type mismatch and casts
or conversion functions should be used.

As a counterpoint, javascript does all sorts of type coercion in particular

"a" + "b" == "ab"

(as in many other languages) but it also does crazy things like

[] + 1 == "1"

but

{} + 1 == 1

@henridf says:

- but a "plus" applied to a string and is a type mismatch and casts
+ but a "plus" applied to a string and another type is a type mismatch and casts
philrz commented 2 years ago

FYI for whoever might pick this up, back in the day on our internal wiki there was a partial write-up on this topic at https://lookylabs.atlassian.net/wiki/spaces/PROD/pages/159678469/Type+Coercion. Most of the specifics are probably too ancient to be relevant at this point, but there might be a couple nuggets of wisdom buried in there.

I spotted #1408 that looked like a duplicate of this one, so I've consolidated the text from there into the description of this issue and closed that one.