brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Rework package sam/expr/coerce #5086

Closed: mattnibs closed this 6 months ago

mattnibs commented 6 months ago

Rework the coerce package so that it uses native values instead of their byte representation.

Fixes #4719
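As a hedged illustration of the idea (not the actual sam/expr/coerce API; the function below is hypothetical), coercing via native values means converting both operands to a common Go type before operating on them, instead of mixing their serialized byte encodings:

```go
package main

import "fmt"

// toFloat64 widens a native numeric value to float64, the widest common
// type. Hypothetical sketch; the real coerce package works on Zed
// values, not empty interfaces.
func toFloat64(v any) float64 {
	switch n := v.(type) {
	case int64:
		return float64(n)
	case uint64:
		return float64(n)
	case float32:
		return float64(n)
	case float64:
		return n
	}
	panic("unsupported type")
}

func main() {
	// 5(float32) - 4(int64), coerced as native values: the answer is 1,
	// regardless of how either operand happens to be encoded on disk.
	fmt.Println(toFloat64(float32(5)) - toFloat64(int64(4))) // prints 1
}
```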

mattnibs commented 6 months ago

Shout out to @nwt for doing the heavy lifting on this PR; I merely updated it and helped get it across the finish line.

philrz commented 6 months ago

I just did some testing with this branch at commit 5d59954 using the repro steps from #4719, and it confirms that this branch seems to fully address what's covered in that issue. I've therefore marked that issue as "Fixed" by this PR.

philrz commented 6 months ago

We just discussed at a team meeting that I might add some automated tests based on the #4719 repro. I'm happy to do that via a push to this branch, or as a separate PR if this merges first. In preparation, I eyeballed the output of this exhaustive run through subtraction on all the numeric types:

$ for type1 in $(cat types.txt); do for type2 in $(cat types.txt); do echo -n "$type1 $type2: "; echo "{num1: 5($type1), num2: 4($type2)}" | zq -z "yield num1 - num2" -; done; done

Comparing the tip of main (where this issue currently still exists) against this "rework" branch at commit 5d59954, here's a summary of what changed:

$ diff -y main.out rework.out | grep '|'
uint32 float16: 1.(float16)                     | uint32 float16: 1.(float32)
uint64 float16: 1.(float16)                     | uint64 float16: 1.
uint64 float32: 1.(float32)                     | uint64 float32: 1.
int32 float16: 1.(float16)                      | int32 float16: 1.(float32)
int64 float16: 1.(float16)                      | int64 float16: 1.
int64 float32: 1.(float32)                      | int64 float32: 1.
float16 uint32: 1.(float16)                     | float16 uint32: 1.(float32)
float16 uint64: 1.(float16)                     | float16 uint64: 1.
float16 int32: 1.(float16)                      | float16 int32: 1.(float32)
float16 int64: 1.(float16)                      | float16 int64: 1.
float16 float32: -Inf(float16)                  | float16 float32: 1.(float32)
float16 float64: -Inf(float16)                  | float16 float64: 1.
float32 uint64: 1.(float32)                     | float32 uint64: 1.
float32 int64: 1.(float32)                      | float32 int64: 1.
float32 float16: -17403.(float32)               | float32 float16: 1.(float32)
float32 float64: -4616189618054758400.(float32) | float32 float64: 1.
float64 float16: -17403.                        | float64 float16: 1.
float64 float32: -1082130427.                   | float64 float32: 1.

Putting aside the fixes to the grossly incorrect math that appear at the bottom (all of which is excellent improvement), I can also see that we're now using the "bigger" type for math between other mixed types: e.g., the result of math between uint32 and float16 now lands in a float32 instead of a float16 as it did before. This makes sense to me, and having eyeballed the rest of the table above, the other changes all similarly make sense. So the test I'll propose would effectively lock in the improved behavior as the new normal.
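For future test-writing, the right-hand column is consistent with a simple "widest bit width wins, result is a float" rule. That's just my reading of the table, not necessarily how the package implements it; the helper below is hypothetical:

```go
package main

import "fmt"

// width maps each numeric type name to its bit width. Hypothetical
// lookup for illustrating the promotion rule, not part of zed.
var width = map[string]int{
	"int8": 8, "uint8": 8, "int16": 16, "uint16": 16, "float16": 16,
	"int32": 32, "uint32": 32, "float32": 32,
	"int64": 64, "uint64": 64, "float64": 64,
}

// mixedFloatType returns the float type wide enough to hold math
// between a float and another numeric type, per the table above.
func mixedFloatType(a, b string) string {
	w := width[a]
	if width[b] > w {
		w = width[b]
	}
	return fmt.Sprintf("float%d", w)
}

func main() {
	fmt.Println(mixedFloatType("uint32", "float16")) // float32, as in the new output
	fmt.Println(mixedFloatType("int64", "float32"))  // float64, which zq -z prints undecorated as "1."
}
```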