brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Auto-casting of join keys #5071

Open philrz opened 6 months ago

philrz commented 6 months ago

tl;dr

A user suggested that Zed automatically cast join keys as necessary to increase the likelihood of a match without needing to explicitly cast values to comparable types.

Details

Repro is with Zed commit 38763f8.

A community zync user asked the following:

Now that we have automatic sorting on joins, it would also make sense to me to do automatic casting of the left key. I can't count the number of times I had to cast() the left key, and how many hours I spent figuring out how to get my joins working because of that :slightly_smiling_face: + I am guessing zed probably knows what the cast should actually be, based on the right key type?

They offered this specific example:

...when comparing time values with string dates (e.g., “2024-03-01”). In such cases, the string dates must be cast to the appropriate time type for meaningful comparisons.
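To make the request concrete, here is a rough Python sketch of what "cast the left key based on the right key type" could mean. It is not how Zed implements joins. The helper names `cast_like` and `inner_join_autocast` are hypothetical, only the string-to-time case from the user's example is handled, and date strings without a zone are assumed to be UTC.

```python
from datetime import datetime, timezone

def cast_like(value, target):
    """Try to cast value to the type of target; return None if no cast is known."""
    if type(value) is type(target):
        return value
    if isinstance(target, datetime) and isinstance(value, str):
        try:
            parsed = datetime.fromisoformat(value)
        except ValueError:
            return None
        # Assumption: strings without a zone are interpreted as UTC.
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
    return None

def inner_join_autocast(left, right, left_key, right_key):
    """Inner join that casts each left key to the type seen on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    # Use any right-side key as the type witness for the cast.
    sample = next(iter(index), None)
    out = []
    for l in left:
        key = cast_like(l[left_key], sample) if sample is not None else l[left_key]
        for r in index.get(key, []):
            # The left record keeps its original (uncast) key value.
            out.append({**l, "otherword": r["word"]})
    return out

left = [{"datestr": "2024-03-01", "word": "hello"}]
right = [{"timeval": datetime(2024, 3, 1, tzinfo=timezone.utc), "word": "goodbye"}]
joined = inner_join_autocast(left, right, "datestr", "timeval")
```

One open design question this sketch glosses over is what to do when the right side holds keys of mixed types, since a single "type witness" is then no longer enough.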

Here's a repro of that. As the user described, this first attempt fails because the left key is a string while the right key is a time.

$ zq -version
Version: v1.14.0-16-g38763f82

$ cat datestr.zson 
{datestr: "2024-03-01", word: "hello"}

$ cat time.zson 
{timeval: 2024-03-01T00:00:00Z, word: "goodbye"}

$ cat join.zed 
file datestr.zson
| inner join (
  file time.zson
) on datestr=timeval otherword:=word

$ zq -I join.zed 
[no output]

However, we can force it to work if we cast the left key to time type before the join.

$ cat join-with-cast.zed 
file datestr.zson
| datestr:=time(datestr)
| inner join (
  file time.zson
) on datestr=timeval otherword:=word

$ zq -I join-with-cast.zed 
{datestr:2024-03-01T00:00:00Z,word:"hello",otherword:"goodbye"}
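The two runs above can be mimicked outside of Zed. The Python sketch below is only an analogy for the typed comparison: a strict-equality join finds no match between a string and a time value, while casting the left key first produces the match. The `to_time` helper is a hypothetical stand-in for Zed's `time()` cast and assumes a date string without a zone, read as UTC.

```python
from datetime import datetime, timezone

def inner_join(left, right, left_key, right_key):
    """Strict inner join: keys match only if equal, with no type coercion."""
    out = []
    for l in left:
        for r in right:
            if l[left_key] == r[right_key]:
                out.append({**l, "otherword": r["word"]})
    return out

def to_time(s):
    """Hypothetical stand-in for time(): parse an ISO date string as UTC."""
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc)

left = [{"datestr": "2024-03-01", "word": "hello"}]
right = [{"timeval": datetime(2024, 3, 1, tzinfo=timezone.utc), "word": "goodbye"}]

# First attempt: a string key never equals a time key, so nothing is emitted.
no_cast = inner_join(left, right, "datestr", "timeval")

# Second attempt: cast the left key upstream of the join, as in join-with-cast.zed.
cast_left = [{**l, "datestr": to_time(l["datestr"])} for l in left]
with_cast = inner_join(cast_left, right, "datestr", "timeval")
```

As in the Zed output, the cast version carries the time value rather than the original string in `datestr`.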

While reproducing this, I also noticed a related limitation that might deserve its own issue, but I'll log it here for now: the user currently has to do the cast upstream in the pipeline, because casting directly in the join's key-equality expression causes a syntax error.

$ cat join-other-cast.zed 
file datestr.zson
| inner join (
  file time.zson
) on time(datestr)=timeval otherword:=word

$ zq -I join-other-cast.zed 
zq: error parsing Zed in join-other-cast.zed at line 4, column 10:
) on time(datestr)=timeval otherword:=word
     === ^ ===