brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

initial mergejoin proc #1617

Closed mccanne closed 3 years ago

mccanne commented 3 years ago

This work involves adding an initial cut at a mergejoin proc that uses the -P parallel input feature of #1616.

This join will not be anything fancy and simply will join sorted inputs from a left parent and a right parent. It is up to the person writing the query to make sure the inputs are sorted by their join keys or the output is undefined. In future versions of join, we can allow the user to omit the necessary sorts and have query planner insert the sorts where needed.

The nice thing about this initial approach is that it re-uses all our existing machinery (in particular, spillable sorts) so that we can scale to arbitrarily large joins in a single node deployment (of course, at reduced performance due to spilling but with SSDs it won't be too bad).

This simple merge join would look like this...

zq -P '( sort ip.resp_h ; sort ip_key ) | join ip.resp_h=ip_key | ...' logs.zng intel.zng

or using stdin...

zq -P '( sort ip.resp_h ; sort ip_key ) | join ip.resp_h=ip_key | ...'  - intel.zng
philrz commented 3 years ago

Looks like a duplicate of #1629.