brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

add parallel input option to zq #1616

Closed mccanne closed 3 years ago

mccanne commented 3 years ago

To run zq with a zql join, you need a way to specify the left file and the right file and feed them as separate inputs to a parallel-input zql graph. We could add a "from" operator that refers generically to the source, but to keep the first steps simple, we will simply add an option to zq specifying that the input files are to be used as inputs to a parallel zql query, e.g.,

zq -P '( sort ip.resp_h ; sort ip_key ) | join ip.resp_h=ip_key | ...' logs.zng intel.zng

This could be used generically beyond the scope of join, e.g., this would just intermix threat intel logs within an edge graph computed from the logs file:

zq -P '( count() by id.orig_h,id.resp_h; filter * ) | sort id.orig_h' logs.zng intel.zng

At most one of the input files may be stdin. It is an error for the number of files to not match the number of parallel inputs to the zql query.
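The two rules above could be checked roughly as in the following sketch. This is not zq's actual implementation: the function name `check_parallel_inputs`, the use of `-` to denote stdin, and passing the number of parallel branches in as an already-known integer (rather than deriving it from the parsed zql) are all assumptions made for illustration.

```python
STDIN = "-"  # assumption: "-" on the command line means stdin


def check_parallel_inputs(paths, num_branches):
    """Pair each input path with a parallel branch index, or raise ValueError.

    paths        -- input file arguments in command-line order
    num_branches -- number of parallel inputs in the query (hypothetical:
                    assumed to be computed elsewhere from the parsed zql)
    """
    # Rule 1: at most one of the inputs may be stdin.
    stdin_count = sum(1 for p in paths if p == STDIN)
    if stdin_count > 1:
        raise ValueError("at most one input file may be stdin")

    # Rule 2: the file count must match the number of parallel inputs.
    if len(paths) != num_branches:
        raise ValueError(
            f"query has {num_branches} parallel inputs "
            f"but {len(paths)} input files were given"
        )

    # The i-th file feeds the i-th branch, in command-line order.
    return list(enumerate(paths))
```

With the first example above, `check_parallel_inputs(["logs.zng", "intel.zng"], 2)` would pair `logs.zng` with the first branch and `intel.zng` with the second, while passing a single file to the same two-branch query would raise an error.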

The join proc will be added in a subsequent PR.

philrz commented 3 years ago

I ended up verifying this as part of the overall verification of join in #1629.