brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

Passing batches of input values to a lateral subquery #5348

Open philrz opened 1 week ago

philrz commented 1 week ago

tl;dr

A proposed new batch operator:

batch 2 => (
  collect(this) |  // do something with 2-element array
)

Details

Repro is with super commit 0af34e0.

A community user recently asked about this in the same Slack thread that spawned this comment. In their original question they were attempting to operate on just two back-to-back input values, but then widened their inquiry to the more general case of needing to handle an indefinite number of back-to-back input vales of a similar nature.

Let's use the example of 6 incrementing numbers.

$ seq 6
1
2
3
4
5
6

Using what's in the language currently, much like the response to the user's original question, it seems this would require some initial aggregation of the input pairs into a complex value and then picking apart the complex values, potentially using over and a lateral subquery. For instance if we wanted to add each pairing of numbers together:

$ super -version
Version: v1.18.0-35-g0af34e0f

$ seq 6 | zq -z '
yield {i:(count()+1)/2,v:this}
| values:=collect(v) by i
| sort i
| over values => (
  sum(this)
)' -

3
7
11

However, @mccanne pointed out that for a use case like this a purpose-built operator like batch proposed above would provide a more direct and intuitive way to get the values into the lateral subquery.

philrz commented 6 days ago

For comparison, here's a SQL approach to get the same result. I confess I had ChatGPT come up with this, so maybe there's a better way.

$ duckdb --version
v1.1.2 f680b7d08f

$ seq 6 | duckdb -c "                          
CREATE TEMP TABLE NumberedRows AS
    SELECT 
        column0 AS value,
        ROW_NUMBER() OVER () AS rn
    FROM 
        read_csv('/dev/stdin');

SELECT 
    (n1.value + n2.value) AS sum_value
FROM 
    NumberedRows n1
JOIN 
    NumberedRows n2 ON n1.rn = n2.rn - 1
WHERE 
    n1.rn % 2 = 1;"
┌───────────┐
│ sum_value │
│   int64   │
├───────────┤
│         3 │
│         7 │
│        11 │
└───────────┘