brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Extra output when yielding literals as input data #5324

Open philrz opened 1 week ago

philrz commented 1 week ago

tl;dr

I can't explain why the first line of output appears twice here.

$ zq -z '
yield [{id:1},{id:2}]
| over this
| left join (
  yield {id:1,name:"a"}
) on id=id name'

{id:1,name:"a"}
{id:1,name:"a"}
{id:2}

Details

Repro is with Zed commit b05e70b. This issue was discovered via community Slack thread.

These variations both work as expected.

$ zq -version
Version: v1.18.0-18-gb05e70bd

$ cat names.zson
{id:1,name:"a"}

$ zq -z '
yield [{id:1},{id:2}]
| over this
| left join (
  file names.zson
) on id=id name'

{id:1,name:"a"}
{id:2}

$ cat input.zson
{left: {id:1}}
{left: {id:2}}
{right: {id:1,name:"a"}}

$ cat input.zson | zq -z '
switch (
  case has(left) => yield left
  case has(right) => yield right
) | left join on id=id name' -

{id:1,name:"a"}
{id:2}

However, the user's original program happened to specify the record that forms the right-hand input to the join as a yielded record literal, and for some reason once we do that, the line {id:1,name:"a"} is repeated in the output.

$ zq -z '
yield [{id:1},{id:2}]
| over this
| left join (
  yield {id:1,name:"a"}
) on id=id name'

{id:1,name:"a"}
{id:1,name:"a"}
{id:2}

philrz commented 1 week ago

We reviewed this one as a group and I now have an explanation for why it's happening. @mccanne pointed out that because the yield is inside the join ( ), a yield of the constant value {id:1,name:"a"} is triggered by each upstream value, i.e., once for {id:1} and once for {id:2}. The right-hand side of the join therefore receives two copies of the record, and both of them match the left-hand {id:1}. A simple non-join example of the same effect:

$ echo '1 2 3' | zq -z 'yield "hi"' -
"hi"
"hi"
"hi"

By contrast, from and file are currently implemented to only provide input data from the referenced data source one time.
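
To make the difference concrete, here is a toy Python model of the two behaviors. It is only a sketch of the semantics described above, not how Zed implements them: the names yield_literal, read_once, and left_join are hypothetical helpers invented for this illustration, and the join is simplified to emit one output per matching pair.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def yield_literal(upstream: Iterable[Any], literal: Any) -> Iterator[Any]:
    """Toy model of `yield <literal>`: fires once per upstream value."""
    for _ in upstream:
        yield literal


def read_once(records: Iterable[Record]) -> Iterator[Record]:
    """Toy model of `file`/`from`: emits the stored records a single time,
    regardless of how many values arrive from upstream."""
    yield from records


def left_join(left: Iterable[Record], right: Iterable[Record],
              key: str, field: str) -> List[Record]:
    """Simplified left join: one output per matching (left, right) pair;
    left records with no match pass through unchanged."""
    right_rows = list(right)
    out: List[Record] = []
    for l in left:
        matches = [r for r in right_rows if r.get(key) == l.get(key)]
        if not matches:
            out.append(dict(l))
        for r in matches:
            out.append({**l, field: r[field]})
    return out


left_input = [{"id": 1}, {"id": 2}]
name_record = {"id": 1, "name": "a"}

# Right side built from a yielded literal: one copy per upstream value.
via_yield = left_join(left_input, yield_literal(left_input, name_record),
                      "id", "name")

# Right side read once, as with `file names.zson`.
via_file = left_join(left_input, read_once([name_record]), "id", "name")

print(via_yield)
print(via_file)
```

In this model the yielded literal reaches the join twice, so the {id:1} row is matched twice, while the read-once source produces a single match.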

Since it's effectively working as designed, this might just be a reason to discourage sourcing input data this way, since the side effect would probably elude many users. However, we plan to design some other join improvements in the near future, so for now I've added this one to Epic #4081 so we can review it again when we sit down to look at the others. @mccanne also pointed out that at some point we'll likely enhance from so that it can fire with each upstream input when desired (#4752), so this may also be worth considering in relation to that.