brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

collect() and union() skipping null values #5039

Open philrz opened 8 months ago

philrz commented 8 months ago

Repro is with Zed commit 4dc6236.

While helping a community Slack user I bumped into the following surprising result:

$ zq -version
Version: v1.14.0-2-g4dc62369

$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,4]

I would have expected the null value to still be in the array created.

Presumably due to the same effect, here collect() is not even returning an array.

$ echo 'null null' | zq -z 'collect(this)' -
null

union() has similar behaviors.

$ echo '1 2 null 4' | zq -z 'union(this)' -
|[1,2,4]|

$ echo 'null null' | zq -z 'union(this)' -
null

If we decide the current behavior is intentional and something we intend to keep, I'd suggest we start mentioning it in the docs with an explanation.

philrz commented 8 months ago

5041 was opened to change the behavior so collect() would preserve these values, but then we discussed it as a team and have decided to hold off on merging anything yet. Some things that were touched on in the discussion:

  1. There was consensus that if we change this for collect() we should do the same for union() so they're consistent.

  2. One reason for hesitation on merging #5041 is that this change is significant enough that it might break some programs, e.g., those of a particular community zync user that has some particularly large/complex Zed that relies heavily on collect() at times. If Zed were more mature it would probably be appropriate to do a careful/slow transition, e.g., maintain the current behavior with some kind of flag that becomes a new default, then surface a warning to users whose programs are relying on that behavior because we intend to swap the default in a future version, since this would give them time to understand and react to the change. Instead we discussed helping them to audit their programs for possible exposure and also giving a heads-up to the wider community via Slack before switching the behavior. But then we identified other behaviors with aggregate functions we might want to consider, so we concluded that we should gather up our thoughts on all those changes and then figure out a transition plan.

  3. As an example of "other behaviors", @mccanne cited avg() as an example, since here null values are currently treated as "not present", whereas if other aggregate functions like collect() and union() are changed to start preserving the null values, users might now expect a different behavior from avg() too, e.g., treating them as "present" and implying some meaningful value such as 0, which would change the result.

$ zq -version
Version: v1.14.0-2-g4dc62369

$ echo '1 3 5 7' | zq -z 'avg(this)' -
4.

$ echo '1 null 3 5 7' | zq -z 'avg(this)' -
4.
  1. @mattnibs pointed out that if we did toggle the behavior, it would be easy for users to get back the current behavior using the where clause, such as can be seen with his branch from #5041:
$ zq -version
Version: v1.14.0-3-ga33df054

$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,null,4]([(int64,null)])

$ echo '1 2 null 4' | zq -z 'collect(this) where this != null' -
[1,2,4]