facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.52k stars 1.15k forks source link

Deterministic and order agnostic min/max computation that handles nested nulls. #7633

Open laithsakka opened 12 months ago

laithsakka commented 12 months ago

Description

One problem with Min/Max computation is that Presto will stop at the first seen null, and hence the final results could be dependent on the order of things being received.

The change in the order of inputs to a Min aggregation over an array can result in different outputs based on the order of inputs, for example, for the input arrays received in the order [1,2], [3, 4], [3, Null],[3, Null], the minimum is [1,2]. However, if the inputs received in this other order [3, Null], [3, Null], [1,2], [3, 4], the minimum is indeterminate because we cannot compare [3, Null] with [3, Null].

To solve this issue, we ended up not allowing nested nulls to ensure that the final results are deterministic since shuffles can change the order of inputs. For more information about this issue, see https://github.com/facebookincubator/velox/pull/6723.

To better solve this and still support nested nulls, I propose a new method for comparing values for the purpose of finding Min/Max that is deterministic across different input ordering.

Let's think about finding the min value as an aggregation where we consume every input in the incoming stream. Instead of stopping when a null is found, we would continue as follows:

abstract explanation

( you can read the example based first if your brain prefers that)

example based explanation

More concrete explanation with examples:

Proof: need to add it still if people like the idea.

laithsakka commented 12 months ago

cc @mbasmanova @duanmeng

laithsakka commented 12 months ago

This does not only solve the deterministic issue it also give more accurate results for min/max. computation instead of stopping at first seen null with failure. ex: I can find the min across [3, null], [3, null][1,2] why would presto fail? wont fail with this apraoch.

mbasmanova commented 12 months ago

@laithsakka This is very interesting idea. I think I understand the intuition behind this approach, but I don't fully understand the details. Will chat with your over VC.

mbasmanova commented 12 months ago

CC: @duanmeng

laithsakka commented 11 months ago

Another easier way to do this in deterministic way is :

1) assume null to be inf in case of max, or -inf in case of min functions.

2) compute the max2 or the min2 numbers, if we have have t compare inf with inf we pick either. Ex: max( [1, null], [1, null]) is [1, null] max( [1, null], [1, 2]) is [1, null]

3) the final output is indeterminate if comparing the two max or the two min items is indeterminate (require comparing nulls) otherwise its the max or ming among them.

ex input: [1,2], [3, 4], [3, null],[3, null]

min two = [1,2], [3, null] regardless of the order of comparison. since [1,2]<[3, null] is determinate answer is [1,2]

max two = [3,Null][3, Null] regardless of the order of comparison. [3,Null]<[3, Null] is indeterminate answer is indeterminate.

@duanmeng @mbasmanova

duanmeng commented 11 months ago

@laithsakka Looks great. IMHO we could implement this by modifying the compareNulls logic that returns 1 or -1 according to the CompareFlag::nullFirst flag or something else when comparing a null with a non-null and returning std::nullopt when comparing a null with a null.

duanmeng commented 11 months ago

But for [1,2], [2, 4], [3, null],[3, null], the result of min would be non-deterministic,