Open laithsakka opened 12 months ago
cc @mbasmanova @duanmeng
This does not only solve the deterministic issue it also give more accurate results for min/max. computation instead of stopping at first seen null with failure. ex: I can find the min across [3, null], [3, null][1,2] why would presto fail? wont fail with this apraoch.
@laithsakka This is very interesting idea. I think I understand the intuition behind this approach, but I don't fully understand the details. Will chat with your over VC.
CC: @duanmeng
Another easier way to do this in deterministic way is :
1) assume null to be inf in case of max, or -inf in case of min functions.
2) compute the max2 or the min2 numbers, if we have have t compare inf with inf we pick either. Ex: max( [1, null], [1, null]) is [1, null] max( [1, null], [1, 2]) is [1, null]
3) the final output is indeterminate if comparing the two max or the two min items is indeterminate (require comparing nulls) otherwise its the max or ming among them.
ex input: [1,2], [3, 4], [3, null],[3, null]
min two = [1,2], [3, null] regardless of the order of comparison. since [1,2]<[3, null] is determinate answer is [1,2]
max two = [3,Null][3, Null] regardless of the order of comparison. [3,Null]<[3, Null] is indeterminate answer is indeterminate.
@duanmeng @mbasmanova
@laithsakka Looks great. IMHO we could implement this by modifying the compareNulls
logic that returns 1 or -1 according to the CompareFlag::nullFirst
flag or something else when comparing a null with a non-null and returning std::nullopt
when comparing a null with a null.
But for [1,2], [2, 4], [3, null],[3, null]
, the result of min would be non-deterministic,
([1, 2], [3, null])
and ([2, 4], [3, null])
([1, 2], [2, 4])
and ([3, null],[3, null])
that the processing of the second stream would throw null vs null exceptions.
Description
One problem with Min/Max computation is that Presto will stop at the first seen null, and hence the final results could be dependent on the order of things being received.
The change in the order of inputs to a Min aggregation over an array can result in different outputs based on the order of inputs, for example, for the input arrays received in the order [1,2], [3, 4], [3, Null],[3, Null], the minimum is [1,2]. However, if the inputs received in this other order [3, Null], [3, Null], [1,2], [3, 4], the minimum is indeterminate because we cannot compare [3, Null] with [3, Null].
To solve this issue, we ended up not allowing nested nulls to ensure that the final results are deterministic since shuffles can change the order of inputs. For more information about this issue, see https://github.com/facebookincubator/velox/pull/6723.
To better solve this and still support nested nulls, I propose a new method for comparing values for the purpose of finding Min/Max that is deterministic across different input ordering.
Let's think about finding the min value as an aggregation where we consume every input in the incoming stream. Instead of stopping when a null is found, we would continue as follows:
abstract explanation
( you can read the example based first if your brain prefers that)
example based explanation
More concrete explanation with examples:
Proof: need to add it still if people like the idea.