Closed: jvenant closed this issue 4 months ago
Good catch. The problem is in DoubleOffHeapMutableDictionary.equalsValueAt(), which will always return false when the input value is NaN.
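The root cause is the IEEE 754 rule that NaN compares unequal to everything, including itself. A minimal sketch (illustrative, not Pinot's actual code) of why a plain == equality check can never match a stored NaN:

```java
public class NaNCompare {
    public static void main(String[] args) {
        double stored = Double.NaN;
        double input = Double.NaN;
        // IEEE 754: NaN is unequal to every value, including itself,
        // so an equality check like `stored == input` never matches.
        System.out.println(stored == input);      // false
        System.out.println(stored == stored);     // false
        // The only reliable runtime check is Double.isNaN:
        System.out.println(Double.isNaN(stored)); // true
    }
}
```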
We can fix it by comparing Double.doubleToLongBits() values (bit-level equality), but I'm not sure if allowing NaN can cause other problems (it might break the binary search as well). Another option is to prevent NaN from being ingested: disallow it as a default value, and replace it with the default value when it appears in the input data.
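The bit-level comparison could look roughly like this (a sketch of the idea, not Pinot's actual code; note the -0.0 caveat):

```java
public class BitwiseNaNCompare {
    // Hypothetical helper mirroring the proposed fix: compare bit
    // patterns instead of using `==`, so NaN matches NaN.
    // Double.doubleToLongBits canonicalizes all NaN bit patterns to a
    // single value, so any two NaNs compare equal here.
    static boolean bitwiseEquals(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        System.out.println(bitwiseEquals(Double.NaN, Double.NaN)); // true
        System.out.println(bitwiseEquals(1.5, 1.5));               // true
        // Caveat: -0.0 and 0.0 have different bit patterns, so they
        // would no longer compare equal under this scheme.
        System.out.println(bitwiseEquals(-0.0, 0.0));              // false
    }
}
```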
Maybe a quick fix would be to disable NaN.
But from my standpoint, being able to use NaN would be the better solution, as it is not semantically equivalent to the infinities, null, or zero. It is a very common value in the mathematical and finance domains.
The only other place where we have an issue with NaN so far is in WHERE filters: we have to write something like cast(num_col as STRING) = 'NaN'. The filters num_col = 'NaN' and num_col != 'NaN' run, but don't filter the records correctly.
I think this bug can be very dangerous. While waiting for proper NaN handling, maybe we could reject schemas with a NaN default?
I feel we should just reject NaN for now; the dictionary binary search might also fail. To reject it, we can add a check in FieldSpec.getDefaultNullValue(), and in the record reader, replace NaN with the default value. @jvenant Do you want to help contribute this?
cc @snleee
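The record-reader replacement described above could be sketched as follows (the class and method names here are illustrative, not Pinot's actual API):

```java
// Hypothetical sanitizer sketch: replace NaN with the field's default
// null value before the record reaches the dictionary.
public class NaNSanitizer {
    static double sanitize(double input, double defaultNullValue) {
        return Double.isNaN(input) ? defaultNullValue : input;
    }

    public static void main(String[] args) {
        System.out.println(sanitize(Double.NaN, 0.0)); // 0.0
        System.out.println(sanitize(3.14, 0.0));       // 3.14
    }
}
```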
Not sure the PR actually fixes the issue. It's more a protection than a fix.
It was auto-closed by merging #11661. Reopened it since we haven't addressed the issue completely.
MySQL: does not allow NaN, only null.
Postgres: allows NaN. The official documentation states: "In most implementations of the “not-a-number” concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values."
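Interestingly, Java's own Double.compare already implements a PostgreSQL-like total order, which may make that long-term direction natural on the JVM. A small demonstration:

```java
public class TotalOrderCompare {
    public static void main(String[] args) {
        // Double.compare treats NaN as equal to itself...
        System.out.println(Double.compare(Double.NaN, Double.NaN));  // 0
        // ...and as greater than every other value, including +Infinity,
        // matching PostgreSQL's sorting behavior for NaN.
        System.out.println(
            Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0); // true
    }
}
```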
There is a similar issue with -0.0 and 0.0 (this would also mess up the binary search/sorting):
https://jonisalonen.com/2013/positive-and-negative-zeros-and-mysql/
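The signed-zero inconsistency can be seen directly in Java: the three comparison mechanisms disagree about whether the two zeros are the same value, which is exactly what breaks sorting and bitwise lookups.

```java
public class ZeroCompare {
    public static void main(String[] args) {
        // Primitive comparison treats the two zeros as equal...
        System.out.println(-0.0 == 0.0);                   // true
        // ...but Double.compare orders -0.0 strictly before 0.0...
        System.out.println(Double.compare(-0.0, 0.0));     // -1
        // ...and their bit patterns differ, so a bitwise equality
        // check or a sorted binary search sees two distinct values.
        System.out.println(Double.doubleToLongBits(-0.0)
                == Double.doubleToLongBits(0.0));          // false
    }
}
```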
To simplify the handling, as a short-term solution, I'd propose:
- Reject NaN as a default value, and replace it with the default null value when ingested
- Replace -0.0 with 0.0
We may consider following the PostgreSQL behavior as a long-term solution.
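The short-term proposal could be sketched as a single ingest-time normalization step (illustrative only; names and placement are assumptions, not Pinot's actual code):

```java
// Normalizes a double before storage: NaN becomes the configured
// default null value, and -0.0 is canonicalized to 0.0.
public class DoubleNormalizer {
    static double normalize(double input, double defaultNullValue) {
        if (Double.isNaN(input)) {
            return defaultNullValue;
        }
        if (input == 0.0) {  // true for both -0.0 and 0.0
            return 0.0;      // canonicalize the signed zero
        }
        return input;
    }

    public static void main(String[] args) {
        System.out.println(normalize(Double.NaN, 0.0));  // 0.0
        System.out.println(normalize(-0.0, 0.0));        // 0.0
        System.out.println(normalize(2.5, 0.0));         // 2.5
    }
}
```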
@Jackie-Jiang sounds good to me.
@Jackie-Jiang since now also the second bugfix is merged: can this issue be closed or do we keep it open till PostgreSQL behavior
is implemented?
Hi,
It seems that using NaN as a default value for Double in schemas can lead to many more calls to BaseOffHeapMutableDictionary.expand than required. And because a multiplication factor is applied at each call, it can quickly try to allocate a huge amount of off-heap memory. I suspect a hazardous conversion of the NaN Double to null somewhere that prevents DoubleOffHeapMutableDictionary.compare from achieving its goal correctly. I wasn't able to get much more detail, but maybe it will be enough to ring a bell for someone who knows this part of Pinot.
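The runaway growth is consistent with the NaN equality bug described earlier: if the dictionary's equality check never matches NaN, every ingested NaN looks like a brand-new value, so the value store keeps growing and triggering expansions. A toy model of that failure mode (not Pinot's code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy dictionary whose lookup uses `==` equality, like the buggy
// equalsValueAt. Because NaN never equals itself, each NaN is
// re-inserted as a "new" value instead of being deduplicated.
public class ToyDictionary {
    private final List<Double> values = new ArrayList<>();

    int index(double v) {
        for (int i = 0; i < values.size(); i++) {
            if (values.get(i) == v) {  // never true when v is NaN
                return i;
            }
        }
        values.add(v);                 // NaN is re-added every time
        return values.size() - 1;
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        ToyDictionary dict = new ToyDictionary();
        for (int i = 0; i < 1000; i++) {
            dict.index(Double.NaN);
        }
        System.out.println(dict.size()); // 1000: one entry per NaN
        dict.index(1.0);
        dict.index(1.0);
        System.out.println(dict.size()); // 1001: 1.0 is deduplicated
    }
}
```

In a real off-heap dictionary, that unbounded entry count is what would drive the repeated, geometrically growing expand calls.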