Open asfimport opened 4 years ago
Antoine Pitrou / @pitrou:
What is validate_categories? Is it in PyArrow?
Joris Van den Bossche / @jorisvandenbossche: No, it's code from pandas, but the code itself is not that important here: what matters is that pandas requires categories (dictionary values) to be unique, and that python/numpy/pandas regard -0 and 0 as equal in hash-table / "unique" operations. Because pyarrow does not see -0 and 0 as equal in the dictionary encoding step, this gives a roundtrip problem with pandas.
Personally, I would update our hashing / encoding to also see -0 and 0 as equal values.
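For illustration (a minimal sketch, not pandas' actual validation code): plain Python already collapses 0.0 and -0.0 in hash-based containers, which is the behaviour pandas' uniqueness check inherits:

```python
# Plain Python gives 0.0 and -0.0 the same hash, so any hash-based
# "unique" operation collapses them into a single entry.
values = [0.0, -0.0]

assert hash(0.0) == hash(-0.0)  # identical hashes
assert 0.0 == -0.0              # compare equal (IEEE 754)

unique_values = set(values)     # hash-based deduplication
print(len(unique_values))       # prints 1
```

So categories containing both zeros are seen as duplicated by any hash-based uniqueness check on the Python side.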
Joris Van den Bossche / @jorisvandenbossche: Another place where this comes up (which is not independent from the pandas roundtrip issue):
In [24]: pa.array([0., -0.]).unique()
Out[24]:
<pyarrow.lib.DoubleArray object at 0x7f633bb075e8>
[
0,
-0
]
Antoine Pitrou / @pitrou: If we treat 0 and -0 as equal, then the categorization will lose information. I believe Arrow has the right semantics here.
Antoine Pitrou / @pitrou: Also, I have to ask: how often do people actually use categoricals for floating-point values?
Joris Van den Bossche / @jorisvandenbossche:
If we treat 0 and -0 as equal, then the categorization will lose information
To be clear, we already treat 0 and -0 as equal in other situations:
In [25]: a1 = pa.array([0., -0.])
In [26]: a2 = pa.array([-0., 0.])
In [27]: a1.equals(a2)
Out[27]: True
In [28]: import pyarrow.compute as pc
In [29]: pc.equal(a1, a2)
Out[29]:
<pyarrow.lib.BooleanArray object at 0x7f633bad7288>
[
true,
true
]
(of course those are other operations, so that doesn't mean we need to use the same semantics for encoding/unique).
I don't think many people use floating-point values in dictionaries/categoricals. And personally, I don't care that much about the pandas conversion / python roundtrip in this case. It can perfectly well be one of the exceptions to the roundtrip (in the end it's pandas that is stricter than Arrow here).
I think it is rather the underlying issue that this test case brought up that is interesting: should our hashing code regard 0 and -0 as equal or not? (since that impacts actual pyarrow functionality: dictionary encoding, unique, ..., independent from arrow<>python conversions).
Now, I don't have a strong opinion on this last aspect, though. I was mainly pointing out that python/numpy/pandas do treat them as equal also in hash/unique contexts. But e.g. I checked with Julia, and it keeps 0 and -0 as distinct values (while still evaluating them as equal under ==, i.e. the same behaviour as Arrow currently has).
Antoine Pitrou / @pitrou: Well, sure, they are equal, but they are not the same. Conversely, I would expect dictionary-encoding of NaNs to collapse all NaNs into a single dictionary entry, even though they are not equal per IEEE.
Joris Van den Bossche / @jorisvandenbossche:
Conversely, I would expect dictionary-encoding of NaNs to collapse all NaN as a single dictionary entry, even though they are not equal as per IEEE.
That's also what we do:
In [30]: pa.array([0., -0., np.nan, np.nan]).unique()
Out[30]:
<pyarrow.lib.DoubleArray object at 0x7f633bad63a8>
[
0,
-0,
nan
]
and e.g. Julia and pandas do the same for the NaNs (NumPy does not, though: it keeps the NaNs distinct).
Hypothesis has discovered a corner case when converting a dictionary array with float values to a pandas series:
raises:
The arrow array looks like the following:
We hash the negative and positive zeroes to different values, so pandas/numpy is unable to convert the array to a categorical series, since the values are not unique:
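That hashing difference comes down to the raw bit patterns (an illustrative sketch, not Arrow's actual hashing code): the two zeros differ only in the sign bit, so a hash computed over the raw bytes separates them:

```python
import struct

# The IEEE 754 encodings of 0.0 and -0.0 differ only in the sign bit,
# so any hash over the raw 8 bytes distinguishes them. (Assumption:
# this mirrors the effect of Arrow's byte-level hashing; it is not
# Arrow's actual implementation.)
bits_pos = struct.pack("<d", 0.0)
bits_neg = struct.pack("<d", -0.0)

print(bits_pos.hex())   # 0000000000000000
print(bits_neg.hex())   # 0000000000000080  (sign bit set)
assert bits_pos != bits_neg
```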
Although 0.0 and -0.0 are different values, they are considered equal according to the IEEE 754 standard.
Reporter: Krisztian Szucs / @kszucs
Note: This issue was originally created as ARROW-10211. Please see the migration documentation for further details.