apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Storing negative and positive zeros in dictionary array #26214

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Hypothesis has discovered a corner case when converting a dictionary array with float values to a pandas series:


arr = pa.array([0., -0.], type=pa.dictionary(pa.int8(), pa.float32()))
arr.to_pandas()

raises:


categories = Float64Index([0.0, -0.0], dtype='float64'), fastpath = False

    @staticmethod
    def validate_categories(categories, fastpath: bool = False):
        """
        Validates that we have good categories

        Parameters
        ----------
        categories : array-like
        fastpath : bool
            Whether to skip nan and uniqueness checks

        Returns
        -------
        categories : Index
        """
        from pandas.core.indexes.base import Index

        if not fastpath and not is_list_like(categories):
            raise TypeError(
                f"Parameter 'categories' must be list-like, was {repr(categories)}"
            )
        elif not isinstance(categories, ABCIndexClass):
            categories = Index(categories, tupleize_cols=False)

        if not fastpath:

            if categories.hasnans:
                raise ValueError("Categorical categories cannot be null")

            if not categories.is_unique:
>               raise ValueError("Categorical categories must be unique")
E               ValueError: Categorical categories must be unique

The arrow array looks like the following:


-- dictionary:
  [
    0,
    -0
  ]
-- indices:
  [
    0,
    1
  ]

So we hash the negative and positive zeros to different values, and pandas/numpy is then unable to convert the result to a categorical series since the values are not unique:


In [2]: np.array(-0.) == np.array(0.)
Out[2]: True

In [3]: -0.0 == 0.0
Out[3]: True

In [4]: np.unique(np.array([0.0, -0.0]))
Out[4]: array([0.])

Although 0.0 and -0.0 are distinct values, they are considered equal according to the IEEE 754 standard.
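For illustration (plain Python, not pyarrow): the two zeros have distinct bit patterns, yet they compare and hash as equal, while the sign remains observable via `copysign`:

```python
import math
import struct

# The two zeros have different bit patterns (the sign bit differs) ...
assert struct.pack('>d', 0.0) != struct.pack('>d', -0.0)
# ... but they compare equal per IEEE 754, and Python hashes them identically:
assert 0.0 == -0.0
assert hash(0.0) == hash(-0.0)
# The sign is nevertheless observable:
assert math.copysign(1.0, -0.0) == -1.0
```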

Reporter: Krisztian Szucs / @kszucs

Note: This issue was originally created as ARROW-10211. Please see the migration documentation for further details.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: What is validate_categories? Is it in PyArrow?

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: No, it's code from pandas, but the code itself is not that important here: it's just the fact that pandas requires categories (dictionary values) to be unique, and that python/numpy/pandas regard -0 and 0 as equal when it comes to hash-table / "unique" operations. And because pyarrow does not see -0 and 0 as equal in the dictionary encoding step, that gives a roundtrip problem with pandas.

Personally, I would update our hashing / encoding to also see -0 and 0 as equal values.
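A minimal plain-Python sketch of what such an encoding would look like (the `encode` helper is hypothetical for illustration, not a pyarrow API):

```python
import math

# Hypothetical sketch (not pyarrow API): dictionary-encode floats while
# canonicalizing -0.0 to +0.0, so both zeros map to a single entry.
def encode(values):
    dictionary, indices, seen = [], [], {}
    for v in values:
        key = 0.0 if v == 0.0 else v   # -0.0 == 0.0, so both become +0.0
        if key not in seen:
            seen[key] = len(dictionary)
            dictionary.append(key)
        indices.append(seen[key])
    return dictionary, indices

dictionary, indices = encode([0.0, -0.0, 1.5])
assert dictionary == [0.0, 1.5]   # one entry for both zeros
assert indices == [0, 0, 1]
assert math.copysign(1.0, dictionary[0]) == 1.0   # the +0.0 representative wins
```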

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: Another place where this comes up (which is not independent from the pandas roundtrip issue):


In [24]: pa.array([0., -0.]).unique()
Out[24]: 
<pyarrow.lib.DoubleArray object at 0x7f633bb075e8>
[
  0,
  -0
]

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: If we treat 0 and -0 as equal, then the categorization will lose information. I believe Arrow has the right semantics here.
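To make the information loss concrete (plain Python; the collapsed dictionary below is hypothetical, not what pyarrow currently produces):

```python
import math

# Hypothetical: suppose encoding had collapsed both zeros into one entry.
original = [0.0, -0.0]
collapsed_dictionary = [0.0]   # single entry standing in for both zeros
indices = [0, 0]
decoded = [collapsed_dictionary[i] for i in indices]

# The decoded values still compare equal to the originals ...
assert decoded == original
# ... but the sign of the second element is irrecoverably gone:
assert math.copysign(1.0, original[1]) == -1.0
assert math.copysign(1.0, decoded[1]) == 1.0
```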

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Also, I have to ask: how often do people actually use categoricals for floating-point values?

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche:

If we treat 0 and -0 as equal, then the categorization will lose information

To be clear, we already treat 0 and -0 equal in other situations:


In [25]: a1 = pa.array([0., -0.])

In [26]: a2 = pa.array([-0., 0.])

In [27]: a1.equals(a2)
Out[27]: True

In [28]: import pyarrow.compute as pc

In [29]: pc.equal(a1, a2)
Out[29]: 
<pyarrow.lib.BooleanArray object at 0x7f633bad7288>
[
  true,
  true
]

(of course those are other operations, so that doesn't mean we need to use the same semantics for encoding/unique).

I don't think many people use floating-point values in dictionaries/categoricals. And personally, I don't care that much about the pandas conversion / python roundtrip in this case. It can perfectly well be one of the exceptions on roundtrip (in the end it's pandas that is more strict than arrow here).
I think it is rather the underlying issue that this test case brought up that is interesting: should our hashing code regard 0 and -0 as equal or not? (since that impacts actual pyarrow functionality: dictionary encoding, unique, .., independent from arrow<>python conversions).

I don't have a strong opinion on this last aspect, though. I was mainly pointing out that python/numpy/pandas do treat them as equal in hash/unique contexts as well. But e.g. I checked with Julia, and it keeps 0 and -0 as distinct values (while still evaluating them as equal under ==, i.e. the same behaviour as Arrow currently has).

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Well, sure, they are equal, but they are not the same. Conversely, I would expect dictionary-encoding of NaNs to collapse all NaN as a single dictionary entry, even though they are not equal as per IEEE.
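A plain-Python sketch of why collapsing NaNs needs an explicit check (the `same_entry` helper is hypothetical, for illustration only):

```python
import math

nan1, nan2 = float('nan'), float('nan')
assert nan1 != nan2                  # IEEE 754: NaN compares unequal, even to itself
assert math.isnan(nan1) and math.isnan(nan2)

# Plain equality would therefore keep every NaN as a distinct dictionary entry;
# collapsing them, as Arrow's unique() does, needs an explicit isnan check.
def same_entry(a: float, b: float) -> bool:
    if math.isnan(a) and math.isnan(b):
        return True                  # all NaNs count as one entry
    return a == b                    # note: plain == treats 0.0 and -0.0 as equal

assert same_entry(nan1, nan2)        # NaNs collapse to one entry
assert same_entry(0.0, -0.0)         # but == alone cannot keep signed zeros apart
```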

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche:

Conversely, I would expect dictionary-encoding of NaNs to collapse all NaN as a single dictionary entry, even though they are not equal as per IEEE.

That's also what we do:


In [30]: pa.array([0., -0., np.nan, np.nan]).unique()
Out[30]: 
<pyarrow.lib.DoubleArray object at 0x7f633bad63a8>
[
  0,
  -0,
  nan
]

and e.g. Julia and pandas (regarding the NaNs) do the same (numpy does not, though: it keeps the distinct NaNs).
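As an aside, a user who needs the pandas roundtrip today could normalize signed zeros before encoding, since IEEE 754 defines `-0.0 + 0.0` as `+0.0` under the default rounding mode (a workaround sketch in plain Python, not something proposed in this thread):

```python
import math

# Adding 0.0 normalizes the sign of zeros while leaving other values unchanged,
# because -0.0 + 0.0 evaluates to +0.0 under IEEE 754 round-to-nearest.
values = [0.0, -0.0, 1.5]
normalized = [v + 0.0 for v in values]

assert math.copysign(1.0, normalized[1]) == 1.0   # -0.0 became +0.0
assert normalized == values                        # numerically identical
```

With signed zeros normalized up front, both the dictionary encoding and pandas' category-uniqueness check would see a single zero value.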