apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.46k stars 3.52k forks source link

[Python] `pyarrow.Table` equality comparison behavior is unexpected #43985

Open judahrand opened 1 month ago

judahrand commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

>>> table = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table.equals(table)
True
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False

This seems... surprising to me. If the NaN comparison is the arrays always results in False then it is surprising that the first result is True.

Component(s)

Python

Yimche commented 1 week ago

take

Yimche commented 2 days ago

After some reconsideration, @judahrand, why would this be an issue? An object being equal to itself seems reasonable, no matter the value stored inside?

judahrand commented 2 days ago

An object being equal to itself seems reasonable, no matter the value stored inside?

I disagree. For an explicit .equals method I would expect the result to be the same whether the Python object is the same or if the object is a copy. This is not the current behaviour:

>>> import pyarrow as pa
>>> table_1 = pa.Table.from_pydict({"foo": [0.1]})
>>> table_2 = pa.Table.from_pydict({"foo": [0.1]})
>>> table_1.equals(table_2)
True
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False
>>> table_1.equals(table_1)
True

It feels really weird and unexpected to me for .equals to behave differently depending on the contents on the table. It seems like the current implementation (in pseudo code) is more like:

def equals(self, other):
    if self is other:
        return True
    return self == other
Yimche commented 2 days ago

I would expect the result to be the same whether the Python object is the same or if the object is a copy

I think I see where you're coming from, and I think you're correct as what is happening here is that (if my understanding of the equality function is correct):

    def equals(self, Table other, bint check_metadata=False):
        self._assert_cpu()
        if other is None:
            return False

        cdef:
            CTable* this_table = self.table
            CTable* other_table = other.table
            c_bool result

        with nogil:
            result = this_table.Equals(deref(other_table), check_metadata)

        return result

In terms of C, since it points to itself, and by extension the same "NaN" data, it results to being equal. The copy previously notioned wasn't a copy of the pointer, but a copy by value, hence the failed comparison, as I think C would have just had whatever data was previously there, if it wasn't just filled with random data. If we were to instead do a copy by reference (i.e. making table_2 point to the same point of memory) it passes (but note this is just restating an equals to self).

>>> import pyarrow as pa
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False
>>> table_2 = table_1
>>> table_1.equals(table_2)
True

I will note that when looking for similar situations I noticed some inconsistencies in how other libraries deal with such an edge case with NaNs:

>>> import numpy as np
>>> a1 = np.array([float("nan")])
>>> a2 = np.array([float("nan")])
>>> np.array_equal(a1, a2)
False
>>> np.array_equal(a1, a1)
False

and

>>> import pandas as pd
>>> table1 = pd.DataFrame([[float("nan")]])
>>> table2 = pd.DataFrame([[float("nan")]])
>>> table1.equals(table2)
True
>>> table1.equals(table1)
True

So this feels like something a pyarrow maintainer or the greater community need to decide on as the "correct" behaviour for the equality comparison.

Yimche commented 2 days ago

Perhaps @kou if you could spare the time to explain potential thoughts on the matter?

kou commented 2 days ago

I think that this is a bug:

diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 5dc5e4c1a9..44eab9902d 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -534,9 +534,6 @@ Result<std::shared_ptr<Table>> PromoteTableToSchema(const std::shared_ptr<Table>
 }

 bool Table::Equals(const Table& other, bool check_metadata) const {
-  if (this == &other) {
-    return true;
-  }
   if (!schema_->Equals(*other.schema(), check_metadata)) {
     return false;
   }