Open judahrand opened 1 month ago
take
After some reconsideration, @judahrand, why would this be an issue? An object being equal to itself seems reasonable, no matter the value stored inside?
An object being equal to itself seems reasonable, no matter the value stored inside?
I disagree. For an explicit .equals
method I would expect the result to be the same whether the Python object is the same or if the object is a copy. This is not the current behaviour:
>>> import pyarrow as pa
>>> table_1 = pa.Table.from_pydict({"foo": [0.1]})
>>> table_2 = pa.Table.from_pydict({"foo": [0.1]})
>>> table_1.equals(table_2)
True
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False
>>> table_1.equals(table_1)
True
It feels really weird and unexpected to me for .equals
to behave differently depending on the contents on the table. It seems like the current implementation (in pseudo code) is more like:
def equals(self, other):
if self is other:
return True
return self == other
I would expect the result to be the same whether the Python object is the same or if the object is a copy
I think I see where you're coming from, and I think you're correct as what is happening here is that (if my understanding of the equality function is correct):
def equals(self, Table other, bint check_metadata=False):
self._assert_cpu()
if other is None:
return False
cdef:
CTable* this_table = self.table
CTable* other_table = other.table
c_bool result
with nogil:
result = this_table.Equals(deref(other_table), check_metadata)
return result
In terms of C, since it points to itself, and by extension the same "NaN" data, it results to being equal. The copy previously notioned wasn't a copy of the pointer, but a copy by value, hence the failed comparison, as I think C would have just had whatever data was previously there, if it wasn't just filled with random data. If we were to instead do a copy by reference (i.e. making table_2 point to the same point of memory) it passes (but note this is just restating an equals to self).
>>> import pyarrow as pa
>>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
>>> table_1.equals(table_2)
False
>>> table_2 = table_1
>>> table_1.equals(table_2)
True
I will note that when looking for similar situations I noticed some inconsistencies in how other libraries deal with such an edge case with NaNs:
>>> import numpy as np
>>> a1 = np.array([float("nan")])
>>> a2 = np.array([float("nan")])
>>> np.array_equal(a1, a2)
False
>>> np.array_equal(a1, a1)
False
and
>>> import pandas as pd
>>> table1 = pd.DataFrame([[float("nan")]])
>>> table2 = pd.DataFrame([[float("nan")]])
>>> table1.equals(table2)
True
>>> table1.equals(table1)
True
So this feels like something a pyarrow maintainer or the greater community need to decide on as the "correct" behaviour for the equality comparison.
Perhaps @kou if you could spare the time to explain potential thoughts on the matter?
I think that this is a bug:
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 5dc5e4c1a9..44eab9902d 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -534,9 +534,6 @@ Result<std::shared_ptr<Table>> PromoteTableToSchema(const std::shared_ptr<Table>
}
bool Table::Equals(const Table& other, bool check_metadata) const {
- if (this == &other) {
- return true;
- }
if (!schema_->Equals(*other.schema(), check_metadata)) {
return false;
}
Describe the bug, including details regarding any error messages, version, and platform.
This seems... surprising to me. If the NaN comparison is the arrays always results in
False
then it is surprising that the first result isTrue
.Component(s)
Python