apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.02k stars 1.14k forks source link

Change Expr `PartialOrd` to not rely on comparing hash values #8932

Closed alamb closed 1 week ago

alamb commented 8 months ago

Describe the bug

As described on https://github.com/apache/arrow-datafusion/pull/8908, the PartialOrd implementation of Expr is based on comparing the hash values of the two nodes

This is problematic because there can be different orderings for the same expression:

  1. Different releases of ahash
  2. Different platforms (as we saw with https://github.com/apache/arrow-datafusion/pull/8908 / 35.0.0 release verification

Here is the current implementation

https://github.com/apache/arrow-datafusion/blob/b7e13a0af711477ad41450566c14430089edd3f2/datafusion/expr/src/expr.rs#L850-L862

To Reproduce

No response

Expected behavior

I expect that PartialOrd of Expr will not change -- we probably have to implement a real PartialOrd implementation that compares each field (or maybe derive)

Additional context

No response

alamb commented 8 months ago

@tustvold notes on https://github.com/apache/arrow-datafusion/pull/8908#discussion_r1460841842:

Edit: as suspected it looks like PartialEq is being derived using a proc macro, and is therefore inconsistent with this implementation of PartialOrd - https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/expr.rs#L87. In particular PartialOrd could claim things to be equal due to a hash collision, when PartialEq would indicate they are not

So my conclusion is that this is a latent bug waiting to happen (that will be very hard to track down if/when it does strike)

ngli-me commented 3 weeks ago

Hi, I'm new to this project and was poking through the issues, can I try working this one out?

It looks like, to remove the hash we would need to first compare by enum variant (it looks like based on the test that the declaration order determines the Ord). I see two solutions for this, either implementing a "discriminant" function to determine the variant ordering, or by having a lot of match arms.

In the case of equivalent variant, we can then compare by fields, looks like it should work for the majority of them.

Is that the right track?

alamb commented 3 weeks ago

Hi, I'm new to this project and was poking through the issues, can I try working this one out?

It looks like, to remove the hash we would need to first compare by enum variant (it looks like based on the test that the declaration order determines the Ord). I see two solutions for this, either implementing a "discriminant" function to determine the variant ordering, or by having a lot of match arms.

In the case of equivalent variant, we can then compare by fields, looks like it should work for the majority of them.

Is that the right track?

I think so . I don't think the actual relative order of Exprs is important, only that it is consistent

It might be worth figuring out how to break this work into some smaller PRs (rather than one large one) -- perhaps by implementing partial ord for sub fields / structs that are used in Expr before trying to do so for the whole thing.

ngli-me commented 3 weeks ago

Good idea! I went back through to check all the DF sub fields/structs, will start the work + breaking it down.

PartialOrd Status

ngli-me commented 2 weeks ago

I was going through the remaining subtypes/structs, and saw an issue that I wasn't sure how to resolve. While most of the remaining can derive PartialOrd, it looks like DFSchema can't, and since it has a dependency on a Hashmap. A similar problem happens in CreateExternalTable, which uses two Hashmaps. Not sure how to approach this, any advice would be appreciated!

alamb commented 2 weeks ago

I would personally suggest trying to simply ignore the contents of the schema for partial ord -- I think in general the other fields in structs should be enough to compute equality and ordering