apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Inconsistent behavior in HashJoin Projections #10978

Open adragomir opened 2 months ago

adragomir commented 2 months ago

Describe the bug

We ran into problems with projections inside HashJoin.

Each schema in the join (left / right) has:

The projection is [0, 2]: the struct column from the left side and the struct column from the right side.

The join column is not part of the output. When the optimizer reverses the join order, the projection is swapped to [2, 0]. However, there is no column with index 2 in the output, because the output contains only the two structs.
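The index mismatch can be modeled without DataFusion at all. The sketch below is only an illustration: the column names and the two-columns-per-side layout are assumptions (the schema listing is missing from this report), and `project` is a hypothetical helper, not a DataFusion API.

```rust
/// Select columns from a schema by index.
/// Returns None if any index is out of bounds.
fn project<'a>(schema: &[&'a str], indices: &[usize]) -> Option<Vec<&'a str>> {
    indices.iter().map(|&i| schema.get(i).copied()).collect()
}

fn main() {
    // Assumed layout: each side has a struct column followed by a join key.
    let left = ["l_struct", "l_key"];
    let right = ["r_struct", "r_key"];

    // Original order: the unprojected join schema is left ++ right.
    let full: Vec<&str> = left.iter().chain(right.iter()).copied().collect();
    let projected = project(&full, &[0, 2]).unwrap();
    assert_eq!(projected, vec!["l_struct", "r_struct"]);

    // Swapped order: the unprojected join schema is right ++ left.
    let swapped_full: Vec<&str> = right.iter().chain(left.iter()).copied().collect();

    // [2, 0] is valid against the unprojected swapped schema...
    assert_eq!(
        project(&swapped_full, &[2, 0]),
        Some(vec!["l_struct", "r_struct"])
    );

    // ...but not against the two-column output of a join that already
    // carries its own projection.
    let swapped_projected = project(&swapped_full, &[0, 2]).unwrap();
    assert_eq!(swapped_projected, vec!["r_struct", "l_struct"]);
    assert_eq!(project(&swapped_projected, &[2, 0]), None);
}
```

Under these assumptions, [2, 0] only makes sense as indices into the four-column schema of the swapped join before its projection is applied, which matches the failure described above.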

To Reproduce

Expected behavior

The hash join optimization works, even when swapping the join order (and wrapping in a ProjectionExec)

Additional context

The comment on HashJoinExec::projection says "The projection indices of the columns in the output schema of join", however

I took a stab at it, but it's unclear what the indices passed in projections are supposed to mean. For now, I am fixing it surgically when swapping the order: I rewrite the projections to be relative to the output schema when wrapping the join with a ProjectionExec.
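A minimal sketch of what "relative to the output schema" could mean is shown here. It is not the code from my fix, and `remap_to_output` is a hypothetical helper: given the projection embedded in the swapped join and the columns wanted by the wrapping projection (both as indices into the unprojected join schema), it returns the position of each wanted column within the join's projected output.

```rust
/// For each wanted column (an index into the unprojected join schema),
/// find its position within the join's embedded projection.
/// Returns None if a wanted column is not produced by the join.
fn remap_to_output(embedded: &[usize], wanted: &[usize]) -> Option<Vec<usize>> {
    wanted
        .iter()
        .map(|w| embedded.iter().position(|e| e == w))
        .collect()
}

fn main() {
    // Assumed example: the swapped join keeps columns 0 and 2 of its
    // unprojected schema, and the wrapper wants them in the order [2, 0].
    let embedded = [0, 2];
    let wanted = [2, 0];

    // Relative to the two-column join output, that is [1, 0].
    assert_eq!(remap_to_output(&embedded, &wanted), Some(vec![1, 0]));

    // A column the join does not output cannot be remapped.
    assert_eq!(remap_to_output(&embedded, &[3]), None);
}
```

With this remapping, the wrapping projection only references indices that exist in the join's projected output.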

adragomir commented 2 months ago

You can see my temporary possible fix here: https://github.com/hstack/arrow-datafusion/commit/a4ab67d9dde53c3c13c92eb7070029282f0e837d

my-vegetable-has-exploded commented 2 months ago

It seems to be a bug related to me; thanks for catching it. I will take a look later.