apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Add option to consolidate key columns in hash join #31383

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Currently the hash join outputs key columns from both sides. On an outer join this can help distinguish between a row that matched but had entirely null payloads on one side and a row that didn't match on one side.

However, that distinction is sometimes not very important and many databases will simply coalesce the key columns into one. For example, we might get an outer join result today that looks like:


L_KEY | R_KEY | L_PAY | R_PAY
    0 |     0 |     x |     Y
 NULL |     1 |  NULL |     Z
    2 |  NULL |     A |  NULL

Ideally we could specify a "combine key columns" option to get a result that looks like:


KEY | L_PAY | R_PAY
  0 |     x |     Y
  1 |  NULL |     Z
  2 |     A |  NULL

This can be done today with an extra project step, and it isn't likely to offer much performance benefit, but from a usability perspective it would be nice if users didn't have to do this extra project step.
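To make the extra project step concrete, here is a minimal plain-Python sketch of what it has to compute. It does not use any Arrow API: the rows are tuples, `None` stands in for NULL, and the helper names `coalesce` and `consolidate_keys` are hypothetical, chosen only for this illustration.

```python
# Rows of the outer join result as (L_KEY, R_KEY, L_PAY, R_PAY).
# None stands in for NULL. This mirrors the first table above.
joined = [
    (0, 0, "x", "Y"),
    (None, 1, None, "Z"),
    (2, None, "A", None),
]


def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:  # explicit None check so a key of 0 is kept
            return value
    return None


def consolidate_keys(rows):
    """Project (L_KEY, R_KEY, L_PAY, R_PAY) down to (KEY, L_PAY, R_PAY)."""
    return [
        (coalesce(l_key, r_key), l_pay, r_pay)
        for l_key, r_key, l_pay, r_pay in rows
    ]


consolidated = consolidate_keys(joined)
for row in consolidated:
    print(row)
```

The result matches the second table above. In a real plan the same idea would presumably be expressed as a coalesce expression over the two key fields inside the project step.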

Reporter: Weston Pace / @westonpace

Note: This issue was originally created as ARROW-15957. Please see the migration documentation for further details.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: I suppose this is essentially a duplicate of ARROW-15838, or we can maybe keep that one for the R side of things.

asfimport commented 2 years ago

Alessandro Molina / @amol-: FYI, in some cases I don't think using project is a viable workaround.

For joins, if suffixes are provided, you will only know the names of the columns after the join operation, so it's fairly hard to build the right projection (you would have to compute the column collisions yourself). This is especially true since there is no way to do a "Project All" that gets all the resulting columns from the join apart from the duplicated keys.
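As a rough illustration of the bookkeeping described here, the sketch below computes the post-join column names by hand and then builds a "everything except the right-side keys" projection. It is plain Python, not Arrow code. It assumes that a suffix is appended only to names that appear on both sides, which may not match what the join node actually does, and the function names are hypothetical.

```python
def joined_column_names(left, right, left_suffix, right_suffix):
    """Guess the output names of a join.

    Assumption: only names present on both sides receive a suffix.
    """
    collisions = set(left) & set(right)
    out_left = [n + left_suffix if n in collisions else n for n in left]
    out_right = [n + right_suffix if n in collisions else n for n in right]
    return out_left, out_right


def projection_without_right_keys(left, right, right_keys,
                                  left_suffix, right_suffix):
    """List the output columns to keep, dropping the right-side key columns."""
    out_left, out_right = joined_column_names(
        left, right, left_suffix, right_suffix
    )
    # Pair each right-side source name with its output name so keys can be
    # filtered by their original (pre-suffix) name.
    kept_right = [
        out for src, out in zip(right, out_right) if src not in right_keys
    ]
    return out_left + kept_right


names = projection_without_right_keys(
    left=["key", "value"],
    right=["key", "value", "extra"],
    right_keys=["key"],
    left_suffix="_l",
    right_suffix="_r",
)
print(names)
```

Note that the left key still comes out as `key_l` here, so a caller would also need a rename to get a clean `key` column, which is part of why this workaround is awkward.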

asfimport commented 2 years ago

Weston Pace / @westonpace: You should be able to reference the fields by index, but I agree that is inconvenient.
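A small sketch of the index-based approach, in plain Python rather than Arrow code. It assumes the join output lists all left columns first, followed by all right columns in their original order. The helper name `indices_without_right_keys` is hypothetical.

```python
def indices_without_right_keys(num_left, num_right, right_key_positions):
    """Return the output field indices to keep.

    Assumption: output fields are the left columns followed by the right
    columns, so right column p sits at index num_left + p.
    """
    dropped = {num_left + p for p in right_key_positions}
    return [i for i in range(num_left + num_right) if i not in dropped]


# Left has 2 columns, right has 3, and the right key is the right side's
# first column.
indices = indices_without_right_keys(2, 3, [0])
print(indices)
```

This sidesteps the naming problem, but it only drops the right key. For a full outer join the two key fields would still need to be coalesced, and the caller has to track column positions by hand.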

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.