apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.28k stars 3.47k forks source link

[C++][Acero] Add support for `ListType` for non-key fields to Hash Joins #43716

Open anjakefala opened 1 month ago

anjakefala commented 1 month ago

Describe the enhancement requested

Acero's Hash Join does not support ListType in non-key fields for a hash join: https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/hash_join_node.cc#L48 . This is a request to add that support.

PyArrow code that reproduces here:

import pyarrow as pa
import pyarrow.acero as acero

# Creating the Arrow tables
basic_tbl = pa.table({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
basic_tbl_src = acero.Declaration("table_source", options=acero.TableSourceNodeOptions(basic_tbl))

basic_tbl2 = pa.table({'x': [1, 2, 3], 'z': [True, False, True]})
basic_tbl2_src = acero.Declaration("table_source", options=acero.TableSourceNodeOptions(basic_tbl2))

list_tbl = pa.table({'z': [['first', 'list', 'col', 'row'], ['second row', 'here']], 'x': [1, 2]})
list_tbl_src = acero.Declaration("table_source", options=acero.TableSourceNodeOptions(list_tbl))

join_keys = ["x"]

hash_join_options = acero.HashJoinNodeOptions('left outer', left_keys=join_keys, right_keys=join_keys)

joined = acero.Declaration(
        "hashjoin", options=hash_join_options, inputs=[basic_tbl_src, basic_tbl2_src])

result = joined.to_table()
print(result)

# list table
joined = acero.Declaration(
        "hashjoin", options=hash_join_options, inputs=[basic_tbl_src, list_tbl_src])

result = joined.to_table()
print(result)

R code here: https://issues.apache.org/jira/browse/ARROW-14519

In that link, the reason there currently isn't support was noted:

We cannot easily support more types in hash join right now. That is because we transform and encode all the input values, key and non-key (row_encoder.h), so it would need another specialization for each additional type.

So to add this support, it seems like we will need to add the specialisation for the encoding of ListType.

Component(s)

C++

mapleFU commented 4 weeks ago

Would you like to do this? I'm willing to do this and I take the issue https://github.com/apache/arrow/issues/31180 . But I may need a week to get familiar with RowTable

anjakefala commented 4 weeks ago

@mapleFU Oh, please go ahead! =)) Can you please ping me when you have a PR open?

mapleFU commented 3 weeks ago

@zanmato1984 When I gothrough the code I found there're three joins: SwissJoin, AsofJoin and HashJoin

Should I support these types in HashJoin first? Or these can be handle in one patch?

zanmato1984 commented 3 weeks ago

@zanmato1984 When I gothrough the code I found there're three joins: SwissJoin, AsofJoin and HashJoin

Should I support these types in HashJoin first? Or these can be handle in one patch?

HashJoin and SwissJoin are two different implantations of the hash join in SQL. AsofJoin is another type of join in SQL.

We can start with either HashJoin or SwissJoin.

mapleFU commented 3 weeks ago

Thanks! And just curious, do we have any doc or user-cases comparing the HashJoin and SwissJoin in Acero?

jorisvandenbossche commented 3 weeks ago

Two older issues (of which this is probably a duplicate):

zanmato1984 commented 3 weeks ago

Thanks! And just curious, do we have any doc or user-cases comparing the HashJoin and SwissJoin in Acero?

Not particularly. The only doc related AFAIK might be this one: https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/doc/key_map.md

mapleFU commented 3 weeks ago

I've get familiar with part HashJoin code ( But I didn't go through the code for SwissJoin). The proposal encoding and interface will be sent next week

mapleFU commented 2 weeks ago

I've read the related code in Velox[1] and arrow-rs[2], they use similiar encoding here.

List:

[null-flag][element-count] [ element +]

Element list is stored together or in variable-length area.

Struct:

[null-flag][elements]

I'd like to:

  1. Finish https://github.com/apache/arrow/issues/43758 firstly
  2. Support (non-nested) List in RowEncoder https://github.com/apache/arrow/issues/43911 secondly
  3. Support in HashJoin / SwissJoin

In the future more types ( included nested ) can be supports

cc @pitrou @zanmato1984

[1] https://github.com/facebookincubator/velox/blob/db8875c425e8132f553adf12e106cd2e28a811c0/velox/exec/ContainerRowSerde.cpp [2] https://github.com/apache/arrow-rs/blob/master/arrow-row/src/lib.rs#L147

anjakefala commented 2 weeks ago

Thank you for keeping me updated @mapleFU!

mapleFU commented 1 week ago

The comment patch is merged, I'll draft a ListKeyEncoder this week, see: https://github.com/apache/arrow/issues/43911