apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.64k stars 3.55k forks

[Python] run_end_encode doesn't support structs and lists #44183

Open rustyconover opened 2 months ago

rustyconover commented 2 months ago

Describe the bug, including details regarding any error messages, version, and platform.

When attempting to use a RunEndEncoded array with either a struct or a list, an exception is raised indicating that no matching kernel is available.

Steps to Reproduce:

Please run the following example code to reproduce the issue:

import pyarrow as pa

for data_type, values in [
    [pa.struct([pa.field("age", pa.int32())]), {"age": 20}],
    [pa.list_(pa.int32()), [20]]
]:
    try:
        schema = pa.schema([pa.field("data", pa.run_end_encoded(pa.int16(), data_type))])

        data = [
            {"data": values},
        ]

        table = pa.Table.from_pylist(data, schema=schema)
    except Exception as e:
        print(f"Failed with {data_type} {e}")

Observed Output:

Failed with struct<age: int32> Function 'run_end_encode' has no kernel matching input types (struct<age: int32>)
Failed with list<item: int32> Function 'run_end_encode' has no kernel matching input types (list<item: int32>)

Expected Behavior:

The code should correctly create a RunEndEncoded array using both struct and list types without raising exceptions.

Environment:

Additional Context:

The failure seems to suggest that the run_end_encode function does not currently support struct or list types, but it's not explicitly documented whether this is intentional or an oversight.

Component(s)

Python

mapleFU commented 2 months ago

cc @felipecrv

felipecrv commented 2 months ago

Comparing nested values for equality in order to run-end encode them can be expensive and is unlikely to yield good compression rates. Run-end encoding works better on flat columns.

That said, the usual reason a kernel doesn't support REE arrays yet is simply that it hasn't been implemented. Because, you know, it takes time ($) to implement custom kernels for REE arrays.

rustyconover commented 2 months ago

Hi @felipecrv

Why wouldn't it bring great compression rates if the developer knows the column is mostly constant values?

felipecrv commented 2 months ago

> Why wouldn't it bring great compression rates if the developer knows the column is mostly constant values?

All it takes is a new random or misaligned column (struct field) to mess up the repetitiveness of the data.
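To make that point concrete, here is a minimal pure-Python sketch of run-end encoding. It is not Arrow's kernel; `run_end_encode_py` and the `seq` field are hypothetical names used only for illustration. It compares neighbouring values with `==`, which is the comparison a whole-struct kernel would have to perform.

```python
def run_end_encode_py(values):
    """Return (run_ends, run_values); neighbours are compared with ==."""
    run_ends, run_values = [], []
    for i, value in enumerate(values):
        if run_values and run_values[-1] == value:
            run_ends[-1] = i + 1  # extend the current run
        else:
            run_values.append(value)  # start a new run
            run_ends.append(i + 1)    # exclusive logical end of that run
    return run_ends, run_values


# "age" is constant, but the hypothetical "seq" field differs on every row.
rows = [{"age": 20, "seq": i} for i in range(6)]

# Encoding whole structs: every row differs, so there is one run per row.
struct_ends, struct_values = run_end_encode_py(rows)

# Encoding the "age" field alone: a single run covers all six rows.
age_ends, age_values = run_end_encode_py([row["age"] for row in rows])

print(len(struct_values), len(age_values))
```

In this sketch the whole-struct encoding stores as many values as there are rows, while the per-field encoding of `age` collapses to a single run.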

If you know the data is mostly constant values, you don't need run_end_encode, because you can produce the run-end encoded array directly without comparing the struct values.

You can also go for a struct of run-end-encoded fields (not all of them have to be run-end-encoded) and if the whole struct repeats you can share the same run_ends array among the fields (no copying needed).