Open rustyconover opened 2 months ago
cc @felipecrv
Comparing nested types for equality in order to run-end encode them can be expensive and is unlikely to yield good compression rates. Run-end encoding works better on flat columns.
That said, the most likely reason a kernel doesn't support REE arrays yet is simply that it hasn't been implemented. It takes time ($) to implement custom kernels for REE arrays.
Hi @felipecrv
Why wouldn't it bring great compression rates if the developer knows the column is mostly constant values?
> Why wouldn't it bring great compression rates if the developer knows the column is mostly constant values?
All it takes is a new random or misaligned column (struct field) to mess up the repetitiveness of the data.
If you know the data is mostly constant values, you don't need `run_end_encode`, because you can produce the run-end encoded array directly without comparing the struct values.
You can also go for a struct of run-end-encoded fields (not all of them have to be run-end-encoded), and if the whole struct repeats you can share the same `run_ends` array among the fields (no copying needed).
Describe the bug, including details regarding any error messages, version, and platform.
When attempting to use a `RunEndEncoded` array with either a `struct` or a `list`, an exception is raised indicating that no matching kernel is available.
Steps to Reproduce:
Please run the following example code to reproduce the issue:
Observed Output:
Expected Behavior:
The code should correctly create a `RunEndEncoded` array using both `struct` and `list` types without raising exceptions.
Environment:
Additional Context:
The failure seems to suggest that the `run_end_encode` function does not currently support `struct` or `list` types, but it's not explicitly documented whether this is intentional or an oversight.
Component(s)
Python