Open alamb opened 1 year ago
One thing to perhaps give thought to is what kernels we need to facilitate this use-case. For example, a core pattern for dictionaries is to evaluate against the child values array and then take the results based on the keys. This avoids generating code for each combination of key and value type. We possibly need something similar for the REE use-case, i.e. some sort of take_runs kernel.
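The dictionary pattern described above can be sketched for the REE case with plain vectors. This is an illustrative sketch only, not the arrow-rs API; `eval_on_runs` and its representation of runs are hypothetical:

```rust
/// Evaluate a predicate once per run value, then expand the per-run
/// results to one result per logical row using the run ends.
/// `run_ends` holds the exclusive logical end of each run, ascending;
/// `values` holds one value per run.
fn eval_on_runs<T, F: Fn(&T) -> bool>(
    run_ends: &[usize],
    values: &[T],
    pred: F,
) -> Vec<bool> {
    let mut out = Vec::with_capacity(*run_ends.last().unwrap_or(&0));
    let mut start = 0;
    for (end, v) in run_ends.iter().zip(values) {
        let r = pred(v); // evaluated once per run, not once per row
        out.extend(std::iter::repeat(r).take(end - start));
        start = *end;
    }
    out
}

fn main() {
    // Logical column [5, 5, 5, 2, 2, 9] encoded as three runs.
    let run_ends = [3usize, 5, 6];
    let values = [5i64, 2, 9];
    let mask = eval_on_runs(&run_ends, &values, |v| *v > 3);
    println!("{mask:?}"); // [true, true, true, false, false, true]
}
```

A `take_runs`-style kernel would follow the same shape: do the expensive work per run, then expand or gather by the run ends.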
I would like to start this by writing REE Array.
Would adding parquet support be in scope? I'm not sure if parquet has RLE directly, but highly compressed (zstd, snappy, zlib) values could be converted to REE arrays instead of being inflated.
Dictionaries are RLE encoded in parquet, so we could theoretically preserve RLE encoding for dictionary arrays, on top of the existing logic to preserve the dictionary encoding. However, the cost and complexity of translating between the two representations is likely to negate much of the potential gains.
zlib
Do you have a link for this? I'm not sure how you would reliably use the codec's internal RLE coding, as it isn't guaranteed to be at a meaningful granularity for the encoded data.
@tustvold I agree that the reading performance might not improve, but downstream operations like filter/join/group-by (if optimized for REE) would definitely make it worthwhile. Absolutely not a day-0 feature.
The zlib/zstd idea is far-fetched, but: 1.) zlib sometimes decides to store data as Z_RLE; it might be worth checking if this happens with Parquet too: https://optipng.sourceforge.net/pngtech/z_rle.html 2.) I was wondering if LZ77 -> REE is possible without fully decompressing the data, something like what is posted here (for a different algorithm): https://www.researchgate.net/publication/261072335_From_Run_Length_Encoding_to_LZ78_and_Back_Again
Both of these are super theoretical and need access to the compression building blocks, i.e. the low-level API (you don't want to ask the library to decompress the data; you want to go compressed -> REE to save big).
I am starting to write Iterator / ArrayAccessor for RunEndEncodedArray. I could see two scenarios for accessing the values (e.g. in take kernels?). My questions are below:

1. Should there be two ArrayAccessor implementations for RunEndEncodedArray, one based on logical index and one based on physical index?
2. ArrayIter cannot be used as-is to implement an Iterator. The REE iterator needs to store two values and needs different increment logic. Is it okay to implement RunEndEncodedArrayIter and not use ArrayIter?

cc: @tustvold @viirya @alamb
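The "two values and different increment logic" point might look like the following sketch: the iterator tracks both the logical position and the current run, so each step is O(1) instead of a binary search per element. This is a hypothetical illustration, not the arrow-rs implementation; `RunIter` and its fields are made-up names:

```rust
/// Run-aware iterator sketch. Assumes `run_ends` holds exclusive,
/// ascending logical offsets and `values` has one entry per run.
struct RunIter<'a, T> {
    run_ends: &'a [usize],
    values: &'a [T],
    logical: usize, // next logical index to yield
    run: usize,     // index of the run containing `logical`
}

impl<'a, T> Iterator for RunIter<'a, T> {
    type Item = &'a T;
    fn next(&mut self) -> Option<&'a T> {
        if self.run >= self.run_ends.len() {
            return None; // past the last run
        }
        let item = &self.values[self.run];
        self.logical += 1;
        // Advance to the next run once we pass this run's exclusive end.
        if self.logical >= self.run_ends[self.run] {
            self.run += 1;
        }
        Some(item)
    }
}

fn main() {
    // Runs covering logical rows [0..2) and [2..5).
    let it = RunIter { run_ends: &[2, 5], values: &["a", "b"], logical: 0, run: 0 };
    let decoded: Vec<_> = it.collect();
    println!("{decoded:?}"); // ["a", "a", "b", "b", "b"]
}
```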
I'm not sure you can implement ArrayAccessor for REEArray as it doesn't know its value type? This is fine imo, we don't implement ArrayAccessor for DictionaryArray for much the same reason.
Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics
However, most kernels I imagine will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar.
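As a rough illustration of that custom logic, a filter can work on the runs without expanding them: count the surviving rows per run, emit a new run for each run that keeps at least one row, and record which values to take. A minimal sketch with hypothetical names, not the arrow-rs API:

```rust
/// Filter a run-end-encoded column by a boolean mask over logical rows.
/// Returns the new run ends and the physical indices to `take` from the
/// values array. `run_ends` holds exclusive, ascending logical offsets.
fn filter_runs(run_ends: &[usize], mask: &[bool]) -> (Vec<usize>, Vec<usize>) {
    let mut new_ends = Vec::new(); // run ends of the filtered column
    let mut take_idx = Vec::new(); // physical indices into the values array
    let (mut start, mut kept) = (0usize, 0usize);
    for (run, &end) in run_ends.iter().enumerate() {
        let survivors = mask[start..end].iter().filter(|&&m| m).count();
        if survivors > 0 {
            kept += survivors;
            new_ends.push(kept);
            take_idx.push(run); // this run's value survives into the output
        }
        start = end;
    }
    (new_ends, take_idx)
}

fn main() {
    // Runs covering logical rows [0..3) and [3..5); keep rows 1, 2 and 4.
    let (ends, takes) = filter_runs(&[3, 5], &[false, true, true, false, true]);
    println!("{ends:?} {takes:?}"); // [2, 3] [0, 1]
}
```

A real kernel would likely also merge adjacent output runs whose values compare equal, which this sketch does not attempt.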
One thing to be extremely careful of, is to avoid generic code typed on both the key type and the value type in our kernels - this explodes codegen and has caused a lot of pain with dictionaries
Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics
Yes, I am planning to model ArrayAccessor after TypedDictionaryArray.
However, most kernels I imagine will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar.
Does it make sense to add a function to RunEndEncodedArray that'll convert from logical index to physical index? I think it can be used by functions that are trying to decode arbitrary indices.
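Such a lookup amounts to a binary search over the run-ends buffer. A sketch, assuming run ends are stored as exclusive, ascending logical offsets (the function name and signature here are illustrative, not the arrow-rs API):

```rust
/// Return the physical index (run number) containing `logical`, or
/// `None` if the index is out of bounds. `partition_point` performs a
/// binary search for the first run whose end is greater than `logical`,
/// i.e. the run that contains it.
fn get_physical_index(run_ends: &[usize], logical: usize) -> Option<usize> {
    if logical >= *run_ends.last()? {
        return None; // out of bounds (also handles an empty array)
    }
    Some(run_ends.partition_point(|&end| end <= logical))
}

fn main() {
    let run_ends = [3usize, 5, 6]; // runs cover [0..3), [3..5), [5..6)
    assert_eq!(get_physical_index(&run_ends, 0), Some(0));
    assert_eq!(get_physical_index(&run_ends, 4), Some(1));
    assert_eq!(get_physical_index(&run_ends, 6), None);
}
```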
One thing to be extremely careful of, is to avoid generic code typed on both the key type and the value type in our kernels - this explodes codegen and has caused a lot of pain with dictionaries
I am planning to model after TypedDictionaryArray, which has generics for both key and value. Would that be a problem?
Would that be a problem?
Yes, wherever possible we should handle the indices and values separately, similarly to how we currently handle dictionaries. Otherwise you catastrophically explode code gen and build times, see https://github.com/apache/arrow-rs/issues/2596 and https://github.com/apache/arrow-datafusion/pull/4999 for some more context. Non-scalar binary kernels are the only ones where I foresee this not being possible, we should gate those behind feature flags like we do for dictionaries.
TypedDictionaryArray is useful for statically typed codepaths, but using it in kernels that must be generated for every combination of types ends up being problematic
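One way to picture the split being advocated above: do the index arithmetic generically over the run-end type only, producing physical indices that a single type-erased `take` can then apply to the values array. A kernel generic over both the run-end type R and the value type V is monomorphized once per (R, V) pair; the function below is compiled once per run-end type only. Names are illustrative, not arrow-rs APIs:

```rust
/// Map every logical row to the physical index of its run, generic only
/// over the run-end integer type `R`. The values array never appears
/// here, so adding a new value type adds no new monomorphizations.
fn logical_to_physical<R: Into<usize> + Copy>(run_ends: &[R]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut start = 0usize;
    for (run, end) in run_ends.iter().enumerate() {
        let end: usize = (*end).into();
        // Every logical row inside this run maps to the same physical index.
        out.extend(std::iter::repeat(run).take(end - start));
        start = end;
    }
    out
}

fn main() {
    let physical = logical_to_physical(&[3u16, 5, 6]);
    println!("{physical:?}"); // [0, 0, 0, 1, 1, 2]
}
```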
Hi, I'm working on raphtory trying to get a query engine off the ground with datafusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE.
Can I start looking into adding support for filtering or is there ongoing work that's required first?
Hi, I'm working on raphtory trying to get a query engine off the ground with datafusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE.
That sounds like an awesome project 🙏
Can I start looking into adding support for filtering or is there ongoing work that's required first?
That would be great.
Some key features for arrays are (I am not sure offhand what REE has already, you would have to do some research)
Something else that might be worth looking at is the ability to read REE arrays directly from parquet (and avoid decoding) if you are reading into arrow.
You might take a look at the list of items we made for StringView to get some inspiration: https://github.com/apache/arrow-rs/issues/5374
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Arrow has added REE support https://github.com/apache/arrow/pull/14176, similar to dictionary arrays that allow repeated values to be encoded in a space-efficient manner that also allows fast processing.
Describe the solution you'd like
Implement REE in arrow-rs. Some likely candidates:
Describe alternatives you've considered
Additional context