apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520

Open alamb opened 1 year ago

alamb commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Arrow has added REE support (https://github.com/apache/arrow/pull/14176). Similar to dictionary arrays, REE allows repeated values to be encoded in a space-efficient manner that also allows fast processing.

Describe the solution you'd like Implement REE in arrow-rs. Some likely candidates:

Describe alternatives you've considered

Additional context

tustvold commented 1 year ago

One thing to perhaps give thought to is what kernels we need to facilitate this use-case. For example, a core pattern for dictionaries is to evaluate against the child values array and then take the results based on the keys. This avoids generating code for each combination of key and value type. We possibly need something similar for the REE use-case, i.e. some sort of take_runs kernel.
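The dictionary pattern described above can be sketched in plain Rust (no arrow-rs types; `eval_then_take` is a hypothetical name for illustration): evaluate the kernel once per distinct value, then gather the results through the keys.

```rust
// Sketch of the "evaluate against values, then take by keys" pattern.
// This avoids instantiating the kernel for each (key type, value type) pair
// and does the expensive per-value work only once per distinct value.
fn eval_then_take<T: Copy, U: Copy>(values: &[T], keys: &[usize], f: impl Fn(T) -> U) -> Vec<U> {
    // Evaluate the kernel once per distinct value in the child array...
    let evaluated: Vec<U> = values.iter().map(|&v| f(v)).collect();
    // ...then gather the results through the keys, which is a cheap copy.
    keys.iter().map(|&k| evaluated[k]).collect()
}

fn main() {
    let values = [10i32, 20, 30];       // child values array
    let keys = [0usize, 0, 1, 2, 2, 2]; // dictionary keys (logical order)
    let doubled = eval_then_take(&values, &keys, |v| v * 2);
    assert_eq!(doubled, vec![20, 20, 40, 60, 60, 60]);
}
```

A `take_runs` kernel for REE would follow the same shape, with the run-ends buffer playing the role of the keys.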

askoa commented 1 year ago

I would like to start this by writing REE Array.

alippai commented 1 year ago

Would adding parquet support be in scope? I'm not sure if parquet has RLE directly, but highly compressed (zstd, snappy, zlib) values could be converted to REE arrays instead of being inflated.

tustvold commented 1 year ago

Dictionaries are RLE encoded in parquet, so we could theoretically preserve RLE encoding for dictionary arrays, on top of the existing logic to preserve the dictionary encoding. However, the cost and complexity of translating between the two representations is likely to negate much of the potential gains.

zlib

Do you have a link for this? I'm not sure how you would reliably use the codec's internal RLE coding, as it isn't guaranteed to be at a meaningful granularity for the encoded data.

alippai commented 1 year ago

@tustvold I agree that the reading performance might not improve, but the downstream operations like filter/join/group by (if optimized for REE) would definitely make it worthwhile. Absolutely not a day-0 feature.

The zlib/zstd idea is far-fetched, but:

  1. zlib sometimes decides to store data as Z_RLE; it might be worth checking whether this happens with Parquet too: https://optipng.sourceforge.net/pngtech/z_rle.html
  2. I was wondering if LZ77 -> REE is possible without fully decompressing the data, something like what is posted here (for a different algorithm): https://www.researchgate.net/publication/261072335_From_Run_Length_Encoding_to_LZ78_and_Back_Again

Both of these are highly theoretical and need access to the compression building blocks via a low-level API (you don't want to ask the library to decompress the data; you want to go compressed -> REE to get the big savings).

askoa commented 1 year ago

I am starting to write an Iterator / ArrayAccessor for RunEndEncodedArray. I can see two scenarios for accessing the values:

  1. The caller knows the physical index and just wants the value at that physical index.
    • This method is useful when implementing iterators. The iterator can decide in O(1) time whether the physical index has to be increased. After that decision is made, the iterator needs to access the value at the physical index.
  2. The caller does not know the physical index and is trying to access a value at a logical index. A binary search has to be performed to determine the physical index from the logical index. This takes O(log N) time.
    • This method is useful when the caller wants to access only a few random indices (maybe in take kernels?)
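Scenario 1 can be sketched in plain Rust (the struct and fields here are hypothetical stand-ins, not arrow-rs types): the iterator tracks the current physical index and only bumps it when the logical position crosses a run boundary.

```rust
// Sketch of an REE iterator that advances the physical index in O(1):
// `run_ends` holds the exclusive logical end of each run, `values` the
// corresponding run values.
struct ReeIter<'a, T> {
    run_ends: &'a [u32],
    values: &'a [T],
    logical: u32,   // next logical position to yield
    physical: usize, // index of the run containing `logical`
}

impl<'a, T: Copy> Iterator for ReeIter<'a, T> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        if self.physical >= self.run_ends.len() {
            return None; // past the last run
        }
        let v = self.values[self.physical];
        self.logical += 1;
        // O(1) decision: move to the next run only at a run boundary.
        if self.logical >= self.run_ends[self.physical] {
            self.physical += 1;
        }
        Some(v)
    }
}

fn main() {
    // "a" covers logical 0..2, "b" covers logical 2..5.
    let it = ReeIter { run_ends: &[2, 5], values: &["a", "b"], logical: 0, physical: 0 };
    assert_eq!(it.collect::<Vec<_>>(), vec!["a", "a", "b", "b", "b"]);
}
```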

My questions are below:

  1. Is it okay to implement two ArrayAccessor implementations for RunEndEncodedArray, one based on logical index and one based on physical index?
  2. ArrayIter cannot be used as-is to implement an Iterator. The REE iterator needs to store two values and needs different increment logic. Is it okay to implement RunEndEncodedArrayIter and not use ArrayIter?

cc: @tustvold @viirya @alamb

tustvold commented 1 year ago

I'm not sure you can implement ArrayAccessor for REEArray as it doesn't know its value type? This is fine imo, we don't implement ArrayAccessor for DictionaryArray for much the same reason.

Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics

However, most kernels I imagine will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar.
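The take pattern described here can be sketched in plain Rust (hypothetical `take_runs` helper, not an arrow-rs API): binary-search each take index to its physical run, collect the take indices to apply to the values child, and build a new run-ends buffer, merging consecutive hits on the same run.

```rust
// Sketch of take on an REE array: returns (new run ends, take indices for
// the values child). `run_ends` is the exclusive logical end of each run.
fn take_runs(run_ends: &[u32], indices: &[u32]) -> (Vec<u32>, Vec<usize>) {
    let mut new_run_ends: Vec<u32> = Vec::new();
    let mut value_indices: Vec<usize> = Vec::new();
    for &i in indices {
        // partition_point finds the first run whose end exceeds `i`.
        let phys = run_ends.partition_point(|&end| end <= i);
        if value_indices.last() == Some(&phys) {
            // Same run as the previous output row: extend the current run.
            *new_run_ends.last_mut().unwrap() += 1;
        } else {
            // New run: record the physical index and start a run of length 1.
            value_indices.push(phys);
            let prev = new_run_ends.last().copied().unwrap_or(0);
            new_run_ends.push(prev + 1);
        }
    }
    (new_run_ends, value_indices)
}

fn main() {
    // Runs end at 2, 5, 9: physical 0 covers logical 0..2, 1 covers 2..5, 2 covers 5..9.
    let (ends, vals) = take_runs(&[2, 5, 9], &[0, 1, 4, 4, 8]);
    assert_eq!(ends, vec![2, 4, 5]);
    assert_eq!(vals, vec![0, 1, 2]);
}
```

Note the output stays run-encoded and only touches the values array through `value_indices`, keeping the kernel generic over the run-end type alone.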

One thing to be extremely careful of, is to avoid generic code typed on both the key type and the value type in our kernels - this explodes codegen and has caused a lot of pain with dictionaries

askoa commented 1 year ago

Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics

Yes, I am planning to model ArrayAccessor after TypedDictionaryArray.

However, most kernels I imagine will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar.

Does it make sense to add a function to RunEndEncodedArray that'll convert from logical index to physical index? I think it can be used by functions that are trying to decode arbitrary indices.
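The logical-to-physical conversion in question is an O(log N) binary search over the run-ends buffer; a minimal sketch in plain Rust (the function name and slice-based signature are hypothetical, not the eventual arrow-rs API):

```rust
// Sketch: map a logical index to the physical index of the run containing it.
// `run_ends` holds the exclusive logical end of each run, in ascending order.
fn get_physical_index(run_ends: &[u32], logical: u32) -> Option<usize> {
    // Out of bounds if the logical index is at or past the last run end
    // (last() also handles the empty-array case).
    if logical >= *run_ends.last()? {
        return None;
    }
    // The containing run is the first one whose end exceeds `logical`.
    Some(run_ends.partition_point(|&end| end <= logical))
}

fn main() {
    let run_ends = [2u32, 5, 9]; // runs cover logical 0..2, 2..5, 5..9
    assert_eq!(get_physical_index(&run_ends, 0), Some(0));
    assert_eq!(get_physical_index(&run_ends, 4), Some(1));
    assert_eq!(get_physical_index(&run_ends, 8), Some(2));
    assert_eq!(get_physical_index(&run_ends, 9), None);
}
```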

One thing to be extremely careful of, is to avoid generic code typed on both the key type and the value type in our kernels - this explodes codegen and has caused a lot of pain with dictionaries

I am planning to model after TypedDictionaryArray which has generics for both key and value. Would that be a problem?

tustvold commented 1 year ago

Would that be a problem?

Yes, wherever possible we should handle the indices and values separately, similarly to how we currently handle dictionaries. Otherwise you catastrophically explode code gen and build times, see https://github.com/apache/arrow-rs/issues/2596 and https://github.com/apache/arrow-datafusion/pull/4999 for some more context. Non-scalar binary kernels are the only ones where I foresee this not being possible, we should gate those behind feature flags like we do for dictionaries.

TypedDictionaryArray is useful for statically typed codepaths, but using it in kernels that must be generated for every combination of types ends up being problematic

fabianmurariu commented 3 months ago

Hi, I'm working on raphtory trying to get a query engine off the ground with datafusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE.

Can I start looking into adding support for filtering or is there ongoing work that's required first?

alamb commented 3 months ago

Hi, I'm working on raphtory trying to get a query engine off the ground with datafusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE.

That sounds like an awesome project 🙏

Can I start looking into adding support for filtering or is there ongoing work that's required first?

That would be great.

Some key features for arrays are (I am not sure offhand what REE already has; you would have to do some research):

  1. Cast support (to convert back/forth to other types)
  2. Filter support (as you mentioned)
  3. IPC support (for arrow flight, for example)

Something else that might be worth looking at is the ability to read REE arrays directly from parquet (and avoid decoding) if you are reading into arrow.

You might take a look at the list of items we made for StringView to get some inspiration: https://github.com/apache/arrow-rs/issues/5374