apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Remove dictionary type from Arrow logical type #4729

Open sunchao opened 10 months ago

sunchao commented 10 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We've seen a lot of issues related to dictionary array support in both arrow-rs and DataFusion. One of the main reasons, I think, is that arrow-rs treats the dictionary type as part of its logical type.

A better approach IMO is to consider the dictionary type a physical-type property and hide it from the logical DataType. Correspondingly, arrow-rs shouldn't maintain separate Int32Array and Int32DictionaryArray, etc., but rather unify the two and hide the encoding details inside the array implementation.

Describe the solution you'd like

Describe alternatives you've considered

Not doing it, and living with the complexities.

Additional context

tustvold commented 10 months ago

I'm really not sure about this for a couple of reasons:

- It would add a huge amount of additional API complexity, as the arrays would go from having a well defined layout to having potentially different layouts - functions like PrimitiveArray::values, PrimitiveArray::new would need to change, etc...
- It would add a branch on value access, which will break vectorisation of most kernels
- You would end up with duplicated dictionary logic spread across multiple array types
- It is unclear how one would type the dictionary keys
- This model does not seem to be followed by any arrow implementations AFAICT?
- I'm unclear on the benefits of dictionary encoding primitives, in most cases the representation will be larger and slower to process. A Dictionary<Int32, Int32Array> can only ever be larger than the corresponding Int32Array
- IMO DataType IS the physical type, it provides separate physical representations for the same logical types like decimals, intervals, floats, strings, etc... Query engines can then typically implement a logical type system on top of this

Perhaps you could expand upon what it is you are trying to achieve here

sunchao commented 10 months ago

Thanks @tustvold for the quick reply! I'm mainly throwing out ideas for discussion here, since we've been struggling with dictionary types for a while now in our internal integration with Arrow & DF.

It would add a huge amount of additional API complexity, as the arrays would go from having a well defined layout to having potentially different layouts - functions like PrimitiveArray::values, PrimitiveArray::new would need to change, etc...

Hmm, why would it add API complexity? I'm thinking PrimitiveArray::values could still share the same impl when the array is non-dictionary, but return unpacked values when it is. PrimitiveArray::new doesn't have to change, and we can have a separate constructor for the dictionary case (such as PrimitiveArray::new_dict).
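
For illustration, a minimal sketch of the unified shape being proposed; the `keys` field, `new_dict` constructor, and `Vec`-backed storage are hypothetical simplifications, not the current arrow-rs `PrimitiveArray` API:

```rust
// Hypothetical sketch only: today's arrow-rs PrimitiveArray has no `keys`
// field or `new_dict` constructor, and uses Buffer rather than Vec.
struct PrimitiveArray<T> {
    /// Dictionary values when `keys` is Some, otherwise the plain values.
    values: Vec<T>,
    /// Optional dictionary encoding: indices into `values`.
    keys: Option<Vec<u32>>,
}

impl<T: Copy> PrimitiveArray<T> {
    /// Plain (non-dictionary) constructor, unchanged from today.
    fn new(values: Vec<T>) -> Self {
        Self { values, keys: None }
    }

    /// Separate constructor for the dictionary-encoded case.
    fn new_dict(keys: Vec<u32>, values: Vec<T>) -> Self {
        Self { values, keys: Some(keys) }
    }

    /// Element access hides the encoding from the caller.
    fn value(&self, i: usize) -> T {
        match &self.keys {
            None => self.values[i],
            Some(keys) => self.values[keys[i] as usize],
        }
    }
}
```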

It would add a branch on value access, which will break vectorisation of most kernels

I'm hoping the compiler is smart enough to optimize this away. I briefly tried some similar ideas here in Godbolt and it looks like LLVM can indeed do that. I also ran some benchmarks with basic add_scalar and add_dyn and didn't see any regression AFAICT.

You would end up with duplicated dictionary logic spread across multiple array types

Yes, this is true, but I think it can be avoided, perhaps with some trait-level default methods.

It is unclear how one would type the dictionary keys

TBH I don't have good ideas on this without breaking type safety. One thing we can do is to follow Arrow C++ and let the compiler optimize away the branching.

This model does not seem to be followed by any arrow implementations AFAICT?

Right, but I don't think this can be used as a strong reason not to do it, though. The idea is partially inspired by Velox, which is compatible with Arrow.

I'm unclear on the benefits of dictionary encoding primitives, in most cases the representation will be larger and slower to process. A Dictionary<Int32, Int32Array> can only ever be larger than the corresponding Int32Array

Yes, in certain cases it will be more beneficial to unpack primitive dictionary arrays first, as you've shown in https://github.com/apache/arrow-rs/pull/4407 :). This is orthogonal to this issue though, since the caller can decide whether to encode dictionary primitives or not. I think primitive dictionaries may still be more beneficial in some other cases (e.g., filter)

IMO DataType IS the physical type, it provides separate physical representations for the same logical types like decimals, intervals, floats, strings, etc... Query engines can then typically implement a logical type system on top of this

That is true. In this specific case I'm talking about DataFusion though, which uses DataType as its logical type, I think?

tustvold commented 10 months ago

TLDR I would strongly resist changes to this design without extremely compelling empirical data to support it. Supporting dictionaries in the limited fashion we do has already caused more pain and misery than any other single piece of functionality, and I'm frankly somewhat terrified of broadening the scope of dictionary support further, especially in a manner that would implicate downstreams that may not even care about dictionaries...

since we've been struggling with dictionary types for a while now in our internal integration with Arrow & DF.

Have you tried just not using dictionaries? I'm being somewhat glib, but do you have empirical data to back up their usage?

The reason I ask is this is now the direction that we're taking with IOx. We've found that even for strings, unless the data is extremely low cardinality, the overheads of handling dictionaries massively outweigh any gains (by orders of magnitude). Once StringView lands we intend to replace almost all of our use of string dictionaries with them.

Hmm, why would it add API complexity

Because currently if I have a PrimitiveArray, I know it has a contiguous data buffer, with efficient per-element access, defined values for null slots, etc... All of this is part of the current API contract. If instead the encoding is indeterminate, how to efficiently process the data also becomes indeterminate :sweat_smile:

On a more holistic level, pushing dictionaries into PrimitiveArray and friends would implicitly force all kernels to support dictionaries. This would either need to be custom logic, something we have found to be extremely painful to maintain and result in huge amounts of codegen, or would silently materialize the dictionary, potentially blowing the memory budget and having poor performance.

The current approach where type coercion machinery explicitly materializes dictionaries when necessary, with this explicitly called out in the plan and optimized to be only once, seems a better approach imo...
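
Concretely, a sketch of what that explicit, one-time materialization looks like with the existing cast kernel (illustrative example data, using the umbrella `arrow` crate):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // A low-cardinality string column stored dictionary-encoded.
    let dict: DictionaryArray<Int32Type> =
        vec!["a", "b", "a", "c", "a"].into_iter().collect();
    let dict: ArrayRef = Arc::new(dict);

    // Materialize the dictionary to its value type once, explicitly,
    // instead of every kernel unpacking it implicitly.
    let unpacked = cast(dict.as_ref(), &DataType::Utf8)?;
    let strings = unpacked.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(strings.value(2), "a");
    Ok(())
}
```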

I think primitive dictionaries may still be more beneficial in some other cases (e.g., filter)

Why would filtering a primitive dictionary be faster than filtering primitives, they're the same thing?

I'm hoping the compiler is smart enough to optimize this away

This has not been my experience; LLVM is extremely sensitive to any branches within the loops of kernels. Even the branching of Iterator::next is enough to cause it to not vectorise code. It has taken much futzing with inlining heuristics and unsafe code to get the current kernels to work as they currently do...

I briefly tried some similar ideas here in Godbolt and it looks like LLVM can indeed do that

You need to specify -C target-cpu=native -C opt-level=3 in order to get LLVM to vectorize anything at all. You also need to wrap the vector in std::hint::black_box to stop LLVM using compile time information it wouldn't have in practice. This then showcases what I'm referring to, it generates very suboptimal code for loops containing branches.

One thing we can do is to follow Arrow C++ and let the compiler optimize away the branching.

I would be very surprised if that is getting optimised correctly, I also can't seem to find any uses of it within the C++ kernels - https://github.com/search?q=repo%3Aapache%2Farrow%20GetValueIndex&type=code.

Right, but I don't think this can be used as a strong reason not to do it, though.

Perhaps not a strong reason, but diverging from standard practice normally warrants at least some additional care :smile:

That is true. In this specific case I'm talking about DataFusion though, which uses DataType as its logical type, I think?

DataFusion doesn't really have an explicit notion of logical types, but the various frontends such as SQL do have an implicit notion based around SQL datatypes.

Edit: you may notice I have a bit of a chip on my shoulder when it comes to dictionaries, for which I apologise, I've just found them to be a perennial source of frustration...

sunchao commented 10 months ago

TLDR I would strongly resist changes to this design without extremely compelling empirical data to support it.

Understood. Like I said, I mainly want to invite a discussion on this topic and see if there is anything we can do to improve dictionary support in Arrow :)

Have you tried just not using dictionaries? I'm being somewhat glib, but do you have empirical data to back up their usage?

To better integrate with DF, we are about to do early unpacking of dictionary arrays where the value type is a primitive type, and see how it goes. We still plan to keep the dictionary arrays with string & binary types as they are, though. I'm surprised to hear that even string dictionaries do not perform well in your case. I can collect more data points on this.

On a more holistic level, pushing dictionaries into PrimitiveArray and friends would implicitly force all kernels to support dictionaries. This would either need to be custom logic, something we have found to be extremely painful to maintain and result in huge amounts of codegen, or would silently materialize the dictionary, potentially blowing the memory budget and having poor performance.

The current approach where type coercion machinery explicitly materializes dictionaries when necessary, with this explicitly called out in the plan and optimized to be only once, seems a better approach imo...

Assuming we want to treat dictionary arrays as first-class citizens: in the current model, some kernels may support them while others don't, which causes the inconsistencies we've seen often in the past.

Like you said, the new model forces all the kernels to support dictionaries, but wouldn't this be better than having the inconsistencies? It may not necessarily mean more complexity, since some kernels like filter can operate directly on the unified API without needing to be aware of the underlying encoding.

Why would filtering a primitive dictionary be faster than filtering primitives, they're the same thing?

Because the filter predicate only needs to be evaluated on the dictionary values, whose cardinality could be much lower, especially if the predicate is a complex one like a UDF.
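
For example, something along these lines (a sketch; expensive_predicate is a hypothetical stand-in for the UDF, and string dictionary values are assumed):

```rust
use arrow::array::{Array, BooleanArray, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

/// Hypothetical stand-in for an expensive predicate (e.g. a UDF).
fn expensive_predicate(v: &str) -> bool {
    v.starts_with('a')
}

/// Evaluate the predicate once per distinct dictionary value, then fan the
/// results out through the keys, instead of once per row.
fn filter_mask(dict: &DictionaryArray<Int32Type>) -> BooleanArray {
    let values = dict
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    // The cardinality of `values` can be far lower than the row count.
    let per_value: Vec<bool> = (0..values.len())
        .map(|i| expensive_predicate(values.value(i)))
        .collect();

    // Null keys propagate as nulls in the resulting mask.
    dict.keys()
        .iter()
        .map(|key| key.map(|k| per_value[k as usize]))
        .collect()
}
```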

You need to specify -C target-cpu=native -C opt-level=3 in order to get LLVM to vectorize anything at all. You also need to wrap the vector in std::hint::black_box to stop LLVM using compile time information it wouldn't have in practice. This then showcases what I'm referring to, it generates very suboptimal code for loops containing branches.

Hmm, I tried it, except instead of target-cpu=native I used target-cpu=haswell. The result is here: https://rust.godbolt.org/z/Mbq7Yaeh5. I can still see SIMD getting triggered. Notice the jump chain in the main method: LBB25_7 -> LBB25_10 -> LBB25_15 -> LBB25_16. It appears the compiler can hoist the checks on self.is_dictionary and self.validity_buffer out of the loop.
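
Roughly the shape of code being discussed (an approximation, not the exact Godbolt snippet; the Vector struct and add_scalar loop are illustrative only, with black_box used as suggested above):

```rust
// Approximate shape of the experiment (not the exact Godbolt snippet).
// Built with: -C target-cpu=haswell -C opt-level=3
use std::hint::black_box;

struct Vector {
    values: Vec<i32>,
    /// Dictionary keys; only used when `is_dictionary` is true.
    keys: Vec<u32>,
    is_dictionary: bool,
}

impl Vector {
    fn len(&self) -> usize {
        if self.is_dictionary { self.keys.len() } else { self.values.len() }
    }

    #[inline]
    fn value(&self, i: usize) -> i32 {
        // The branch in question: `is_dictionary` is loop-invariant, so the
        // hope is that LLVM hoists it out of the caller's loop and
        // vectorises both paths.
        if self.is_dictionary {
            self.values[self.keys[i] as usize]
        } else {
            self.values[i]
        }
    }
}

fn add_scalar(v: &Vector, scalar: i32) -> Vec<i32> {
    (0..v.len()).map(|i| v.value(i) + scalar).collect()
}

fn main() {
    // black_box keeps LLVM from exploiting compile-time knowledge of the data.
    let v = black_box(Vector {
        values: (0..1024).collect(),
        keys: vec![],
        is_dictionary: false,
    });
    black_box(add_scalar(&v, black_box(3)));
}
```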

I would be very surprised if that is getting optimised correctly, I also can't seem to find any uses of it within the C++ kernels - https://github.com/search?q=repo%3Aapache%2Farrow%20GetValueIndex&type=code.

Yeah, it would be too bad if it isn't optimized 😂. This function is used in ScalarFromArraySlotImpl::Visit(const DictionaryArray& a), which in turn is called by Array::GetScalar, which seems like a pretty fundamental method.

Edit: you may notice I have a bit of a chip on my shoulder when it comes to dictionaries, for which I apologise, I've just found them to be a perennial source of frustration...

Totally understand. We too share the frustration on this topic.

tustvold commented 10 months ago

Like you said, the new model forces all the kernels to support dictionaries, but wouldn't this be better than having the inconsistencies

I think kernels materializing dictionaries implicitly is a worse UX than DF explicitly doing this with type coercion. In both cases the performance will be poor; at least currently it is explicit and DF can avoid doing this more than once. Broadly speaking, I think all the kernels where it makes sense to accommodate dictionaries now support dictionaries in some form?

Because the filter predicate only needs to be evaluated on the dictionary values, whose cardinality could be much lower, especially if the predicate is a complex one like a UDF.

Aah sorry I thought you meant filter in arrow parlance, i.e. a selection kernel.

Yes there are broadly speaking two types of kernels where dictionaries work well:

However, this requires explicit handling of the "dictionary" case. The proposed new model, so far as I understand it, would not achieve this, and would be no better than DF coercing both inputs to non-dictionary types?

I can still see SIMD getting triggered

Oh it is getting triggered, it is just generating very sub-optimal code :smile: There's a veritable wall of memory shuffle operators, I honestly have a hard time following what LLVM is doing...

which in turn is called by Array::GetScalar, which seems like a pretty fundamental method.

I would not expect array kernels to be calling such a method, rather I'd expect them to be vectorised and operating on the underlying buffers, I could be wrong though...

alamb commented 10 months ago

I have thoughts about this topic which I will write up coherently tomorrow.

sunchao commented 10 months ago

Broadly speaking, I think all the kernels where it makes sense to accommodate dictionaries now support dictionaries in some form?

We ran into some issues with dictionary arrays of primitive value types when migrating to use DF's new aggregation implementation. Unpacking the dictionaries early helped us to bypass the error. We are still in the process of migrating so we'll see 😂

However, this requires explicit handling of the "dictionary" case. The proposed new model, so far as I understand it, would not achieve this, and would be no better than DF coercing both inputs to non-dictionary types?

I think this part can be hidden from the kernels. Perhaps we can implement a map method on PrimitiveArray to handle the dictionary and non-dictionary case?

Slightly off topic :) I'm hoping that we can also define some function API that allows people to implement kernels and UDFs more easily. Something like:

/// A function that only takes a single argument
pub trait Function<T: TypeTrait, R: TypeTrait> {
    fn call(&self, arg: &T::Native, result: &mut R::Native) -> bool;
    fn call_batch(&self, arg: &PlainVector, result: &mut MutableVector) {
        // default implementation here
    }
}

Unifying the dictionary array may help a bit towards that direction.

Oh it is getting triggered, it is just generating very sub-optimal code 😄 There's a veritable wall of memory shuffle operators, I honestly have a hard time following what LLVM is doing...

Yeah, I have no idea why LLVM does that. But it generates the same code even without the is_dictionary flag, so I doubt the flag is the thing to blame :)

tustvold commented 10 months ago

But it generates the same code even without the is_dictionary flag

You will probably also need to pull out the null check, i.e. remove all branches from the loop; then you'll get something sensible

I'm hoping that we can also define some function API that allows people to implement kernels and UDFs more easily

I like this idea; I think @alamb also has some thoughts on this. This might be a less intrusive way to get better coverage of dictionaries. Ideally this might integrate with Datum, but I've so far found that how to handle this well varies depending on the kernel. Certainly for unary kernels something based off AnyDictionaryArray should work well - https://github.com/apache/arrow-rs/pull/4707
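
For unary kernels the shape would be something like the following sketch (written against a concrete Int32 key type rather than AnyDictionaryArray for brevity, with a hypothetical add_scalar_dict helper):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, Int32Array};
use arrow::compute::kernels::arity::unary;
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

/// Hypothetical helper: apply a value-wise op to a dictionary array by
/// transforming only the (low-cardinality) values and reusing the keys.
/// Concrete Int32 keys/values for brevity; a real kernel would presumably
/// be generic over the key type.
fn add_scalar_dict(
    dict: &DictionaryArray<Int32Type>,
    scalar: i32,
) -> Result<ArrayRef, ArrowError> {
    let values = dict
        .values()
        .as_any()
        .downcast_ref::<Int32Array>()
        .ok_or_else(|| ArrowError::InvalidArgumentError("expected Int32 values".into()))?;

    // Only `values.len()` additions rather than one per row.
    let new_values: Int32Array = unary(values, |v| v.wrapping_add(scalar));

    Ok(Arc::new(DictionaryArray::try_new(
        dict.keys().clone(),
        Arc::new(new_values),
    )?))
}
```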

alamb commented 10 months ago

Opinion

Putting aside all practical implementation considerations initially, I think that removing DataType::Dictionary (and DataType::REE and DataType::StringView) from DataType is a good idea as it has the following benefits:

  1. Allows the encoding to change from RecordBatch to RecordBatch, and thus adapt to changing data rather than forcing a single static choice.
  2. Simplifies the type management for downstream systems like DataFusion
  3. Makes it easy to incrementally support new encodings (like REE, StringView) in the future without changes to downstream systems

For example, it likely makes sense for the parquet reader to provide dictionary encoded strings (to match what came out of parquet), and then unpack this data once it hits some kernel that doesn't support Dictionary encoding, or once the data has been filtered down to the point where the dictionary encoding overhead outweighs its benefits

As pointed out above, the major implication is that all the kernels would have to "support" Dictionary encoded data. I don't think this is as bad as it may initially seem: kernels without support for specific encodings could unpack the dictionaries (aka cast to the value type) and proceed. This is my understanding of how DuckDB works.

The primary benefit of the current situation is that it is backwards compatible

Steps forward

So in my mind, if there was some way to get from today to an API where the encoding wasn't part of the types, that would be great.

I was thinking last night maybe we could somehow use the new Datum API to achieve this. Single values are already some special encoding of arrays. Adding some way to encode dictionary information there too might fit.

tustvold commented 10 months ago

The primary benefit of the current situation is that it is backwards compatible

And compatible with the rest of the arrow ecosystem...

Putting aside all practical implementation considerations initially ... it has the following benefits ... So in my mind, if there was some way to get from today to an API where the encoding wasn't part of the types, that would be great.

IMO the issue here is that DF is using the physical arrow type as its logical type system; what if we were to introduce an explicit logical type abstraction to DataFusion? This would not break arrow compatibility whilst also achieving these benefits?

alamb commented 10 months ago

IMO the issue here is that DF is using the physical arrow type as its logical type system; what if we were to introduce an explicit logical type abstraction to DataFusion? This would not break arrow compatibility whilst also achieving these benefits?

This is an intriguing idea. DataFusion already has the DFField and DFSchema wrappers -- perhaps we could use that abstraction (with improvements) to hide the "is it a dictionary" / "is it not a dictionary" from the type system 🤔

tustvold commented 10 months ago

"is it a dictionary" / "is it not a dictionary"

Could probably hide more than just dictionaries; offset sizes and potentially things like decimal precision might also be candidates

sunchao commented 10 months ago

Instead of adding the new logical type in DF, what if we introduce it in arrow-rs? In this way, we can add a new API to Arrow arrays to return this logical type (for instance, Int32Array and Int32DictionaryArray would both return the same logical type). This can then be used by DF and potentially various kernels here.

tustvold commented 10 months ago

new API to Arrow arrays to return this logical type

An alternative might be to implement From<&DataType> for LogicalType; this would then allow writing array.data_type().into()? This would be possible in DF and would allow iteration without getting held up by arrow releases. Eventually we might upstream this abstraction, once stable, but I quite like the idea of DataFusion being able to define its own notion of logical types, which may differ from other arrow based systems.
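
Something along these lines, as a sketch (the LogicalType enum is hypothetical and heavily abbreviated; it isn't an existing DataFusion or arrow-rs type):

```rust
use arrow::datatypes::DataType;

/// Hypothetical and heavily abbreviated; a real LogicalType would cover the
/// full set of types rather than the handful sketched here.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int32,
    Utf8,
    /// Fallback for everything this sketch doesn't model.
    Other(DataType),
}

impl From<&DataType> for LogicalType {
    fn from(dt: &DataType) -> Self {
        match dt {
            // The dictionary wrapper is an encoding detail: recurse into the
            // value type so Int32 and Dictionary(_, Int32) map to the same
            // logical type.
            DataType::Dictionary(_, value_type) => value_type.as_ref().into(),
            DataType::Int32 => LogicalType::Int32,
            DataType::Utf8 => LogicalType::Utf8,
            other => LogicalType::Other(other.clone()),
        }
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    // `array.data_type().into()` would go through the same conversion.
    assert_eq!(LogicalType::from(&dict), LogicalType::Utf8);
}
```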

sunchao commented 10 months ago

An alternative might be to implement From<&DataType> for LogicalType; this would then allow writing array.data_type().into()?

Yes, I like this idea. I agree we can start with DF and eventually move this to arrow-rs once it's stabilized.

alamb commented 10 months ago

Yes, I like this idea. I agree we can start with DF and eventually move this to arrow-rs once it's stabilized.

FWIW I plan to start writing up some of this context as a DataFusion issue hopefully later today

alamb commented 10 months ago

I filed https://github.com/apache/arrow-datafusion/discussions/7421 with a discussion in DataFusion

alamb commented 7 months ago

FYI @yukkit has created a PR showing how LogicalTypes might work in DataFusion: https://github.com/apache/arrow-datafusion/pull/8143. It is a pretty neat idea.