viirya opened 1 year ago
Interesting use case.
Looking at our codebase, I think there are many places that assume `batch.columns()[n].data_type() == batch.schema().fields()[n].data_type()`. We do have a separate data type enum at the logical plan level, though, so that could indicate a logical type of string, which during physical execution could be represented as a `StringArray` or a `DictionaryArray`.
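The logical-vs-physical distinction above can be sketched roughly like this. Note this is only an illustration: the `DataType` enum and `logical_type` helper below are simplified stand-ins, not Arrow's actual API.

```rust
// Simplified stand-in for Arrow's `DataType`, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
    // Dictionary-encoded column: (key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

// The "logical" type ignores dictionary encoding: `Dictionary<_, T>` unwraps to `T`.
fn logical_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::Dictionary(_, value) => logical_type(value),
        other => other,
    }
}

fn main() {
    let plain = DataType::Utf8;
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_ne!(plain, dict); // physically different encodings...
    assert_eq!(logical_type(&plain), logical_type(&dict)); // ...same logical string type
    println!("ok");
}
```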
I think it would be confusing for the RecordBatch to have a schema that doesn't match its contents, especially as this might have non-trivial implications for things like IPC. That being said, I could definitely see value in allowing DataFusion to be more relaxed about dictionaries; the only obvious issue is that many kernels won't work on a mix of dictionary and non-dictionary columns.
A simple but naive idea might be to add type coercions to DataFusion's schema adaptation logic so that queries over heterogeneous schemas are possible. This would likely entail hydrating dictionaries, but I'm not sure this is avoidable.
Note that depending on the batches and their sizes, it may be better to dictionary-encode the non-dictionary batches (e.g. when you have a large dictionary-encoded batch and a tiny non-dictionary-encoded one). That's also a good reason not to hide this somewhere in arrow, but to make it a deliberate step of DataFusion's type coercion.
Note that a similar issue will likely occur with RLE.
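The size-based choice of coercion direction could be sketched as follows. The `Coercion` enum, `choose_coercion` function, and row-count heuristic are all illustrative assumptions, not anything DataFusion actually provides:

```rust
// Illustrative only; DataFusion has no such function.
#[derive(Debug, PartialEq)]
enum Coercion {
    HydrateDictionary,  // decode `Dictionary<_, T>` batches to plain `T`
    EncodeAsDictionary, // dictionary-encode the plain `T` batches
}

/// Pick the cheaper direction: if the dictionary-encoded side dominates,
/// re-encoding the small plain batch avoids hydrating the large one.
fn choose_coercion(dict_rows: usize, plain_rows: usize) -> Coercion {
    if dict_rows > plain_rows {
        Coercion::EncodeAsDictionary
    } else {
        Coercion::HydrateDictionary
    }
}

fn main() {
    // Large dictionary-encoded batch, tiny plain batch:
    assert_eq!(choose_coercion(1_000_000, 100), Coercion::EncodeAsDictionary);
    // The opposite case favors hydrating the dictionary:
    assert_eq!(choose_coercion(100, 1_000_000), Coercion::HydrateDictionary);
    println!("ok");
}
```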
Currently, on our integration side, once an inconsistent column is read in (e.g. `Dictionary<_, T>` for a column of type `T`), we cast it to `T`, and vice versa, to make DataFusion happy. I think this is similar to the idea of adding type coercions to DataFusion. I feel it is more reasonable to have this in an upstream projection instead of handling it downstream.

But this brings additional computation cost too, not to mention that it drops the benefit of dictionary encoding.
> I think it would be confusing for the RecordBatch to have a schema that doesn't match its contents, especially as this might have non-trivial implications for things like IPC.
It doesn't sound good to have a mismatched schema and arrays in a `RecordBatch`. My idea is to modify (coerce?) the given schema to match the given arrays for dictionary-encoded columns. Maybe it could be configured by an option?
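Concretely, such schema coercion might look like the sketch below. `Field`, `DataType`, and `coerce_schema` here are simplified stand-ins invented for illustration, not Arrow's real types:

```rust
// Simplified stand-ins for Arrow's `Field`/`DataType`, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
    Dictionary(Box<DataType>, Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

// `Dictionary<_, T>` unwraps to `T`; anything else is already logical.
fn logical_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::Dictionary(_, value) => logical_type(value),
        other => other,
    }
}

/// Adopt each column's physical type when it is just a dictionary-encoded
/// (or decoded) variant of the declared field type; otherwise keep the field.
fn coerce_schema(fields: &[Field], column_types: &[DataType]) -> Vec<Field> {
    fields
        .iter()
        .zip(column_types)
        .map(|(field, col)| Field {
            name: field.name.clone(),
            data_type: if logical_type(&field.data_type) == logical_type(col) {
                col.clone()
            } else {
                field.data_type.clone()
            },
        })
        .collect()
}

fn main() {
    let schema = vec![Field { name: "s".into(), data_type: DataType::Utf8 }];
    let cols = vec![DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8))];
    let coerced = coerce_schema(&schema, &cols);
    // The schema now carries the dictionary type the array actually has.
    assert_eq!(coerced[0].data_type, cols[0]);
    println!("ok");
}
```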
> That being said, I could definitely see value in allowing DataFusion to be more relaxed about dictionaries, the only obvious issue being that many kernels won't work on a mix of dictionary and non-dictionary columns
Yeah, this is an important concern. But I think this is already an issue even if the schema of the arrays read in is not changed.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While working on an internal project incorporating Arrow/DataFusion, we encountered some issues reading non-dictionary and dictionary arrays for the same column.
During initialization, `RecordBatch` checks that each field type in the given schema matches the corresponding column type in the arrays. But two issues are annoying: for example, we may know a column's type is `T` but not know that it is actually `Dictionary<_, T>`.

I'm wondering if we can relax the column type check to allow compatible types, like `T` to `Dictionary<_, T>`.

Describe the solution you'd like
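A relaxed check could, for instance, compare types modulo dictionary encoding. This is only a sketch: the `DataType` enum and `types_compatible` function below are simplified illustrations, not Arrow's actual validation code:

```rust
// Simplified stand-in for Arrow's `DataType`, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
    Dictionary(Box<DataType>, Box<DataType>),
}

// `Dictionary<_, T>` unwraps to `T`; anything else is already logical.
fn logical_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::Dictionary(_, value) => logical_type(value),
        other => other,
    }
}

/// Relaxed check: a schema field of type `T` accepts a column of type
/// `Dictionary<_, T>` (and vice versa) instead of requiring strict equality.
fn types_compatible(schema_type: &DataType, column_type: &DataType) -> bool {
    logical_type(schema_type) == logical_type(column_type)
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert!(types_compatible(&DataType::Utf8, &dict));
    assert!(!types_compatible(&DataType::Int32, &dict));
    println!("ok");
}
```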
Describe alternatives you've considered
We could simply not do this.
Additional context