apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Relax column type check in RecordBatch initialization #3605

Open viirya opened 1 year ago

viirya commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

While working on an internal project that incorporates Arrow/DataFusion, we encountered some issues when reading both non-dictionary and dictionary arrays for the same column.

During initialization, RecordBatch checks that each field type in the given schema matches the type of the corresponding column array. Two issues are annoying:

  1. Before reading in the arrays, we only know the data type of the column (not the Arrow data type), so we don't know whether it is dictionary-encoded or not. I.e., we only know it is T but don't know whether it is Dictionary<_, T>.
  2. When reading different files, it is possible that the same column is dictionary-encoded in only some of the files rather than all of them. This causes a schema mismatch later.

I'm wondering if we can relax the column type check to allow compatible pairs, like T and Dictionary<_, T>.

Describe the solution you'd like

Describe alternatives you've considered

The alternative is to not do this, i.e. keep the strict check.

Additional context

jhorstmann commented 1 year ago

Interesting use case.

Looking at our codebase I think there are many places that assume `batch.columns()[n].data_type() == batch.schema().fields()[n].data_type()`. We have a separate data type enum at the logical plan level though, so that could indicate a logical type of string, and then during physical execution it could be represented as StringArray, DictionaryArray or even LargeStringArray.

tustvold commented 1 year ago

I think it would be confusing for the RecordBatch to have a schema that doesn't match its contents, especially as this might have non-trivial implications for things like IPC. That being said, I could definitely see value in allowing DataFusion to be more relaxed about dictionaries, the only obvious issue being that many kernels won't work on a mix of dictionary and non-dictionary columns

A simple, but naive idea might be to add type coercions to DataFusion's schema adaption logic so that queries over heterogeneous schemas are possible. This would likely entail hydrating dictionaries, but I'm not sure this is necessarily avoidable

crepererum commented 1 year ago

> A simple, but naive idea might be to add type coercions to DataFusion's schema adaption logic so that queries over heterogeneous schemas are possible. This would likely entail hydrating dictionaries, but I'm not sure this is necessarily avoidable

Note that depending on the batches and their sizes, it may be better to dictionary-encode the non-dictionary batches (e.g. when you have a large dictionary-encoded batch and a tiny non-dictionary-encoded one). That's also a good reason to not hide this somewhere in arrow but make this a deliberate step of the DataFusion type coercion.

Note that a similar issue will likely occur with RLE.

viirya commented 1 year ago

Currently, on our integration side, once an inconsistent column is read in (e.g. Dictionary<_, T> for a column of type T), we cast it to T (and vice versa) to make DataFusion happy. I think this is similar to the idea of adding type coercions to DataFusion. I feel it is more reasonable to have this in the upstream projects instead of handling it downstream.

But this brings additional computation cost too. Not to mention that it drops the benefit of dictionary-encoding.

> I think it would be confusing for the RecordBatch to have a schema that doesn't match its contents, especially as this might have non-trivial implications for things like IPC.

Agreed, it doesn't sound good to have a mismatched schema and arrays in a RecordBatch. My idea is to modify (coerce?) the given schema to match the given arrays for the dictionary-encoded ones. Maybe it could be controlled by an option?

> That being said, I could definitely see value in allowing DataFusion to be more relaxed about dictionaries, the only obvious issue being that many kernels won't work on a mix of dictionary and non-dictionary columns

Yea, this is one important concern. But I think this is already an issue even when the schema of the arrays being read in is not changed.