apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Consistent Schema Enforcement #4801

Open tustvold opened 1 year ago

tustvold commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Comparing primitive arrays for equality, perhaps in the context of a compute kernel, is relatively straightforward. A DataType::Int8 is equal to DataType::Int8 and not equal to DataType::UInt8.
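For reference, the primitive case really is just the derived equality on DataType; a minimal sketch, assuming the arrow_schema crate (also re-exported as arrow::datatypes):

use arrow_schema::DataType;

// DataType derives PartialEq, so comparing primitive types is a plain ==.
assert_eq!(DataType::Int8, DataType::Int8);
assert_ne!(DataType::Int8, DataType::UInt8);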

For nested types such as StructArray, ListArray and RecordBatch this gets more complex: how strictly should we enforce that a schema is consistent? Should we allow an array to be of a different type than its schema? What about nullability or metadata?
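To make the nested case concrete, a sketch assuming a recent arrow-rs where DataType::List holds an Arc<Field>: two list types describing interchangeable data still compare unequal if the child field's name or nullability differs.

use std::sync::Arc;
use arrow_schema::{DataType, Field};

// Same child value type, but the child Field differs in name and nullability.
let a = DataType::List(Arc::new(Field::new("item", DataType::Int32, true)));
let b = DataType::List(Arc::new(Field::new("element", DataType::Int32, false)));

// Derived equality compares the nested Field too, so these are not equal.
assert_ne!(a, b);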

We currently have a range of approaches:

Describe the solution you'd like

I don't really know. Eagerly performing validation can help to catch bugs and issues, but on the flip side it is frustrating to be validating things like field names, metadata, or even nullability, which in most cases won't make a difference to correctness

Describe alternatives you've considered

Additional context

#1888

#3226

#4799

alamb commented 1 year ago

More context: https://github.com/apache/arrow-rs/pull/4800#issuecomment-1711981166

I wonder if we could have "strict" and "non strict" schema checking -- e.g. for some things like arrow-json, where there is a configuration object, there is a natural place to add enforce_schema_equality: bool which users can then turn off if they don't want that definition of equality

However, functions like concat_batches don't have a natural way to configure behavior like this (ConcatOptions seems like maybe it would be overkill 🤔 )
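For illustration only, a rough sketch of what such an opt-in flag could look like on a reader's options struct; the struct and the enforce_schema_equality field are hypothetical names, not an existing arrow-json API:

// Hypothetical sketch, not an existing arrow-rs type.
#[derive(Debug, Clone)]
pub struct JsonReaderOptions {
    /// When true, require decoded batches to match the provided schema exactly
    /// (field names, nullability, metadata); when false, only require the
    /// data types to line up.
    pub enforce_schema_equality: bool,
}

impl Default for JsonReaderOptions {
    fn default() -> Self {
        Self { enforce_schema_equality: true }
    }
}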

tustvold commented 1 year ago

Perhaps concat, etc. could take an explicit schema; this would also sidestep the issues around an empty slice...

I could also be convinced to do away with all the validation, and just do explicit validation in the places it matters to correctness - e.g. parquet and nullability

alamb commented 1 year ago

concat_batches already does take an explicit schema 🤔

https://docs.rs/arrow/latest/arrow/compute/fn.concat_batches.html

pub fn concat_batches<'a>(
    schema: &Arc<Schema>,
    input_batches: impl IntoIterator<Item = &'a RecordBatch>
) -> Result<RecordBatch, ArrowError>
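For context, a small usage sketch of that signature, assuming a recent arrow version where concat_batches accepts any iterator of batches; the output batch simply carries the schema that was passed in:

use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn concat_example() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let x = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2])) as ArrayRef],
    )?;
    let y = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![3])) as ArrayRef],
    )?;
    // The result uses the schema argument, not one re-derived from the inputs.
    concat_batches(&schema, [&x, &y])
}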
tustvold commented 1 year ago

In which case perhaps that is the answer to #4800, just use the provided schema and don't perform any additional validation?

wjones127 commented 1 year ago

There are some field names that are kind of useless (such as those in map and list). In Arrow C++, we disabled checking those as part of equality unless specifically asked for. https://github.com/apache/arrow/pull/14847

Field metadata definitely matters, since that may contain extension type information.
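For example (a sketch assuming Field::with_metadata and the ARROW:extension:name convention), dropping field metadata from equality would make an extension-typed field indistinguishable from its plain storage type:

use std::collections::HashMap;
use arrow_schema::{DataType, Field};

let plain = Field::new("id", DataType::FixedSizeBinary(16), false);
// Extension type information lives entirely in field metadata.
let uuid = plain.clone().with_metadata(HashMap::from([(
    "ARROW:extension:name".to_string(),
    "arrow.uuid".to_string(),
)]));

// Field equality includes metadata, so these two fields differ.
assert_ne!(plain, uuid);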

I'm not sure about top-level schema metadata. In many cases I think I'd be fine ignoring that by default, or at least I haven't encountered a situation where I really wanted it.

waynexia commented 1 year ago

There are some field names that are kind of useless (such as those in map and list)

What do you think of removing field names from those types? I find them a bit annoying sometimes. Or is there any place where it matters?