Open tustvold opened 1 year ago
More context: https://github.com/apache/arrow-rs/pull/4800#issuecomment-1711981166
I wonder if we could have "strict" and "non strict" schema checking -- e.g. for some things like arrow-json, where there is a configuration object, there is a natural place to add `enforce_schema_equality: bool`, which users can then turn off if they don't want that definition of equality.
However, functions like `concat_batches` don't have a natural way to configure behavior like this (a `ConcatOptions` seems like it might be overkill 🤔).
Perhaps concat, etc... could take an explicit schema, this would also sidestep the issues around an empty slice...
I could also be convinced to do away with all the validation, and just do explicit validation in the places it matters to correctness - e.g. parquet and nullability
`concat_batches` already does take an explicit schema 🤔
https://docs.rs/arrow/latest/arrow/compute/fn.concat_batches.html
```rust
pub fn concat_batches<'a>(
    schema: &Arc<Schema, Global>,
    input_batches: impl IntoIterator<Item = &'a RecordBatch>
) -> Result<RecordBatch, ArrowError>
```
In which case perhaps that is the answer to #4800, just use the provided schema and don't perform any additional validation?
There are some field names that are kind of useless (such as those in `map` and `list`). In Arrow C++, we disabled checking those as part of equality unless specifically asked for. https://github.com/apache/arrow/pull/14847
Field metadata definitely matters, since that may contain extension type information.
I'm not sure about top-level schema metadata. In many cases I think I'd be fine ignoring that by default, or at least I haven't encountered a situation where I really wanted it.
> There are some field names that are kind of useless (such as those in map and list)

What do you think of removing field names from those types? I find them a bit annoying sometimes. Or is there any place it matters?
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Comparing primitive arrays for equality, perhaps in the context of a compute kernel, is relatively straightforward: a `DataType::Int8` is equal to `DataType::Int8` and not equal to `DataType::UInt8`.

For nested types such as `StructArray`, `ListArray` and `RecordBatch` this gets more complex: how strictly should we enforce that a schema is consistent? Should we allow an array to be of a different type to its schema? What about nullability or metadata?

We currently have a range of approaches:

- `DataType::equals_datatype`, which ignores metadata and field names, but validates nullability
- `Schema::contains` on the provided batch schema, which forces nullability and metadata to be a subset

Describe the solution you'd like
I don't really know. Eagerly performing validation can help to catch bugs and issues, but on the flip side it is frustrating to be validating things like field names, metadata, or even nullability, that in most cases won't make a difference to correctness.
Describe alternatives you've considered
Additional context
#1888
#3226
#4799