Kimahriman opened 1 month ago
What might be the downside if the array is always nullable? Losing the possibility of non-null optimization?

If there is such a need for most of the functions, we might not need `return_type` anymore but `return_type_from_exprs` instead. That is a huge change.
> What might be the downside if the array is always nullable? Losing the possibility of non-null optimization?

That would be the main thing. In this case I came across this while trying to integrate array support into datafusion-comet, and it was inconsistent with Spark's handling of arrays. Maybe not the best reason to change, but it also seems weird to track nullability and yet not track it correctly.
> If there is such a need for most of the functions, we might not need `return_type` anymore but `return_type_from_exprs` instead. That is a huge change.

The actual change per function isn't that much; I'm working on what the `make_array` update would look like. A rough sketch is below.
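For reference, here's one possible shape of that change (illustrative only, not the actual patch; element type resolution is elided, and the `item` field name matches the snippet later in this thread):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};
use datafusion_common::{ExprSchema, Result};
use datafusion_expr::Expr;

// Illustrative only: derive the element's nullability from the input
// expressions instead of hard coding `true`.
fn return_type_from_exprs(
    args: &[Expr],
    schema: &dyn ExprSchema,
    arg_types: &[DataType],
) -> Result<DataType> {
    // The element is nullable only if at least one input expression is nullable.
    let mut nullable = false;
    for arg in args {
        nullable |= arg.nullable(schema)?;
    }
    // Element type resolution elided; reuse the first argument's type here
    // purely for illustration.
    let elem_type = arg_types.first().cloned().unwrap_or(DataType::Null);
    Ok(DataType::List(Arc::new(Field::new("item", elem_type, nullable))))
}
```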
One of the downstream effects would be ever so slightly more efficient encoding/decoding of Parquet files with arrays of non-nullable elements.
@alamb, I think we could have a breaking change on `return_type` with `ReturnTypeArgs`, and remove `return_type_from_exprs`:
```rust
fn return_type(&self, _args: ReturnTypeArgs) -> Result<DataType>;

struct ReturnTypeArgs<'a> {
    args: &'a [Expr],
    schema: &'a dyn ExprSchema,
    arg_types: &'a [DataType],
}
```
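With that shape, a function could compute nullability from the bundled expressions and schema; a small illustrative helper (assuming the fields above are accessible):

```rust
// Illustrative helper against the proposed ReturnTypeArgs: true if any
// input expression is nullable under the given schema.
fn any_input_nullable(args: &ReturnTypeArgs<'_>) -> Result<bool> {
    let mut nullable = false;
    for expr in args.args {
        nullable |= expr.nullable(args.schema)?;
    }
    Ok(nullable)
}
```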
Alternatively, we switch all the functions from `return_type` to `return_type_from_exprs`, since I think most of them would expect the same `nullable` as their input.
After digging a little more, it seems like there are some other issues that might make this significantly more difficult: when `invoke` is called, there's no way to know the logical nullability of the `ColumnarValue`s/`ArrayRef`s being passed in. The only thing available is `Array::is_nullable`, but it looks like all that does is check whether the current array has any nulls, so the result can vary from batch to batch. So inside `invoke` there's no way to know if the field inside
```rust
Ok(Arc::new(GenericListArray::<O>::try_new(
    Arc::new(Field::new("item", data_type, true)),
    OffsetBuffer::new(offsets.into()),
    arrow_array::make_array(data),
    None,
)?))
```
should be logically nullable or not. And arrow will verify this when creating a record batch, which is the error I ran into initially with the nullability not matching: https://github.com/apache/arrow-rs/blob/5c5a94a11f01a286dd03b18af0f11c327a9accc6/arrow-array/src/record_batch.rs#L205
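To illustrate the batch-to-batch problem: `Array::is_nullable`'s default implementation only checks `null_count() != 0`, so the same logical column reports different values depending on the data it happens to hold:

```rust
use arrow_array::{Array, Int32Array};

fn main() {
    // Two batches of the same logical (nullable) column.
    let batch_with_null = Int32Array::from(vec![Some(1), None]);
    let batch_without_null = Int32Array::from(vec![Some(1), Some(2)]);

    // is_nullable() reflects the data, not the declared schema.
    assert!(batch_with_null.is_nullable());
    assert!(!batch_without_null.is_nullable());
}
```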
We could have another `invoke_with_schema` function to get the `nullable` from the schema. Of course, another incompatible change 😢
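For concreteness, the idea might look something like this (the name and signature are purely hypothetical, not an existing DataFusion API):

```rust
use arrow::datatypes::Schema;
use datafusion_common::Result;
use datafusion_expr::ColumnarValue;

// Purely hypothetical: an invoke variant that also receives the schema so a
// UDF can honor logical nullability when building its output arrays.
trait InvokeWithSchema {
    fn invoke_with_schema(&self, args: &[ColumnarValue], schema: &Schema)
        -> Result<ColumnarValue>;
}
```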
We have the schema here: https://github.com/apache/datafusion/blob/18193e6224603c92ce1ab16136ffcd926ca267b5/datafusion/physical-expr/src/scalar_function.rs#L214-L239
I remember there was a discussion about replacing `ColumnarValue::Array(ArrayRef)` with `RecordBatch`, which contains a `Schema`:
```rust
#[derive(Clone, Debug, PartialEq)]
pub struct RecordBatch {
    schema: SchemaRef,
    columns: Vec<Arc<dyn Array>>,

    /// The number of rows in this RecordBatch
    ///
    /// This is stored separately from the columns to handle the case of no columns
    row_count: usize,
}
```
Maybe `RecordBatch` is what we need for `invoke` 🤔
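If `invoke` received a `RecordBatch`, the nullability question would be answerable from its schema rather than from the data; for example (standard arrow-rs API):

```rust
use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn main() -> Result<(), ArrowError> {
    // A field can be declared nullable even when the current batch has no nulls.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, true)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int32Array::from(vec![1, 2]))])?;

    // Logical nullability comes from the schema, not from the data.
    assert!(batch.schema().field(0).is_nullable());
    Ok(())
}
```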
**Is your feature request related to a problem or challenge?**

`make_array` currently hard codes that all return types and arrays returned from `invoke` have a nullable element.

**Describe the solution you'd like**

Check the input expressions, and if all are non-nullable, the array element should be non-nullable.
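To make the requested behavior concrete (a hedged illustration of the desired return type, not current output):

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field};

fn main() {
    // If every input to make_array is a non-nullable Int32, the requested
    // return type would have a non-nullable element field...
    let requested = DataType::List(Arc::new(Field::new("item", DataType::Int32, false)));
    // ...whereas today the element field is always nullable.
    let current = DataType::List(Arc::new(Field::new("item", DataType::Int32, true)));
    assert_ne!(requested, current);
}
```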
**Describe alternatives you've considered**

No response

**Additional context**

Discovered this while trying to implement array creation support in datafusion-comet.