apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Support creating arrays with non-nullable elements in `make_array` #11923

Open Kimahriman opened 1 month ago

Kimahriman commented 1 month ago

Is your feature request related to a problem or challenge?

make_array currently hard-codes a nullable element in both its return type and the arrays returned from invoke.

Describe the solution you'd like

Check the input expressions; if all of them are non-nullable, the array element should be non-nullable as well.

Describe alternatives you've considered

No response

Additional context

Discovered this while trying to implement array creation support in datafusion-comet

jayzhan211 commented 1 month ago

What might be the downside if the array element is always nullable? Losing the possibility of non-null optimizations? If most functions need this, we might not need return_type anymore but return_type_from_exprs instead. That is a huge change.

Kimahriman commented 1 month ago

What might be the downside if the array element is always nullable? Losing the possibility of non-null optimizations?

That would be the main thing. I came across this while trying to integrate array support into datafusion-comet, where it was inconsistent with Spark's handling of arrays. Maybe not the best reason to change, but it also seems odd to track nullability and then not track it correctly.

If most functions need this, we might not need return_type anymore but return_type_from_exprs instead. That is a huge change.

The actual change per function isn't that large; I'm working on what the make_array update would look like.

Kimahriman commented 1 month ago

One of the downstream effects would be a very slight increase in encoding/decoding work for Parquet files with arrays of non-nullable elements.

jayzhan211 commented 1 month ago

@alamb, I think we could have a breaking change on return_type with ReturnTypeArgs, and remove return_type_from_exprs.

fn return_type(&self, _args: ReturnTypeArgs) -> Result<DataType> {

struct ReturnTypeArgs<'a> {
    args: &'a [Expr],
    schema: &'a dyn ExprSchema,
    arg_types: &'a [DataType],
}

Alternatively, we could switch all the functions from return_type to return_type_from_exprs, since I think most of them would derive the same nullability from their inputs.

Kimahriman commented 1 month ago

After digging a little more, it seems there are some other issues that might make this significantly more difficult:

jayzhan211 commented 1 month ago

We could add another invoke_with_schema function to get nullability from the schema. Of course, that's another incompatible change 😢

We have schema here https://github.com/apache/datafusion/blob/18193e6224603c92ce1ab16136ffcd926ca267b5/datafusion/physical-expr/src/scalar_function.rs#L214-L239

jayzhan211 commented 1 month ago

I remember there was a discussion about replacing ColumnarValue::Array(ArrayRef) with RecordBatch, which contains a Schema:

#[derive(Clone, Debug, PartialEq)]
pub struct RecordBatch {
    schema: SchemaRef,
    columns: Vec<Arc<dyn Array>>,

    /// The number of rows in this RecordBatch
    ///
    /// This is stored separately from the columns to handle the case of no columns
    row_count: usize,
}

Maybe RecordBatch is what we need for invoke 🤔