apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.57k stars 779 forks source link

Safe API to replace `NullBuffers` for Arrays #6528

Open alamb opened 3 weeks ago

alamb commented 3 weeks ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. While implementing https://github.com/apache/datafusion/pull/12792 and various other things in DataFusion I find myself often wanting to combine a filter and a null mask

The relevant code is like

        // combine existing nulls, if any, and a filter:
        let nulls = filtered_null_mask(opt_filter, input);

        // make a new array with a new null buffer
        let output: ArrayRef = match input.data_type() {
            // TODO it would be nice to have safe apis in arrow-rs to update the null buffers in the arrays
            DataType::Utf8 => {
                let input = input.as_string::<i32>();
                // safety: values / offsets came from a valid string array, so are valid utf8
                // and we checked nulls has the same length as values
                unsafe {
                    Arc::new(StringArray::new_unchecked(
                        input.offsets().clone(),
                        input.values().clone(),
                        nulls,
                    ))
                }
            }

Describe the solution you'd like I would like an API like with_nulls that returns a new array with the same data but a new null mask so my code would look like

        // combine existing nulls, if any, and a filter:
        let nulls = filtered_null_mask(opt_filter, input);

        // make a new array, with the same data but a new null buffer
        // panics if the nulls length doesn't match the array's length
        let output = input.with_nulls(nulls);

Describe alternatives you've considered I can keep using the unsafe APIs

Additional context

tustvold commented 3 weeks ago

I think this makes sense, the only thing to be careful with is arrays where null values may have undefined contents, e.g. dictionaries. In such cases, allowing users to go from a null to not null, could have safety implications

tustvold commented 3 weeks ago

FWIW the nullif kernel is very similar to this, but with the caveat that nulls can only remain null, avoiding the above issue

Edit: in fact the operation you describe is the nullif kernel I think...

tustvold commented 3 days ago

I think the nullif kernel provides this, so perhaps this can be closed?

alamb commented 2 days ago

Sounds good -- the current documentation on nullif is pretty sparse (and thus perhaps we can make it easier to discover / more likely people can find it) with some better docs

https://docs.rs/arrow/latest/arrow/compute/kernels/nullif/fn.nullif.html

I'll try and find some time

alamb commented 1 day ago

PR to improve the docs: https://github.com/apache/arrow-rs/pull/6658

After doing that it is somewhat of the inverse of what I was looking for (it sets the element to null when the mask is true, rather than setting the element to null when the mask is not true).

I will attempt a PR with a proposed API as well for consideration

alamb commented 1 day ago

I made https://github.com/apache/arrow-rs/pull/6659 to add Array::with_nulls but it has the issues @tustvold describes

The actual usecase I have is not to turn elements unnull, but instead make them null if a boolean array is not true (the inverse of what nullif does).

Maybe a better API would be something like nullifnot or similar.

I will ponder

tustvold commented 1 day ago

You could either negate the boolean array, or construct a BooleanArray with a null buffer of the buffer you want to be null if false, and a values buffer of entirely false. Neither is exactly ideal, but the performance, especially of the latter, is likely going to be hard to beat.

alamb commented 1 day ago

🤔

this is now we do it today in DataFusion: https://github.com/apache/datafusion/blob/f23360f1e0f90abcf92a067de55a9053da545ed2/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/nulls.rs#L53-L102

(basically call NullBuffer::Union) maybe that is good enough 🤔