apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.46k stars 729 forks source link

Incorrect values for `is_null` and `is_not_null` on `UnionArray` #6017

Open alamb opened 2 months ago

alamb commented 2 months ago

Describe the bug The is_null and is_not_null kernels are incorrect for UnionArrays

I believe the core problem is that UnionArray does not implement Array::logical_nulls

https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-array/src/array/union_array.rs#L445-L525

And instead falls back to the default implementation (which calls Array::nulls): https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-array/src/array/mod.rs#L212-L214

To Reproduce Run these tests in is_null.rs:


    #[test]
    fn test_null_array_is_not_null() {
        let a = NullArray::new(3);

        let res = is_not_null(&a).unwrap();

        let expected = BooleanArray::from(vec![false, false, false]);

        assert_eq!(expected, res);
        assert!(res.nulls().is_none());
    }

    #[test]
    fn test_dense_union_is_null() {
        // union of [{A=1}, {A=}, {B=3.2}, {B=}, {C="a"}, {C=}]
        let int_array = Int32Array::from(vec![Some(1), None]);
        let float_array = Float64Array::from(vec![Some(3.2), None]);
        let str_array = StringArray::from(vec![Some("a"), None]);
        let type_ids = [0, 0, 1, 1, 2, 2].into_iter().collect::<ScalarBuffer<i8>>();
        let offsets = [0, 1, 0, 1, 0, 1]
            .into_iter()
            .collect::<ScalarBuffer<i32>>();

        let children = vec![
            Arc::new(int_array) as Arc<dyn Array>,
            Arc::new(float_array),
            Arc::new(str_array),
        ];

        let array =
            UnionArray::try_new(union_fields(), type_ids, Some(offsets), children)
                .unwrap();

        let result = is_null(&array).unwrap();

        let expected = &BooleanArray::from(vec![false, true, false, true, false, true]);
        assert_eq!(expected, &result);
    }

    #[test]
    fn test_sparse_union_is_null() {
        // union of [{A=1}, {A=}, {B=3.2}, {B=}, {C="a"}, {C=}]
        let int_array = Int32Array::from(vec![Some(1), None, None, None, None, None]);
        let float_array =
            Float64Array::from(vec![None, None, Some(3.2), None, None, None]);
        let str_array = StringArray::from(vec![None, None, None, None, Some("a"), None]);
        let type_ids = [0, 0, 1, 1, 2, 2].into_iter().collect::<ScalarBuffer<i8>>();

        let children = vec![
            Arc::new(int_array) as Arc<dyn Array>,
            Arc::new(float_array),
            Arc::new(str_array),
        ];

        let array =
            UnionArray::try_new(union_fields(), type_ids, None, children).unwrap();

        let result = is_null(&array).unwrap();

        let expected = &BooleanArray::from(vec![false, true, false, true, false, true]);
        assert_eq!(expected, &result);
    }

    fn union_fields() -> UnionFields {
        [
            (0, Arc::new(Field::new("A", DataType::Int32, true))),
            (1, Arc::new(Field::new("B", DataType::Float64, true))),
            (2, Arc::new(Field::new("C", DataType::Utf8, true))),
        ]
            .into_iter()
            .collect()
    }
Full diff

```diff diff --git a/arrow-arith/src/boolean.rs b/arrow-arith/src/boolean.rs index ea8e12abbe2..0dd74a2d0b6 100644 --- a/arrow-arith/src/boolean.rs +++ b/arrow-arith/src/boolean.rs @@ -354,6 +354,8 @@ pub fn is_not_null(input: &dyn Array) -> Result { mod tests { use super::*; use std::sync::Arc; + use arrow_buffer::ScalarBuffer; + use arrow_schema::{DataType, Field, UnionFields}; #[test] fn test_bool_array_and() { @@ -911,4 +913,65 @@ mod tests { assert_eq!(expected, res); assert!(res.nulls().is_none()); } + + #[test] + fn test_dense_union_is_null() { + // union of [{A=1}, {A=}, {B=3.2}, {B=}, {C="a"}, {C=}] + let int_array = Int32Array::from(vec![Some(1), None]); + let float_array = Float64Array::from(vec![Some(3.2), None]); + let str_array = StringArray::from(vec![Some("a"), None]); + let type_ids = [0, 0, 1, 1, 2, 2].into_iter().collect::>(); + let offsets = [0, 1, 0, 1, 0, 1] + .into_iter() + .collect::>(); + + let children = vec![ + Arc::new(int_array) as Arc, + Arc::new(float_array), + Arc::new(str_array), + ]; + + let array = + UnionArray::try_new(union_fields(), type_ids, Some(offsets), children) + .unwrap(); + + let result = is_null(&array).unwrap(); + + let expected = &BooleanArray::from(vec![false, true, false, true, false, true]); + assert_eq!(expected, &result); + } + + #[test] + fn test_sparse_union_is_null() { + // union of [{A=1}, {A=}, {B=3.2}, {B=}, {C="a"}, {C=}] + let int_array = Int32Array::from(vec![Some(1), None, None, None, None, None]); + let float_array = + Float64Array::from(vec![None, None, Some(3.2), None, None, None]); + let str_array = StringArray::from(vec![None, None, None, None, Some("a"), None]); + let type_ids = [0, 0, 1, 1, 2, 2].into_iter().collect::>(); + + let children = vec![ + Arc::new(int_array) as Arc, + Arc::new(float_array), + Arc::new(str_array), + ]; + + let array = + UnionArray::try_new(union_fields(), type_ids, None, children).unwrap(); + + let result = is_null(&array).unwrap(); + + let expected = &BooleanArray::from(vec![false, true, false, true, false, true]); + assert_eq!(expected, &result); + } + + fn union_fields() -> UnionFields { + [ + (0, Arc::new(Field::new("A", DataType::Int32, true))), + (1, Arc::new(Field::new("B", DataType::Float64, true))), + (2, Arc::new(Field::new("C", DataType::Utf8, true))), + ] + .into_iter() + .collect() + } } ```

Expected behavior The tests should pass

Instead they currently error as is_null always returns false and is_not_null always returns true:

assertion `left == right` failed
  left: BooleanArray
[
  false,
  true,
  false,
  true,
  false,
  true,
]
 right: BooleanArray
[
  false,
  false,
  false,
  false,
  false,
  false,
]

Additional context This was found by @samuelcolvin on https://github.com/apache/datafusion/issues/11162

The reproducers are from his PR on https://github.com/apache/datafusion/pull/11321

samuelcolvin commented 2 months ago

@alamb I've realised we can significantly improve on this code now used in Datafusion:

If children/columns of the union are marked as nullable=false, we don't need to check them, that should be significantly faster for sparse unions, not sure it'll help as much on dense unions, but it might improve things a bit.

I think I can have a crack at this.

alamb commented 2 months ago

Thanks @samuelcolvin -- I think this is another reason that putting kernsl into arrow-rs is often a good idea. In the arrow-rs code we can focus on implementing just that kernel and optimizing it really well, rather than the supporting machinery that is required in DataFusion

gstvg commented 2 weeks ago

@samuelcolvin are you porting your implementation to UnionArray::logical_nulls() ? If not, I would like to work on it

samuelcolvin commented 2 weeks ago

@gstvg please go for it.