Open alamb opened 2 months ago
@alamb I've realised we can significantly improve on this code now used in Datafusion:
If children/columns of the union are marked as nullable=false
, we don't need to check them, that should be significantly faster for sparse unions, not sure it'll help as much on dense unions, but it might improve things a bit.
I think I can have a crack at this.
Thanks @samuelcolvin -- I think this is another reason that putting kernsl into arrow-rs is often a good idea. In the arrow-rs code we can focus on implementing just that kernel and optimizing it really well, rather than the supporting machinery that is required in DataFusion
@samuelcolvin are you porting your implementation to UnionArray::logical_nulls()
? If not, I would like to work on it
@gstvg please go for it.
Describe the bug The
is_null
andis_not_null
kernels are incorrect forUnionArrays
I believe the core problem is that
UnionArray
does not implementArray::logical_nulls
https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-array/src/array/union_array.rs#L445-L525
And instead falls back to the default implementation (which calls
Array::nulls
): https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-array/src/array/mod.rs#L212-L214To Reproduce Run these tests in
is_null.rs
:Full diff
```diff diff --git a/arrow-arith/src/boolean.rs b/arrow-arith/src/boolean.rs index ea8e12abbe2..0dd74a2d0b6 100644 --- a/arrow-arith/src/boolean.rs +++ b/arrow-arith/src/boolean.rs @@ -354,6 +354,8 @@ pub fn is_not_null(input: &dyn Array) -> Result {
mod tests {
use super::*;
use std::sync::Arc;
+ use arrow_buffer::ScalarBuffer;
+ use arrow_schema::{DataType, Field, UnionFields};
#[test]
fn test_bool_array_and() {
@@ -911,4 +913,65 @@ mod tests {
assert_eq!(expected, res);
assert!(res.nulls().is_none());
}
+
+ #[test]
+ fn test_dense_union_is_null() {
+ // union of [{A=1}, {A=}, {B=3.2}, {B=}, {C="a"}, {C=}]
+ let int_array = Int32Array::from(vec![Some(1), None]);
+ let float_array = Float64Array::from(vec![Some(3.2), None]);
+ let str_array = StringArray::from(vec![Some("a"), None]);
+ let type_ids = [0, 0, 1, 1, 2, 2].into_iter().collect::>();
+ let offsets = [0, 1, 0, 1, 0, 1]
+ .into_iter()
+ .collect::>();
+
+ let children = vec![
+ Arc::new(int_array) as Arc,
+ Arc::new(float_array),
+ Arc::new(str_array),
+ ];
+
+ let array =
+ UnionArray::try_new(union_fields(), type_ids, Some(offsets), children)
+ .unwrap();
+
+ let result = is_null(&array).unwrap();
+
+ let expected = &BooleanArray::from(vec![false, true, false, true, false, true]);
+ assert_eq!(expected, &result);
+ }
+
+ #[test]
+ fn test_sparse_union_is_null() {
+ // union of [{A=1}, {A=}, {B=3.2}, {B=}, {C="a"}, {C=}]
+ let int_array = Int32Array::from(vec![Some(1), None, None, None, None, None]);
+ let float_array =
+ Float64Array::from(vec![None, None, Some(3.2), None, None, None]);
+ let str_array = StringArray::from(vec![None, None, None, None, Some("a"), None]);
+ let type_ids = [0, 0, 1, 1, 2, 2].into_iter().collect::>();
+
+ let children = vec![
+ Arc::new(int_array) as Arc,
+ Arc::new(float_array),
+ Arc::new(str_array),
+ ];
+
+ let array =
+ UnionArray::try_new(union_fields(), type_ids, None, children).unwrap();
+
+ let result = is_null(&array).unwrap();
+
+ let expected = &BooleanArray::from(vec![false, true, false, true, false, true]);
+ assert_eq!(expected, &result);
+ }
+
+ fn union_fields() -> UnionFields {
+ [
+ (0, Arc::new(Field::new("A", DataType::Int32, true))),
+ (1, Arc::new(Field::new("B", DataType::Float64, true))),
+ (2, Arc::new(Field::new("C", DataType::Utf8, true))),
+ ]
+ .into_iter()
+ .collect()
+ }
}
```
Expected behavior The tests should pass
Instead they currently error as
is_null
always returns false andis_not_null
always returns true:Additional context This was found by @samuelcolvin on https://github.com/apache/datafusion/issues/11162
The reproducers are from his PR on https://github.com/apache/datafusion/pull/11321