Wrong pre aggregation result for boolean dimension containing NULL values

igorcalabria commented 1 year ago

Describe the bug Hi, we noticed that some of our preAggregations are returning inconsistent values for boolean dimensions. At first, we believed that this might be a import issue and cubestore was simply reporting the data it had, but we queried it directly and confirmed that something strange is happening. With a direct connection to cubestore on mysql port we did the following:

SELECT 
  my_model__bool_col,
  COUNT(*) AS total_entries 
FROM prod_pre_aggregations.my_model_pre_agg GROUP BY 1;

The query returned

+--------------------+---------------+
| my_model__bool_col | total_entries |
+--------------------+---------------+
| true               |          9393 |
| false              |       3888043 |
+--------------------+---------------+

true and false values are reported correctly but it's missing the NULLs. Strangely, we can do the following

SELECT COUNT(*) AS total_entries 
FROM prod_pre_aggregations.my_model_pre_agg 
WHERE my_model__bool_col IS NULL;

And it reports the correct NULL values

+---------------+
| total_entries |
+---------------+
|        464018 |
+---------------+

We're using a deployment with separate workers and a router, but couldn't find anything wrong with our configuration. Besides that, every query that's not aggregating on a boolean dimension seems to work fine, so it seems unlikely that this is caused by a miss configuration.

I even patched cube to log the partial data on each worker before it is returned to the main one and individually each worker already responded with a wrong aggregation (missing the nulls, even tough the partitions contained it)

To Reproduce

I didn't manage to reproduce this with artificial data, but I'll update the issue as soon as I have. I'm opening it now in case somebody has experienced the same thing

Version: 0.34.0

igorcalabria commented 1 year ago

To reproduce this (works on cubecloud too). Run these on cubestore:

CREATE SCHEMA test;

CREATE TABLE test.null_test (id int, bool_col boolean);

INSERT INTO test.null_test (id, bool_col) VALUES (0, false), (1, true), (2, NULL), (3, NULL), (4, true), (5, false);

MySQL [(none)]>  SELECT bool_col, count(*) FROM test.null_test GROUP BY 1;

+----------+-----------------+
| bool_col | COUNT(UInt8(1)) |
+----------+-----------------+
| false    |               4 |
| true     |               2 |
+----------+-----------------+

MySQL [(none)]>  SELECT count(*) FROM test.null_test WHERE bool_col IS NULL;

+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
|               2 |
+-----------------+

This issue goes beyond just dropping NULL values from aggregations, depending on the order, it may return wrong results. For example, if NULLS come first with adjacent False columns.

INSERT INTO test.null_test (id, bool_col) VALUES (0, NULL), (1, NULL), (2, NULL), (3, false), (4, true), (5, true);

MySQL [(none)]>  SELECT bool_col, count(*) FROM test.null_test GROUP BY 1;
+----------+-----------------+
| bool_col | COUNT(UInt8(1)) |
+----------+-----------------+
| NULL     |               4 |
| true     |               2 |
+----------+-----------------+

There are only 3 NULL columns, but it returned 4. Probably because the engine slurped the false as NULL too.

This seems releated to https://github.com/apache/arrow-datafusion/issues/782 which was patched upstream a couple years ago. Cube also applied a similar patch https://github.com/cube-js/arrow-datafusion/commit/e524173ba6ea22ca7652803f128bd4e8e2ebb434 but it's not the same as the one applied upstream. Using cubestore version of datafusion and querying the same parquet file used by cube we get the same result so this points to a problem with cube's version. Using the newest version of datafusion produces the correct results

igorcalabria commented 1 year ago

I think I've found the problem. https://github.com/cube-js/arrow-datafusion/blob/cube/datafusion/src/physical_plan/hash_aggregate.rs#L588 Does not deal with NULLs, só any type where NULL looks like an actual value will be grouped together. This is most apparent with booleans since there are only 2 values (NULL = 0, False = 0, True = 1) but any other type is also susceptible to this bug. For example, I just tested with a INT table and 0 and NULL where grouped together too.

A possible patch is just to use the same strategy that upstream arrow adopted at first in https://github.com/apache/arrow-datafusion/pull/793

diff --git a/datafusion/src/physical_plan/hash_aggregate.rs b/datafusion/src/physical_plan/hash_aggregate.rs
index fc0bf4af8..357b3779a 100644
--- a/datafusion/src/physical_plan/hash_aggregate.rs
+++ b/datafusion/src/physical_plan/hash_aggregate.rs
@@ -783,7 +783,12 @@ pub(crate) fn create_key(
 ) -> Result<()> {
     vec.clear();
     for col in group_by_keys {
-        create_key_for_col(col, row, vec)?
+        if !col.is_valid(row) {
+            vec.push(0xFE);
+        } else {
+            vec.push(0xFF);
+            create_key_for_col(col, row, vec)?
+        }
     }
     Ok(())
 }

This strategy was later dropped for a more optimized approach.

paveltiunov commented 1 year ago

Hey @igorcalabria ! Thanks for posting this one! If you found the cause, please don't hesitate to provide PR.

cube-js / cube

Wrong pre aggregation result for boolean dimension containing NULL values #7197