apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

MV_CONTAINS function throwing NullPointerException #17440

Closed aaronm-bi closed 2 weeks ago

aaronm-bi commented 3 weeks ago


Affected Version

Druid 30.0.0

Description

We have a task to upgrade Druid from 29.0.0 to 30.0.0, and after the upgrade some SQL queries started to break. Before the upgrade, the following SQL query worked; in Druid 30.0.0 it throws a NullPointerException.

To verify this, we tested the function against the sample data from the Druid docs, and the error still occurred.

v29.0.0 - the query/function works (screenshot)

v30.0.0 - the query fails with a NullPointerException (screenshot)

Source data used: curl -O https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz

Query used:

SELECT
  MV_CONTAINS(JSON_QUERY_ARRAY(agent, '$.type'), 'Browser'),
  agent,
  session
FROM "kttm" 
vivek807 commented 2 weeks ago

The issue persists in the latest version, 31.0, and I am working on a resolution. I will post an update as soon as I have one. Thanks

clintropolis commented 2 weeks ago

Oops, just noticed this issue, sorry for the delayed response.

fwiw a better way to write this query is

SELECT
  ARRAY_CONTAINS(JSON_VALUE(agent, '$.type' RETURNING VARCHAR ARRAY), 'Browser')
...
FROM "kttm_nested_1"

since JSON_VALUE expressions can be optimized quite significantly compared to JSON_QUERY/JSON_QUERY_ARRAY, which process the raw json instead of the specialized nested field columns that are created when ingesting json columns. That said, this is slightly dependent on which Druid version created the json column: in 28+, nested arrays of primitive types are stored much more optimally than in earlier versions of Druid, so if the segments were created with an older version then JSON_VALUE would also fall back to processing the raw data to construct the array.

Also, as a side note, the ARRAY_ functions should be preferred over the MV_ functions when interacting with actual array types, such as those stored in json columns and array columns. The MV_ functions are primarily for use with Druid's older non-standard multi-value string columns, which present themselves as VARCHAR in the SQL layer and are sort of distinct from actual arrays, so mixing these up can sometimes result in strange type inference in the SQL layer when composing more complicated expressions.
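As a quick side-by-side sketch (the column and datasource names in the first query are purely illustrative):

-- "tags" stands in for a multi-value string column, which appears as VARCHAR in the SQL layer
SELECT MV_CONTAINS("tags", 'Browser') FROM "mvd_table"

-- actual ARRAY values, e.g. extracted from a json column, pair with the ARRAY_ functions
SELECT ARRAY_CONTAINS(JSON_VALUE(agent, '$.type' RETURNING VARCHAR ARRAY), 'Browser') FROM "kttm"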

Anyway, the reason the query fails in this form is an optimization for ARRAY_CONTAINS that probably shouldn't happen when it is used with JSON_QUERY_ARRAY, or rather, when the native array_contains expression (which backs both ARRAY_CONTAINS and MV_CONTAINS in SQL) encounters a complex type. JSON_QUERY_ARRAY doesn't really know the type of the elements it is extracting, so the result is handled as COMPLEX<json>; however, when the rhs argument is a literal, the specialization check for primitive arrays tries to cast it to match the element type of the array it will be checking.
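For illustration, since both SQL functions plan down to the same native array_contains expression, the equivalent ARRAY_CONTAINS form over JSON_QUERY_ARRAY should hit the same COMPLEX<json> path as the original query:

SELECT
  ARRAY_CONTAINS(JSON_QUERY_ARRAY(agent, '$.type'), 'Browser'),
  agent,
  session
FROM "kttm"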

This can be fixed by modifying https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/Function.java#L3977 to instead be something like

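        // proposed change: skip the literal-rhs specialization unless the lhs type is a
        // primitive or a primitive array, falling back to per-row type handling instead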
        if (lhsType == null || !(lhsType.isPrimitive() || lhsType.isPrimitiveArray())) {
          return this;
        }

This fixes the issue because it bails out of the optimization path early and falls back to per-row checking of type information to decide how to process the function.

@vivek807 if you'd like to make a PR with that change and add a test for this to https://github.com/apache/druid/blob/master/processing/src/test/java/org/apache/druid/query/expression/NestedDataExpressionsTest.java#L394, I would be happy to approve it; otherwise I can make a PR with a fix later this week. We also need to handle array_overlap(json_query_array(...)), since it has a similar optimization path and will need the same fix (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/Function.java#L3724).
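For reference, a query of the same shape that would exercise that path (assuming ARRAY_OVERLAP in SQL plans to the native array_overlap expression; the literal values are just examples):

SELECT
  ARRAY_OVERLAP(JSON_QUERY_ARRAY(agent, '$.type'), ARRAY['Browser', 'Mobile Browser'])
FROM "kttm"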

vivek807 commented 2 weeks ago

Thanks, @clintropolis, for the detailed explanation. I’ve implemented the fix and updated the unit tests as well. Please review.