harnasz closed this issue 1 year ago.
I've done a bit of digging on this, and the bug applies to all string dimension columns in the IncrementalIndexStorageAdapter: regardless of whether a multi-value was actually inserted into a column, this storage adapter reports all string columns as multi-value.

For example, for the case above, while the data has not yet been persisted, a segment metadata query returns this:
```json
"currency": {
  "cardinality": 2,
  "errorMessage": null,
  "hasMultipleValues": true,
  "maxValue": "GBP",
  "minValue": "EUR",
  "size": 0,
  "type": "STRING"
}
```
The persisted data, by contrast, correctly returns `hasMultipleValues` as false. This results in inconsistencies when using any kind of string function against a dimension column that has not yet been persisted versus data that has been persisted, so I think this problem is bigger than just the above report.
I verified this by amending the following to return false, and the query then correctly returns just a string value instead of an array. However, I'm aware this may break multi-value handling on ingestion.
Happy to take a look further if anyone can offer any advice as to how to tackle this problem. I believe @gianm wrote some of this code; I'm hoping you might be able to offer some advice?
Thanks
I noticed this same issue recently. @Synforge were you able to contribute to a solution for this?
At the moment we've just used the following workaround (in a native query) to ensure any "multi value" dimensions are flattened in the response for rows still in the heap:

```sql
array_to_string(non_mv_dimension, '')
```
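The effect of this workaround can be illustrated with a small sketch, with plain Python standing in for the Druid SQL function (the function name mirrors the SQL above; the dispatch logic is illustrative, not Druid's implementation):

```python
def array_to_string(value, delimiter):
    """Mimic the useful property of ARRAY_TO_STRING for this bug: a value
    that arrives as a single-element array (the unpersisted in-heap case)
    and a value that arrives as a plain string both flatten to the same
    scalar string."""
    if isinstance(value, list):
        return delimiter.join(value)
    return value

# A row still in heap may surface the dimension as ["EUR"]; a persisted
# row surfaces it as the plain string "EUR". Both flatten to "EUR".
print(array_to_string(["EUR"], ""))  # -> EUR
print(array_to_string("EUR", ""))    # -> EUR
```

Either way the query sees the same scalar, which is why the flattening hides the heap/persisted inconsistency.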
We are seeing issues when using Kafka streaming ingestion and then querying with the `CONCAT` expression. We believe this could be due to the rows being in the aggregate heap memory and not yet being persisted to the segments. We have witnessed the following behaviors of the expression, which we believe occur while the rows are still in the heap and yet to be persisted:

- `["` and `"]` are added to the result of the expression while the rows are still within the heap.
- Using the `CONCAT` function as the key to retrieve a value from the lookup, in an aggregated query nested in the `SUM` function, does not return the value of the lookup even though the key exists in the lookup.

We do not witness these behaviors when using batch ingestion, or when we set `"maxRowsPerSegment"` to `1` in the Kafka tuning config. See below for more detail.
Affected Version

0.16 and 0.17
Description
Cluster size
Single server, using the quickstart `bin/start-single-server-small`.
Configurations in use
Using the default configuration located at `conf/druid/single-server/small`, but with MySQL for the metadata storage and globally cached lookups enabled.
Steps to reproduce the problem
Setup a Kafka Supervisor
Note `"maxRowsPerSegment": 3` in the `tuningConfig`; we will refer to this later.

Setup Lookups
Place the contents below into `/tmp/currencies.csv`. Then create the lookup:
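(The original CSV contents and lookup-creation request were not preserved in this extract. As a rough illustration only, with hypothetical currency codes and rates rather than the original data, a globally cached CSV lookup behaves like a simple key/value map, which is why the bracket-wrapped keys described later miss:)

```python
# Hypothetical stand-in for /tmp/currencies.csv; rates are illustrative.
csv_text = "EUR,1.0\nGBP,1.17\n"

# A map-based lookup, as a cached CSV lookup would expose it.
lookup = dict(line.split(",") for line in csv_text.strip().splitlines())
print(lookup["GBP"])          # -> 1.17
print(lookup.get('["GBP"]'))  # -> None: a bracket-wrapped key finds nothing
```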
Running Through the Problem
Using kafkacat, execute the following two commands to produce the messages:
Then run the following query:
and you will see the results below:
If you then run the following query, which uses the `SUM` aggregate function, you will see the results below:
From the results of Query 1, the values in the `concat_expression` column are wrapped with `["` and `"]`. Using the LTRIM and RTRIM functions to trim `["` and `"]` has no impact on the value from the expression. The results of Query 2 are not wrapped with `["` and `"]`.

As part of our query, we want to use the `concat_expression` result in a lookup. If we run the following query:

We get the following result:
The `Converted Total Value` incorrectly returns 0; however, the lookup values are returned for the given `currency` when the lookup is not used in tandem with the `SUM` expression.

Moving on.
If you then produce another message of:
and then rerun the query below, you will see:
From the results of Query 3, after 3 values have been ingested, the values in the `concat_expression` column are no longer wrapped with `["` and `"]`.

If we then re-run the query below:

We get the following:

With `Converted Total Value` returning the expected results. We believe this is because when two rows have been ingested they are aggregated in heap memory, but when the third row is ingested the rows get persisted to a segment, due to `"maxRowsPerSegment"` being set to 3.
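That hypothesis can be sketched as a toy row buffer (plain Python; a deliberate simplification — Druid's real persist logic also considers `intermediatePersistPeriod` and other triggers):

```python
class ToyIngestBuffer:
    """Toy model of the hypothesis above: rows accumulate in heap until the
    row count reaches max_rows_per_segment, at which point the batch is
    persisted to a segment."""
    def __init__(self, max_rows_per_segment):
        self.max_rows = max_rows_per_segment
        self.heap = []      # queryable, but string columns misreported as multi-value
        self.segments = []  # persisted rows behave correctly

    def ingest(self, row):
        self.heap.append(row)
        if len(self.heap) >= self.max_rows:
            self.segments.append(list(self.heap))
            self.heap.clear()

buf = ToyIngestBuffer(max_rows_per_segment=3)
for currency in ["EUR", "GBP"]:
    buf.ingest({"currency": currency})
print(len(buf.heap), len(buf.segments))  # -> 2 0  (still in heap: Queries 1 and 2)

buf.ingest({"currency": "EUR"})
print(len(buf.heap), len(buf.segments))  # -> 0 1  (persisted: Queries 3 and 4)
```

This matches the observation that the third message "fixes" the query results: it is the persist, not the message itself, that changes the behavior.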
Any debugging that you have already done
The only debugging that has been carried out is changing the tuning config values, reducing `intermediatePersistPeriod` or `maxRowsPerSegment` to force the rows to be persisted to segments.