apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

The SegmentMetadata query returns the thetaSketch column type incorrectly in real-time ingestion range #16982

Open jamangstangs opened 2 months ago

jamangstangs commented 2 months ago

Environment

Description

I am using Kafka ingestion and submit the ingestion task with the following spec (abridged).

...
    "metricsSpec": [
      {
        "name": "uniq_column1",
        "type": "thetaSketch",
        "fieldName": "uniq_column1",
        "size": 16384
      },
      {
        "name": "uniq_column2",
        "type": "thetaSketch",
        "fieldName": "uniq_column1",
        "size": 16384
      }
    ]
...
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 1000000000,
      "maxTotalRows": 1000000000,
      "maxBytesInMemory": -1
    },
...
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "SECOND",
      "rollup": true
    }
...
    "taskDuration": "PT1H"

When I run a segmentMetadata query, the thetaSketch columns come back with both type and typeSignature reported as STRING instead of thetaSketch.

{
  "queryType": "segmentMetadata",
  "dataSource": "datasource",
  "merge": true
}
| column | typeSignature | type | errorMessage |
| --- | --- | --- | --- |
| uniq_column1 | STRING | STRING | error:cannot_merge_diff_types: [thetaSketch] and [thetaSketchBuild] |
| uniq_column2 | STRING | STRING | error:cannot_merge_diff_types: [thetaSketch] and [thetaSketchBuild] |
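The error message suggests that the merge step compares the column type reported by each segment and gives up when they differ. The following is only a toy model of that behavior, inferred from the output above; it is not Druid's actual code. The `ColumnInfo` class, the `merge_column` function, and the `COMPLEX<thetaSketchBuild>` type signature are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnInfo:
    """Hypothetical stand-in for one column entry of a segmentMetadata result."""
    type_signature: str
    type: str
    error_message: Optional[str] = None


def merge_column(a: ColumnInfo, b: ColumnInfo) -> ColumnInfo:
    """Toy merge: identical types pass through, differing types become an error entry."""
    if a.type != b.type:
        # Mirrors the shape of the observed output: STRING placeholders plus an error message.
        return ColumnInfo(
            type_signature="STRING",
            type="STRING",
            error_message=f"error:cannot_merge_diff_types: [{a.type}] and [{b.type}]",
        )
    return a


# Published segments appear to report thetaSketch; the real-time segment appears
# to report thetaSketchBuild (its type signature here is an assumption).
historical = ColumnInfo("COMPLEX<thetaSketch>", "thetaSketch")
realtime = ColumnInfo("COMPLEX<thetaSketchBuild>", "thetaSketchBuild")

merged = merge_column(historical, realtime)
print(merged)
```

Under this model, any interval that spans both published and real-time segments yields the STRING placeholder, which matches what the query above returns.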

However, when I restrict the query intervals so that they exclude the real-time ingestion range, the query returns the correct type.

{
  "queryType": "segmentMetadata",
  "dataSource": "datasource",
  "merge": true,
  "intervals": ["2024-08-30T04:00:00.000Z/2024-09-01T23:00:00.000Z"]
}
| column | typeSignature | type | errorMessage |
| --- | --- | --- | --- |
| uniq_column1 | COMPLEX\<thetaSketch> | thetaSketch | null |
| uniq_column2 | COMPLEX\<thetaSketch> | thetaSketch | null |
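As a stopgap, the interval-restricted query can be generated so that its end always stops short of the real-time range. The sketch below assumes hourly segments (as in the granularitySpec above) and that leaving out the current hour plus one extra hour is enough to skip segments that have not been handed off yet; that lag is a guess and may need tuning. The helper names `iso_millis` and `build_query` are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def iso_millis(ts: datetime) -> str:
    """Format a timezone-aware datetime as ISO 8601 UTC with millisecond precision."""
    utc = ts.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{utc.microsecond // 1000:03d}Z"


def build_query(datasource: str, start: datetime, now: datetime, lag_hours: int = 1) -> dict:
    """Build a segmentMetadata query whose interval ends before the real-time range.

    The end is the start of the current hour minus `lag_hours` (an assumed safety margin).
    """
    end = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=lag_hours)
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "merge": True,
        "intervals": [f"{iso_millis(start)}/{iso_millis(end)}"],
    }


start = datetime(2024, 8, 30, 4, tzinfo=timezone.utc)
now = datetime(2024, 9, 2, 0, 17, 42, tzinfo=timezone.utc)  # example "current" time
query = build_query("datasource", start, now)
print(json.dumps(query, indent=2))
```

With the example timestamps, the generated interval is the same one used in the query above. This only hides the symptom; the merged result for the full range is still wrong.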

I also run a Druid 0.21.0 cluster, and the same query there returns the correct type.

{
  "queryType": "segmentMetadata",
  "dataSource": "datasource",
  "merge": true
}
| column | type | errorMessage |
| --- | --- | --- |
| uniq_column1 | thetaSketch | null |
| uniq_column2 | thetaSketch | null |

The merge appears to fail specifically for thetaSketch columns when the query covers the real-time ingestion range. A similar problem was already fixed in https://github.com/apache/druid/issues/3339, but it still occurs in version 26.0.0.
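One way to confirm which segments disagree is to rerun the query with `merge` set to `false` and compare the per-segment column types. The sketch below assumes the unmerged response is a list of objects with an `id` and a `columns` map, each column carrying a `type` field. The sample data and segment ids are made up for illustration and are not real output from the cluster.

```python
from collections import defaultdict


def find_type_conflicts(segment_results):
    """Return {column: {type: [segment ids]}} for columns reported with more than one type."""
    seen = defaultdict(lambda: defaultdict(list))
    for seg in segment_results:
        for name, col in seg.get("columns", {}).items():
            seen[name][col.get("type")].append(seg.get("id"))
    return {
        name: dict(by_type)
        for name, by_type in seen.items()
        if len(by_type) > 1
    }


# Illustrative (fabricated) per-segment results: one published, one real-time segment.
sample = [
    {
        "id": "published_segment",  # hypothetical id
        "columns": {
            "__time": {"type": "LONG"},
            "uniq_column1": {"type": "thetaSketch"},
        },
    },
    {
        "id": "realtime_segment",  # hypothetical id
        "columns": {
            "__time": {"type": "LONG"},
            "uniq_column1": {"type": "thetaSketchBuild"},
        },
    },
]

conflicts = find_type_conflicts(sample)
print(conflicts)
```

If the real response shows the same split, that would point to the real-time segments reporting the build-time aggregator type rather than the stored sketch type.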

Is there a solution for this, or has it been fixed in a newer version of the Druid cluster?

cryptoe commented 2 months ago

@findingrish Is this something you can look into?

jamangstangs commented 3 weeks ago

I tested with Druid 30.0.0, and the issue still occurs.