apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

index_parallel task fails if segmentGranularity has a timeZone #9993

Open tarpdalton opened 4 years ago

tarpdalton commented 4 years ago

Affected Version

0.18.0 and 0.18.1

Description

The error message or stack traces encountered.

The main error is the ZipException:

2020-06-04T23:39:20,955 INFO [task-runner-0-priority-0] org.apache.druid.utils.CompressionUtils - Unzipping file[var/druid/task/partial_index_merge_datasource_1_geoeiplm_2020-06-04T23:39:16.988Z/work/indexing-tmp/2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z/1/temp_partial_index_generate_datasource_1_ieoldkdf_2020-06-04T23:39:01.964Z] to [var/druid/task/partial_index_merge_datasource_1_geoeiplm_2020-06-04T23:39:16.988Z/work/indexing-tmp/2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z/1/unzipped_partial_index_generate_datasource_1_ieoldkdf_2020-06-04T23:39:01.964Z]
2020-06-04T23:39:20,956 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Exception while running task[AbstractTask{id='partial_index_merge_datasource_1_geoeiplm_2020-06-04T23:39:16.988Z', groupId='index_parallel_datasource_1_jjglpmkc_2020-06-04T23:38:57.541Z', taskResource=TaskResource{availabilityGroup='partial_index_merge_datasource_1_geoeiplm_2020-06-04T23:39:16.988Z', requiredCapacity=1}, dataSource='datasource_1', context={forceTimeChunkLock=true}}]
java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_252]
    at java.util.zip.ZipFile.<init>(ZipFile.java:225) ~[?:1.8.0_252]
    at java.util.zip.ZipFile.<init>(ZipFile.java:155) ~[?:1.8.0_252]
    at java.util.zip.ZipFile.<init>(ZipFile.java:169) ~[?:1.8.0_252]
    at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:250) ~[druid-core-0.18.1.jar:0.18.1]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.fetchSegmentFiles(PartialSegmentMergeTask.java:231) ~[druid-indexing-service-0.18.1.jar:0.18.1]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.runTask(PartialSegmentMergeTask.java:169) ~[druid-indexing-service-0.18.1.jar:0.18.1]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialHashSegmentMergeTask.runTask(PartialHashSegmentMergeTask.java:44) ~[druid-indexing-service-0.18.1.jar:0.18.1]
    at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:123) ~[druid-indexing-service-0.18.1.jar:0.18.1]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:421) [druid-indexing-service-0.18.1.jar:0.18.1]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:393) [druid-indexing-service-0.18.1.jar:0.18.1]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

The unzip fails because findPartitionFile cannot locate the partition file created during the partial_index_generate task: getPartition returns an error message instead of the zip file, so the subsequent unzip blows up with the ZipException above.

The partition file is stored with the timezone offset in the path like this: 2020-04-24T00:00:00.000-04:00/2020-04-25T00:00:00.000-04:00

/tmp/intermediary-segments/index_parallel_datasource_1_iiocmdme_2020-06-04T23:15:56.314Z/2020-04-24T00:00:00.000-04:00/2020-04-25T00:00:00.000-04:00/1/partial_index_generate_datasource_1_cgdlipdp_2020-06-04T23:16:02.960Z

But the HTTP request to getPartition uses the UTC rendering of the same interval: startTime=2020-04-24T04:00:00.000Z&endTime=2020-04-25T04:00:00.000Z

2020-06-04T23:39:20,945 DEBUG [HttpClient-Netty-Worker-0] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://<hostname_removed>:8091/druid/worker/v1/shuffle/task/index_parallel_datasource_1_jjglpmkc_2020-06-04T23%3A38%3A57.541Z/partial_index_generate_datasource_1_ieoldkdf_2020-06-04T23%3A39%3A01.964Z/partition?startTime=2020-04-24T04:00:00.000Z&endTime=2020-04-25T04:00:00.000Z&partitionId=1] Got response: 404 Not Found
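
To make the mismatch concrete, here is a minimal Joda-Time sketch; the path construction is a simplified stand-in for what the shuffle code does, not the actual implementation:

    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;
    import org.joda.time.Interval;

    public class IntervalPathMismatch
    {
      public static void main(String[] args)
      {
        DateTimeZone eastern = DateTimeZone.forID("America/New_York");

        // Interval produced by a P1D segmentGranularity with timeZone=America/New_York.
        Interval interval = new Interval(
            new DateTime(2020, 4, 24, 0, 0, eastern),
            new DateTime(2020, 4, 25, 0, 0, eastern)
        );

        // Simplified stand-in for the directory name the generate task writes.
        String stored = interval.getStart() + "/" + interval.getEnd();
        // -> 2020-04-24T00:00:00.000-04:00/2020-04-25T00:00:00.000-04:00

        // The merge task's getPartition request renders the same instants in UTC.
        String requested = interval.getStart().withZone(DateTimeZone.UTC)
                           + "/" + interval.getEnd().withZone(DateTimeZone.UTC);
        // -> 2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z

        // Same instants, different strings, hence the 404 on lookup.
        System.out.println(stored.equals(requested)); // false
      }
    }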

Any debugging that you have already done

I'm not very familiar with the Druid code, so I'm not sure if there is a simple fix. @jihoonson might know how to fix it, since he is working on https://github.com/apache/druid/issues/8061.

It looks like the startTime and endTime request parameters come from:

partial_index_merge
  spec
    ioConfig
      partitionLocations
        interval

Maybe you could store the interval with the timezone offset instead of the materialized UTC time?
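
Either rendering could work as long as the writer and the reader agree on one. As a hedged sketch (a hypothetical helper, not an actual patch), canonicalizing to UTC wherever the interval becomes a lookup key would look like:

    import org.joda.time.DateTimeZone;
    import org.joda.time.Interval;

    public final class IntervalKeys
    {
      private IntervalKeys() {}

      /**
       * Hypothetical helper: render an interval as a zone-independent key.
       * If both the generate task (when naming the partition directory) and
       * the merge task (when building the getPartition request) used the same
       * canonical rendering, the two sides could never disagree.
       */
      public static String toUtcKey(Interval interval)
      {
        return interval.getStart().withZone(DateTimeZone.UTC)
               + "/" + interval.getEnd().withZone(DateTimeZone.UTC);
      }
    }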

jihoonson commented 4 years ago

@tarpdalton thank you for the detailed report! I don't have a concrete idea to fix the bug right now, but will take a look.

FrankChen021 commented 4 years ago

@jihoonson I don't understand the meaning of setting timeZone and origin on segmentGranularity, and I don't see any documentation about this. There is another segmentGranularity setting problem: #9894.

jihoonson commented 4 years ago

@FrankChen021 it's documented here. #9894 is about duration segment granularity and doesn't seem related to this issue.

FrankChen021 commented 4 years ago

> @FrankChen021 it's documented here. #9894 is about duration segment granularity and doesn't seem related to this issue.

The doc is about query granularity. Although segment granularity shares the same type as query granularity, the doc does not explain why people need to care about the timezone/origin of segment granularity. I don't see any benefit from these two parameters on segment granularity.

jihoonson commented 4 years ago

Yes, the doc should say it can be used for segment granularity as well. However, it is at least linked from https://druid.apache.org/docs/latest/ingestion/index.html#granularityspec.

> The doc is about query granularity. Although segment granularity shares the same type as query granularity, the doc does not explain why people need to care about the timezone/origin of segment granularity. I don't see any benefit from these two parameters on segment granularity.

I'm not sure what you are suggesting. The timezone is useful when your timestamps are in a different timezone from the one your Druid cluster runs in. The origin is useful when you want your time buckets to start at a non-default offset.
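
For concreteness, here is a small sketch against Druid's PeriodGranularity, the class behind "type": "period" granularities (the sample event timestamp and origin are made up):

    import org.apache.druid.java.util.common.granularity.PeriodGranularity;
    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;
    import org.joda.time.Period;

    public class GranularityKnobs
    {
      public static void main(String[] args)
      {
        DateTime event = DateTime.parse("2020-04-24T02:30:00Z");

        // timeZone: P1D buckets aligned to New York midnight rather than UTC midnight.
        PeriodGranularity nyDaily = new PeriodGranularity(
            Period.days(1), null, DateTimeZone.forID("America/New_York"));
        System.out.println(nyDaily.bucketStart(event));
        // bucket starts at 2020-04-23T00:00:00.000-04:00, not at UTC midnight

        // origin: P1D buckets shifted so each "day" starts at 06:00 UTC.
        PeriodGranularity shiftedDaily = new PeriodGranularity(
            Period.days(1), DateTime.parse("2020-01-01T06:00:00Z"), DateTimeZone.UTC);
        System.out.println(shiftedDaily.bucketStart(event));
        // bucket starts at 2020-04-23T06:00:00.000Z
      }
    }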

tarpdalton commented 4 years ago

I'll share my use case for segment granularity. Here is my granularity spec for loading some data:

      "granularitySpec": {
        "segmentGranularity": {
          "type": "period",
          "period": "P1D",
          "timeZone": "America/New_York"
        },
        "queryGranularity": {
          "type": "period",
          "period": "P1D",
          "timeZone": "America/New_York"
        },
        "rollup": true,
        "intervals": [
          "2020-05-12T00:00:00-04:00/2020-05-13T00:00:00-04:00"
        ]
      },

I am rolling up into daily buckets, but offset by the timezone. The granularity is big, so the rollup is more efficient. The event data that I am storing in Druid occurs in the EST/EDT timezone. When I query Druid to see how many events happened on March 12th, I want to see events from March 12th EDT, not March 12th UTC.

FrankChen021 commented 4 years ago

@tarpdalton I see, there are some benefits to setting a timezone on segment granularity. Each segment starts at 00:00 EDT instead of 00:00 UTC, so a query for one local day falls exactly into one segment. If segments instead started at 00:00 UTC, data for a local day could be spread across two segments.
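
A quick Joda-Time sketch of that point (the dates are borrowed from the spec above; the segment boundaries are illustrative):

    import org.joda.time.DateTime;
    import org.joda.time.Interval;

    public class LocalDayVsUtcSegments
    {
      public static void main(String[] args)
      {
        // The local day being queried: 2020-05-12 in New York (UTC-4).
        Interval localDay = new Interval(
            DateTime.parse("2020-05-12T00:00:00-04:00"),
            DateTime.parse("2020-05-13T00:00:00-04:00"));

        // Segments cut at UTC midnight instead.
        Interval utcDay1 = new Interval(
            DateTime.parse("2020-05-12T00:00:00Z"),
            DateTime.parse("2020-05-13T00:00:00Z"));
        Interval utcDay2 = new Interval(
            DateTime.parse("2020-05-13T00:00:00Z"),
            DateTime.parse("2020-05-14T00:00:00Z"));

        // The local day overlaps both UTC-aligned segments, so the query
        // has to touch two segments instead of one.
        System.out.println(localDay.overlaps(utcDay1)); // true
        System.out.println(localDay.overlaps(utcDay2)); // true
      }
    }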

harshmohta commented 2 years ago

We are also facing a similar issue on Druid 0.20.2. It started when we moved from EC2 r5.4xlarge instances to i3en.4xlarge instances.

      "tuningConfig": {
        "type": "index_parallel",
        "splitHintSpec": {
          "type": "maxSize",
          "maxNumFiles": 2
        },
        "partitionsSpec": {
          "type": "hashed",
          "numShards": 35
        },
        "forceGuaranteedRollup": true,
        "totalNumMergeTasks": 100,
        "maxNumSegmentsToMerge": 100,
        "maxNumConcurrentSubTasks": 500,
        "maxRowsInMemory": 3000000,
        "maxPendingPersists": 1,
        "useCombiner": true,
        "forceExtendableShardSpecs": true,
        "indexSpec": {
          "bitmap": {
            "type": "roaring"
          },
          "dimensionCompression": "lz4",
          "metricCompression": "lz4"
        }
      }

2022-07-05T18:36:46,245 ERROR [[partial_index_generic_merge_datasource_mngkoeoe_2022-07-05T18:36:44.742Z]-threading-task-runner-executor-3] org.apache.druid.indexing.overlord.ThreadingTaskRunner - Exception caught while running the task.
java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_282]
    at java.util.zip.ZipFile.<init>(ZipFile.java:225) ~[?:1.8.0_282]
    at java.util.zip.ZipFile.<init>(ZipFile.java:155) ~[?:1.8.0_282]
    at java.util.zip.ZipFile.<init>(ZipFile.java:169) ~[?:1.8.0_282]
    at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:250) ~[druid-core-0.20.1.jar:0.20.1]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.fetchSegmentFiles(PartialSegmentMergeTask.java:220) ~[druid-indexing-service-0.20.1.jar:0.20.1]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.runTask(PartialSegmentMergeTask.java:158) ~[druid-indexing-service-0.20.1.jar:0.20.1]
    at org.apache.druid.indexing.common.task.batch.parallel.PartialGenericSegmentMergeTask.runTask(PartialGenericSegmentMergeTask.java:41) ~[druid-indexing-service-0.20.1.jar:0.20.1]
    at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:140) ~[druid-indexing-service-0.20.1.jar:0.20.1]
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:211) [druid-indexing-service-0.20.1.jar:0.20.1]
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:151) [druid-indexing-service-0.20.1.jar:0.20.1]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]

2022-07-05T18:36:46,246 ERROR [threading-task-runner-executor-3] org.apache.druid.segment.realtime.appenderator.UnifiedIndexerAppenderatorsManager - Could not find datasource bundle for [datasource], task [partial_index_generic_merge_datasource_mngkoeoe_2022-07-05T18:36:44.742Z]