Open · tarpdalton opened this issue 4 years ago
@tarpdalton thank you for the detailed report! I don't have a concrete idea to fix the bug right now, but will take a look.
@jihoonson I don't understand the meaning of setting `timeZone` and `origin` for `segmentGranularity`, and I don't see any documentation about this. There is another `segmentGranularity` setting problem: #9894.
@FrankChen021 it's documented here. #9894 is about `duration` segment granularity and doesn't seem related to this issue.
The doc is about query granularity. Although segment granularity shares the same type as query granularity, the doc does not explain why people need to care about the timezone/origin of segment granularity. I don't see any benefit from these two parameters for segment granularity.
Yes, the doc should say it can be used for segment granularity as well. However, it is at least linked from https://druid.apache.org/docs/latest/ingestion/index.html#granularityspec.
> The doc does not explain why people need to care about the timezone/origin of segment granularity. I don't see any benefit from these two parameters for segment granularity.
I'm not sure what you are suggesting. The timezone is useful when you have timestamps in a different timezone from the one where your Druid cluster is running. The origin is useful when you want to align time buckets differently.
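For intuition on `origin`, here is a minimal sketch in plain Python (not Druid code; the helper name is made up for illustration) of how aligning buckets to an origin shifts their boundaries:

```python
from datetime import datetime, timedelta

def bucket_start(ts: datetime, period: timedelta, origin: datetime) -> datetime:
    """Floor `ts` to the start of its bucket, where buckets of length
    `period` are aligned to `origin` rather than to midnight/top-of-hour.
    This is only a sketch of the concept, not Druid's implementation."""
    offset = (ts - origin) % period  # timedelta % timedelta is always non-negative
    return ts - offset

# With an origin at :30 past the hour, hourly buckets run :30 to :30.
origin = datetime(2020, 1, 1, 0, 30)
ts = datetime(2020, 5, 12, 10, 15)
print(bucket_start(ts, timedelta(hours=1), origin))  # 2020-05-12 09:30:00
```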
I'll share my use case for segment granularity. Here is my granularity spec for loading some data:
"granularitySpec": {
"segmentGranularity": {
"type": "period",
"period": "P1D",
"timeZone": "America/New_York"
},
"queryGranularity": {
"type": "period",
"period": "P1D",
"timeZone": "America/New_York"
},
"rollup": true,
"intervals": [
"2020-05-12T00:00:00-04:00/2020-05-13T00:00:00-04:00"
]
},
I am rolling up into daily buckets, but offset by the timezone. The granularity is big, so the rollup is more efficient. The event data that I am storing in Druid occurs in the EST/EDT timezone. When I query Druid to see how many events happened on March 12th, I want to see events from March 12th EDT, not March 12th UTC.
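To see what that spec means in UTC terms, here is a small sketch using Python's `zoneinfo` (not Druid code): a `P1D` bucket in `America/New_York` during EDT covers 04:00Z to 04:00Z of the next day, matching the `intervals` value above.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

ny = ZoneInfo("America/New_York")
utc = ZoneInfo("UTC")

# Local midnight boundaries of the day being loaded (EDT, UTC-4 in May).
start_local = datetime(2020, 5, 12, tzinfo=ny)
end_local = start_local + timedelta(days=1)

# The same daily bucket expressed in UTC is shifted by the -04:00 offset.
print(start_local.astimezone(utc).isoformat())  # 2020-05-12T04:00:00+00:00
print(end_local.astimezone(utc).isoformat())    # 2020-05-13T04:00:00+00:00
```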
@tarpdalton I see there are some benefits to setting a timezone for segment granularity. Each segment starts at 00:00 EDT instead of 00:00 UTC, so a query for data within a local day falls exactly into one segment. But if the segment starts at 00:00 UTC, data for a local day may be spread across two segments.
We are also facing a similar issue. We are using Druid 0.20.2. The issue started after we moved from ec2 r5.4x instances to i3en.4x instances.
` "tuningConfig": { "type": "index_parallel", "splitHintSpec": { "type": "maxSize", "maxNumFiles": 2 }, "partitionsSpec" : { "type" : "hashed", "numShards": 35 },
"forceGuaranteedRollup": true,
"totalNumMergeTasks": 100,
"maxNumSegmentsToMerge": 100,
"maxNumConcurrentSubTasks": 500,
"maxRowsInMemory": 3000000,
"maxPendingPersists": 1,
"useCombiner" : true,
"forceExtendableShardSpecs" : true,
"indexSpec": {
"bitmap": {
"type": "roaring"
},
"dimensionCompression": "lz4",
"metricCompression": "lz4"
}
}`
```
2022-07-05T18:36:46,245 ERROR [[partial_index_generic_merge_datasource_mngkoeoe_2022-07-05T18:36:44.742Z]-threading-task-runner-executor-3] org.apache.druid.indexing.overlord.ThreadingTaskRunner - Exception caught while running the task.
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_282]
	at java.util.zip.ZipFile.<init>(ZipFile.java:225) ~[?:1.8.0_282]
	at java.util.zip.ZipFile.<init>(ZipFile.java:155) ~[?:1.8.0_282]
	at java.util.zip.ZipFile.<init>(ZipFile.java:169) ~[?:1.8.0_282]
	at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:250) ~[druid-core-0.20.1.jar:0.20.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.fetchSegmentFiles(PartialSegmentMergeTask.java:220) ~[druid-indexing-service-0.20.1.jar:0.20.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.runTask(PartialSegmentMergeTask.java:158) ~[druid-indexing-service-0.20.1.jar:0.20.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialGenericSegmentMergeTask.runTask(PartialGenericSegmentMergeTask.java:41) ~[druid-indexing-service-0.20.1.jar:0.20.1]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:140) ~[druid-indexing-service-0.20.1.jar:0.20.1]
	at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:211) [druid-indexing-service-0.20.1.jar:0.20.1]
	at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:151) [druid-indexing-service-0.20.1.jar:0.20.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2022-07-05T18:36:46,246 ERROR [threading-task-runner-executor-3] org.apache.druid.segment.realtime.appenderator.UnifiedIndexerAppenderatorsManager - Could not find datasource bundle for [datasource], task [partial_index_generic_merge_datasource_mngkoeoe_2022-07-05T18:36:44.742Z]
```
Affected Version
0.18.0 and 0.18.1
Description
Cluster size
Steps to reproduce the problem
- An `index_parallel` task
- `timeZone` in the `segmentGranularity` in the `granularitySpec` in the `dataSchema`
- `maxNumConcurrentSubTasks` greater than `1` in the `tuningConfig`
- `type` as `hashed` for `partitionsSpec` in `tuningConfig`
The error message or stack traces encountered.
The main error is the `ZipException`. The unzip fails because `findPartitionFile` fails to find the partition created during the `partial_index_generate` task: `getPartition` returns the error message instead of the zip file, so the unzip fails. The partition file is stored with the timezone offset in the path, like this:

`2020-04-24T00:00:00.000-04:00/2020-04-25T00:00:00.000-04:00`

But the HTTP request to `getPartition` uses the UTC time: `startTime=2020-04-24T04:00:00.000Z&endTime=2020-04-25T04:00:00.000Z`
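To make the mismatch concrete, here is a small illustrative sketch in Python (not Druid code): a lookup keyed on the interval *string* fails, even though both spellings name the same two instants.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

utc = ZoneInfo("UTC")

# Path segment written by the generate task (formatted with the local offset).
stored = "2020-04-24T00:00:00.000-04:00/2020-04-25T00:00:00.000-04:00"

# Interval sent by the merge task's HTTP request (formatted in UTC).
requested = "2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z"

# As strings they differ, so a path lookup keyed on the string misses...
print(stored == requested)  # False

def parse(s: str) -> datetime:
    # fromisoformat doesn't accept a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

# ...even though, compared as instants, the two intervals are identical.
same_instants = all(
    parse(a).astimezone(utc) == parse(b).astimezone(utc)
    for a, b in zip(stored.split("/"), requested.split("/"))
)
print(same_instants)  # True
```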
Any debugging that you have already done
I'm not very familiar with the Druid code, so I'm not sure if there is a simple code fix. @jihoonson might know how to fix it, since he is working on https://github.com/apache/druid/issues/8061.
It looks like the `startTime` and `endTime` param args are from ... Maybe you could store the `interval` with the tz offset instead of the materialized UTC time?
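As a sketch of that suggestion (plain Python with a hypothetical helper, not the actual Java fix, which would live in Druid's merge-task code): if the fetch request formats the interval in its original offset instead of converting to UTC, the request string matches the stored partition path.

```python
from datetime import datetime, timedelta, timezone

# Fixed -04:00 offset, matching the stored path in this report (EDT).
edt = timezone(timedelta(hours=-4))

start = datetime(2020, 4, 24, 0, 0, tzinfo=edt)
end = datetime(2020, 4, 25, 0, 0, tzinfo=edt)

def iso_millis(dt: datetime) -> str:
    """Format like the stored path: 2020-04-24T00:00:00.000-04:00
    (milliseconds, colon in the offset). Hypothetical helper."""
    base = dt.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]   # trim to milliseconds
    off = dt.strftime("%z")                            # e.g. "-0400"
    return base + off[:3] + ":" + off[3:]              # -> "-04:00"

# Request built from the offset-preserving interval matches the stored path.
query = f"startTime={iso_millis(start)}&endTime={iso_millis(end)}"
print(query)
# startTime=2020-04-24T00:00:00.000-04:00&endTime=2020-04-25T00:00:00.000-04:00
```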