apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.4k stars 3.68k forks source link

Compaction task failed, how can I troubleshoot? #13385

Open kaojunsong opened 1 year ago

kaojunsong commented 1 year ago

Hi experts,

I am using druid 0.22.1, and I want to compact some small segments, so I submit a compaction task:

{
    "type": "compact",
    "dataSource": "autotest",
    "ioConfig": {
        "type": "compact",
        "inputSpec": {
            "type": "interval",
            "interval": "1970-01-01T00:00:00Z/2023-11-18T07:14:34.000Z"
        },
        "dropExisting": true
    },
    "granularitySpec": {
        "segmentGranularity": "MINUTE",
        "queryGranularity": "MINUTE"
    }
}

But the compaction task failed with the following status information

{
  "id": "compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z",
  "groupId": "compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z",
  "type": "compact",
  "createdTime": "2022-11-18T08:41:49.908Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 27628,
  "location": {
    "host": "172.24.131.32",
    "port": -1,
    "tlsPort": 8100
  },
  "dataSource": "autotest",
  "errorMsg": "Ran [1] specs, [0] succeeded, [1] failed"
}

And the logs:

{"instant":{"epochSecond":1668760935,"nanoOfSecond":19000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.storage.s3.S3DataSegmentPuller","message":"Pulling index at path[CloudObjectLocation{bucket='cerberus-zontal-ingest', path='druid/segments/autotest/2022-11-18T07:12:33.000Z_2022-11-18T07:12:34.000Z/2022-11-18T07:12:34.855Z/0/index.zip'}] to outDir[var/druid/task/compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z/work/autotest/2022-11-18T07:12:33.000Z_2022-11-18T07:12:34.000Z/2022-11-18T07:12:34.855Z/0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760935,"nanoOfSecond":217000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.storage.s3.S3DataSegmentPuller","message":"Loaded 370186 bytes from [CloudObjectLocation{bucket='cerberus-zontal-ingest', path='druid/segments/autotest/2022-11-18T07:12:33.000Z_2022-11-18T07:12:34.000Z/2022-11-18T07:12:34.855Z/0/index.zip'}] to [/opt/druid/var/druid/task/compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z/work/autotest/2022-11-18T07:12:33.000Z_2022-11-18T07:12:34.000Z/2022-11-18T07:12:34.855Z/0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760935,"nanoOfSecond":218000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.storage.s3.S3DataSegmentPuller","message":"Pulling index at path[CloudObjectLocation{bucket='cerberus-zontal-ingest', path='druid/segments/autotest/2022-11-18T07:12:41.000Z_2022-11-18T07:12:42.000Z/2022-11-18T07:12:42.687Z/0/index.zip'}] to outDir[var/druid/task/compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z/work/autotest/2022-11-18T07:12:41.000Z_2022-11-18T07:12:42.000Z/2022-11-18T07:12:42.687Z/0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760935,"nanoOfSecond":318000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.storage.s3.S3DataSegmentPuller","message":"Loaded 370186 bytes from [CloudObjectLocation{bucket='cerberus-zontal-ingest', path='druid/segments/autotest/2022-11-18T07:12:41.000Z_2022-11-18T07:12:42.000Z/2022-11-18T07:12:42.687Z/0/index.zip'}] to [/opt/druid/var/druid/task/compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z/work/autotest/2022-11-18T07:12:41.000Z_2022-11-18T07:12:42.000Z/2022-11-18T07:12:42.687Z/0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760935,"nanoOfSecond":319000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.storage.s3.S3DataSegmentPuller","message":"Pulling index at path[CloudObjectLocation{bucket='cerberus-zontal-ingest', path='druid/segments/autotest/2022-11-18T07:13:34.000Z_2022-11-18T07:13:35.000Z/2022-11-18T07:13:36.349Z/0/index.zip'}] to outDir[var/druid/task/compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z/work/autotest/2022-11-18T07:13:34.000Z_2022-11-18T07:13:35.000Z/2022-11-18T07:13:36.349Z/0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760935,"nanoOfSecond":414000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.storage.s3.S3DataSegmentPuller","message":"Loaded 370186 bytes from [CloudObjectLocation{bucket='cerberus-zontal-ingest', path='druid/segments/autotest/2022-11-18T07:13:34.000Z_2022-11-18T07:13:35.000Z/2022-11-18T07:13:36.349Z/0/index.zip'}] to [/opt/druid/var/druid/task/compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z/work/autotest/2022-11-18T07:13:34.000Z_2022-11-18T07:13:35.000Z/2022-11-18T07:13:36.349Z/0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760935,"nanoOfSecond":816000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.indexing.common.task.CompactionTask","message":"Generate compaction task spec with new query granularity overrided from input [{type=period, period=PT1M, timeZone=UTC, origin=null}]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":502000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.indexing.common.task.CompactionTask","message":"Generated [1] compaction task specs","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":617000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.indexing.common.task.AbstractBatchIndexTask","message":"[forceTimeChunkLock] is set to true in task context. Use timeChunk lock","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":703000000},"thread":"task-runner-0-priority-0","level":"WARN","loggerName":"org.apache.druid.indexing.common.task.CompactionTask","message":"indexSpec is not ready: [{\n  \"type\" : \"index_parallel\",\n  \"id\" : \"compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z\",\n  \"groupId\" : \"compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z\",\n  \"resource\" : {\n    \"availabilityGroup\" : \"compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z\",\n    \"requiredCapacity\" : 1\n  },\n  \"spec\" : {\n    \"dataSchema\" : {\n      \"dataSource\" : \"autotest\",\n      \"timestampSpec\" : {\n        \"column\" : \"__time\",\n        \"format\" : \"millis\",\n        \"missingValue\" : null\n      },\n      \"dimensionsSpec\" : {\n        \"dimensions\" : [ {\n          \"type\" : \"string\",\n          \"name\" : \"content_identifier\",\n          \"multiValueHandling\" : \"SORTED_ARRAY\",\n          \"createBitmapIndex\" : true\n        }, {\n          \"type\" : \"string\",\n          \"name\" : \"information_package_iri\",\n          \"multiValueHandling\" : \"SORTED_ARRAY\",\n          \"createBitmapIndex\" : true\n        }, {\n          \"type\" : \"string\",\n          \"name\" : \"iri\",\n          \"multiValueHandling\" : \"SORTED_ARRAY\",\n          \"createBitmapIndex\" : true\n        }, {\n          \"type\" : \"long\",\n          \"name\" : \"x1_value_int\",\n          \"multiValueHandling\" : \"SORTED_ARRAY\",\n          \"createBitmapIndex\" : false\n        }, {\n          \"type\" : \"long\",\n          \"name\" : \"y1_value_int\",\n          \"multiValueHandling\" : \"SORTED_ARRAY\",\n          \"createBitmapIndex\" : false\n        } ],\n        \"dimensionExclusions\" : [ \"__time\" ]\n      },\n      \"metricsSpec\" : [ ],\n      \"granularitySpec\" : {\n        \"type\" : \"uniform\",\n        \"segmentGranularity\" : \"MINUTE\",\n        \"queryGranularity\" : \"MINUTE\",\n        \"rollup\" : false,\n        \"intervals\" : [ \"2022-11-17T07:51:00.000Z/2022-11-18T07:13:35.000Z\" ]\n      },\n      \"transformSpec\" : {\n        \"filter\" : null,\n        \"transforms\" : [ ]\n      }\n    },\n    \"ioConfig\" : {\n      \"type\" : \"index_parallel\",\n      \"inputSource\" : {\n        \"type\" : \"druid\",\n        \"dataSource\" : \"autotest\",\n        \"interval\" : \"1970-01-01T00:00:00.000Z/2022-11-18T07:14:34.000Z\"\n      },\n      \"inputFormat\" : null,\n      \"appendToExisting\" : false,\n      \"dropExisting\" : true\n    },\n    \"tuningConfig\" : {\n      \"type\" : \"compaction\",\n      \"maxRowsPerSegment\" : 5000000,\n      \"appendableIndexSpec\" : {\n        \"type\" : \"onheap\"\n      },\n      \"maxRowsInMemory\" : 1000000,\n      \"maxBytesInMemory\" : 0,\n      \"skipBytesInMemoryOverheadCheck\" : false,\n      \"maxTotalRows\" : 9223372036854775807,\n      \"numShards\" : null,\n      \"splitHintSpec\" : null,\n      \"partitionsSpec\" : {\n        \"type\" : \"dynamic\",\n        \"maxRowsPerSegment\" : 5000000,\n        \"maxTotalRows\" : 9223372036854775807\n      },\n      \"indexSpec\" : {\n        \"bitmap\" : {\n          \"type\" : \"roaring\",\n          \"compressRunOnSerialization\" : true\n        },\n        \"dimensionCompression\" : \"lz4\",\n        \"metricCompression\" : \"lz4\",\n        \"longEncoding\" : \"longs\",\n        \"segmentLoader\" : null\n      },\n      \"indexSpecForIntermediatePersists\" : {\n        \"bitmap\" : {\n          \"type\" : \"roaring\",\n          \"compressRunOnSerialization\" : true\n        },\n        \"dimensionCompression\" : \"lz4\",\n        \"metricCompression\" : \"lz4\",\n        \"longEncoding\" : \"longs\",\n        \"segmentLoader\" : null\n      },\n      \"maxPendingPersists\" : 0,\n      \"forceGuaranteedRollup\" : false,\n      \"reportParseExceptions\" : false,\n      \"pushTimeout\" : 0,\n      \"segmentWriteOutMediumFactory\" : null,\n      \"maxNumConcurrentSubTasks\" : 1,\n      \"maxRetry\" : 3,\n      \"taskStatusCheckPeriodMs\" : 1000,\n      \"chatHandlerTimeout\" : \"PT10S\",\n      \"chatHandlerNumRetries\" : 5,\n      \"maxNumSegmentsToMerge\" : 100,\n      \"totalNumMergeTasks\" : 10,\n      \"logParseExceptions\" : false,\n      \"maxParseExceptions\" : 2147483647,\n      \"maxSavedParseExceptions\" : 0,\n      \"maxColumnsToMerge\" : -1,\n      \"awaitSegmentAvailabilityTimeoutMillis\" : 0,\n      \"partitionDimensions\" : [ ]\n    }\n  },\n  \"context\" : {\n    \"useLineageBasedSegmentAllocation\" : true,\n    \"appenderatorTrackingTaskId\" : \"compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z\",\n    \"storeCompactionState\" : true,\n    \"forceTimeChunkLock\" : true,\n    \"priority\" : 25\n  },\n  \"dataSource\" : \"autotest\"\n}].\nTrying the next indexSpec.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":703000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.indexing.common.task.CompactionTask","message":"Ran [1] specs, [0] succeeded, [1] failed","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":705000000},"thread":"task-runner-0-priority-0","level":"INFO","loggerName":"org.apache.druid.indexing.worker.executor.ExecutorLifecycle","message":"Task completed with status: {\n  \"id\" : \"compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z\",\n  \"status\" : \"FAILED\",\n  \"duration\" : 6798,\n  \"errorMsg\" : \"Ran [1] specs, [0] succeeded, [1] failed\",\n  \"location\" : {\n    \"host\" : null,\n    \"port\" : -1,\n    \"tlsPort\" : -1\n  }\n}","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":106,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":710000000},"thread":"main","level":"INFO","loggerName":"org.apache.druid.java.util.common.lifecycle.Lifecycle","message":"Stopping lifecycle [module] stage [ANNOUNCEMENTS]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":711000000},"thread":"main","level":"INFO","loggerName":"org.apache.druid.java.util.common.lifecycle.Lifecycle","message":"Stopping lifecycle [module] stage [SERVER]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":716000000},"thread":"main","level":"INFO","loggerName":"org.eclipse.jetty.server.AbstractConnector","message":"Stopped ServerConnector@58bad46f{SSL, (ssl, http/1.1)}{0.0.0.0:8100}","endOfBatch":false,"loggerFqcn":"org.eclipse.jetty.util.log.Slf4jLog","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":717000000},"thread":"main","level":"INFO","loggerName":"org.eclipse.jetty.server.session","message":"node0 Stopped scavenging","endOfBatch":false,"loggerFqcn":"org.eclipse.jetty.util.log.Slf4jLog","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":802000000},"thread":"main","level":"INFO","loggerName":"org.eclipse.jetty.server.handler.ContextHandler","message":"Stopped o.e.j.s.ServletContextHandler@7c4ca87c{/,null,STOPPED}","endOfBatch":false,"loggerFqcn":"org.eclipse.jetty.util.log.Slf4jLog","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":805000000},"thread":"main","level":"INFO","loggerName":"org.apache.druid.java.util.common.lifecycle.Lifecycle","message":"Stopping lifecycle [module] stage [NORMAL]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":806000000},"thread":"main","level":"INFO","loggerName":"org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner","message":"Starting graceful shutdown of task[compact_autotest_ddbohokd_2022-11-18T08:41:49.906Z].","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":816000000},"thread":"LookupExtractorFactoryContainerProvider-MainThread","level":"INFO","loggerName":"org.apache.druid.query.lookup.LookupReferencesManager","message":"Lookup Management loop exited. Lookup notices are not handled anymore.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":103,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":816000000},"thread":"main","level":"INFO","loggerName":"org.apache.druid.query.lookup.LookupReferencesManager","message":"Closed lookup [deleted_information_packages].","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760936,"nanoOfSecond":821000000},"thread":"Curator-Framework-0","level":"INFO","loggerName":"org.apache.curator.framework.imps.CuratorFrameworkImpl","message":"backgroundOperationsLoop exiting","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":99,"threadPriority":5}
{"instant":{"epochSecond":1668760937,"nanoOfSecond":18000000},"thread":"main","level":"INFO","loggerName":"org.apache.zookeeper.ZooKeeper","message":"Session: 0x1001892b7e9010b closed","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1668760937,"nanoOfSecond":18000000},"thread":"main-EventThread","level":"INFO","loggerName":"org.apache.zookeeper.ClientCnxn","message":"EventThread shut down for session: 0x1001892b7e9010b","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":98,"threadPriority":5}
{"instant":{"epochSecond":1668760937,"nanoOfSecond":105000000},"thread":"main","level":"INFO","loggerName":"org.apache.druid.java.util.common.lifecycle.Lifecycle","message":"Stopping lifecycle [module] stage [INIT]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}
Finished peon task

I notice there is a "indexSpec is not ready:" warning in the log, but I don't know what does this mean. FYI: The datasource is "Fully available" Can anyone help me? Or how can I troubleshout issues like this? Thanks a lot in advance.

kaojunsong commented 1 year ago

I searched in the Slack, and found that someone suggest to use

   "context": {
        "forceTimeChunkLock": false
    }

And the compaction task works after I added this setting. But I don't know if there is any side effect if I add this setting.

gianm commented 1 year ago

The indexSpec is not ready warning happens if the task is unable to acquire a lock on its interval. The interval is 1970-01-01T00:00:00Z/2023-11-18T07:14:34.000Z (from your compact task spec) so it would be trying to get an exclusive lock for that interval. Is it possible some other task was running at the time? If so, that would cause the behavior you saw. You can retry later when no other tasks are running.

And the compaction task works after I added this setting. But I don't know if there is any side effect if I add this setting.

forceTimeChunkLock: false is an experimental setting which you can read more about here: https://druid.apache.org/docs/latest/ingestion/tasks.html#locking. It makes the locking more fine-grained, so avoids the lock contention issue that you are probably hitting. It is experimental, though, so I'd only recommend it if you are comfortable with a feature that is not currently viewed as production-ready.

gianm commented 1 year ago

An update — I believe this is the same bug as described here: https://github.com/apache/druid/issues/13540#issuecomment-1344899192.