apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!]. #8605

Open qianmoQ opened 4 years ago

qianmoQ commented 4 years ago

Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].

Affected Version

>=0.15.1-incubating

Description

2019-09-28T12:17:34,157 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:753) ~[druid-indexing-service-0.15.1-incubating.jar:0.15.1-incubating]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:246) [druid-indexing-service-0.15.1-incubating.jar:0.15.1-incubating]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:167) [druid-indexing-service-0.15.1-incubating.jar:0.15.1-incubating]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.15.1-incubating.jar:0.15.1-incubating]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.15.1-incubating.jar:0.15.1-incubating]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
    at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:602) ~[druid-server-0.15.1-incubating.jar:0.15.1-incubating]
    ... 4 more
vogievetsky commented 4 years ago

What ingestion method are you using? Kafka or Kinesis?

qianmoQ commented 4 years ago

Kafka indexing service

Stephan3555 commented 4 years ago

Running into the same problem lately. Have you found the reason?

qianmoQ commented 4 years ago

Running into the same problem lately. Have you found the reason?

I'm not using the Apache version anymore; I'm using the io.druid version now, and that's no longer a problem.

michaelsidharta commented 4 years ago

Is there any way to reproduce this? I encountered this issue but have been unable to reproduce it. What I found out later was that my datasource ended up unused (unable to persist to deep storage). Using the Kafka indexing service, Druid version 0.15.1.

org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
    at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:602) ~[druid-server-0.15.1-incubating.jar:0.15.1-incubating]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
2019-11-08T06:56:35,879 ERROR [publish-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Error while publishing segments for sequenceNumber[SequenceMetadata{sequenceId=0, sequenceName='index_kafka_shop_stats_product_view_bef122043a00004_0', assignments=[], startOffsets={0=1015942}, exclusiveStartPartitions=[], endOffsets={0=1025942}, sentinel=false, checkpointed=true}]
org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
    at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:602) ~[druid-server-0.15.1-incubating.jar:0.15.1-incubating]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
qianmoQ commented 4 years ago

Is there any way to reproduce this? I encountered this issue but have been unable to reproduce it. What I found out later was that my datasource ended up unused (unable to persist to deep storage). Using the Kafka indexing service, Druid version 0.15.1.

org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
  at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:602) ~[druid-server-0.15.1-incubating.jar:0.15.1-incubating]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
2019-11-08T06:56:35,879 ERROR [publish-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Error while publishing segments for sequenceNumber[SequenceMetadata{sequenceId=0, sequenceName='index_kafka_shop_stats_product_view_bef122043a00004_0', assignments=[], startOffsets={0=1015942}, exclusiveStartPartitions=[], endOffsets={0=1025942}, sentinel=false, checkpointed=true}]
org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
  at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:602) ~[druid-server-0.15.1-incubating.jar:0.15.1-incubating]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_151]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

This is a bug: the Kafka indexing service generates multiple tasks. However, when the tasks were merged, multiple segments in the same group failed to be merged, and all the data for the current segment was lost.

I recommend using the non-Apache version; the Apache releases move too fast and the functionality is not yet mature.

Thank you!

mggger commented 4 years ago

@qianmoQ

I've hit the same problem. Where can I find the io.druid version? Thanks.

qianmoQ commented 4 years ago

@qianmoQ

I've hit the same problem. Where can I find the io.druid version? Thanks.

Which version do you need to use?

jihoonson commented 4 years ago

Hi @qianmoQ, sorry for the late response. If this is a real bug, it seems pretty critical and we should fix it.

This is a bug: the Kafka indexing service generates multiple tasks. However, when the tasks were merged, multiple segments in the same group failed to be merged, and all the data for the current segment was lost.

Since I've never seen this error, would you please give more details about what you found? Why and where did the segment merge fail?

@qianmoQ @mggger @Stephan3555 the Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!] error just means that the overlord failed to publish segments into the metadata store for some reason. The overlord should have logged the details of the failure. Do you still have the overlord logs? If so, do you see anything interesting?
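
For example, a quick way to pull the relevant lines out of the overlord log (a sketch; the log path is only an example, adjust it to your deployment):

# the transaction-failure metric and the metadata-mismatch message logged around the failed publish
grep -n "segment/txn/failure" /var/log/druid/overlord.log
grep -n "Not updating metadata" /var/log/druid/overlord.log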

jihoonson commented 4 years ago

BTW, even if you saw this error, the new task will consume the same data from Kafka and create the same set of segments. So as long as your Kafka retention is long enough, you should be able to ingest all data.
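
If the topic's retention is too short, it can be checked and raised with the standard Kafka tooling, for example (a sketch; the broker address and topic name are placeholders, and older Kafka versions use --zookeeper instead of --bootstrap-server):

# show any retention override currently set on the topic
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my_topic --describe
# raise retention to 24 hours so a replacement task can re-read the data
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my_topic --alter --add-config retention.ms=86400000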

jon-wei commented 4 years ago

However, when the tasks were merged, multiple segments in the same group failed to be merged, and all the data for the current segment was lost.

@qianmoQ Thanks for the report, adding to what @jihoonson said, can you provide more details on the impact when you saw this happening? Did later tasks pick up the data that the failed task had ingested, or was there some data that was never ingested?

qianmoQ commented 4 years ago

However, when the tasks were merged, multiple segments in the same group failed to be merged, and all the data for the current segment was lost.

@qianmoQ Thanks for the report, adding to what @jihoonson said, can you provide more details on the impact when you saw this happening? Did later tasks pick up the data that the failed task had ingested, or was there some data that was never ingested?

The data was ingested, but the ingestion task failed. The log only shows these errors, and the data ingested by the task was not merged into segments, resulting in the loss of all task data for the current period.

jon-wei commented 4 years ago

resulting in the loss of all task data for the current period

Is the data eventually reingested by a later task?

qianmoQ commented 4 years ago

Is the data eventually reingested by a later task?

No, the task has remained in the FAILED state. When I logged on to the server, I found the task process was still running and busy with GC, so Druid could not kill it; I could only kill it manually.

jon-wei commented 4 years ago

@qianmoQ

The ingestion system is designed to be resilient to individual task failures (subsequent tasks would pick up from the latest committed offsets and start reading from there, so a task failing shouldn't lead to permanent data loss).

If the data that was not ingested by the failed task is picked up by subsequent tasks, then the system is still operating as expected. Is that consistent with your observations?

For this specific error:

Caused by: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].

This indicates that the partition/offsets associated with the segments that the task attempted to publish were inconsistent with the checkpoints stored in metadata (e.g., another task had published data for some of the same partitions while the task that failed was running), so the publish is aborted to prevent data inconsistency issues.

The error message is unfortunately not very informative from the task log side and we plan to improve that in the future, but for now to see the full details about the error it's necessary to check the overlord logs.

That kind of situation should not happen too often; how often did you observe that kind of task failure?

If you can provide overlord log contents showing the transaction failure and the surrounding time window, and the full associated task logs, that would be very helpful for figuring out what you're seeing.

qianmoQ commented 4 years ago

Thank you very much. I could not find anything in the overlord logs. The Kafka indexing service task was not stopped, but the current task's data could not be found, and I did not find any related logs from other services!

fastmind2014 commented 4 years ago

I encountered this problem too (in coordinator-overlord.log). As a newcomer to Druid, I have no idea how to solve it. The data in Kafka may be ingested again when the next task runs. I can query the old records before the segment fails, but after the failure those records are gone.

qianmoQ commented 4 years ago

I encountered this problem too (in coordinator-overlord.log). As a newcomer to Druid, I have no idea how to solve it. The data in Kafka may be ingested again when the next task runs. I can query the old records before the segment fails, but after the failure those records are gone.

The reason for this problem is that the node running the current task is short of memory/CPU resources, and Druid is unable to release completed tasks, which remain resident in memory, resulting in insufficient resources. You can only release these physical resources manually; you can refer to the following script for resource monitoring and release:

#!/bin/bash

# Find PIDs of the peon (task) JVMs forked by the middleManager
DRUID_RUNNING_TASKS_PIDS=`ps -ef f | grep '\_ java -cp conf/druid/_common:conf/druid/middleManager:lib' | grep -v grep | awk '{print $2}'`

CURRENT_TIMESTAMP=`date +%s`

for pid in $DRUID_RUNNING_TASKS_PIDS
do
    # Process start time, converted to a Unix timestamp
    CURRENT_START_TIME=`ps -p $pid -o lstart | tail -1`
    TEMP=`date -d "$CURRENT_START_TIME" +%s`
    TIME_DIFF=$(($CURRENT_TIMESTAMP - $TEMP))
    # Kill task processes that have been running longer than 3600 seconds
    if [[ $TIME_DIFF -gt 3600 ]]; then
       echo "current PID $pid, start time $CURRENT_START_TIME, timestamp $TEMP"
       kill -9 $pid
    fi
done

Change the 3600 in the script to something longer than your task duration; to keep the data service safe, try doubling the task duration. Add the monitoring script to the system crontab, for example:

*/5 * * * * /bin/sh /hadoop/data1/druid-0.12.3/druid-task-monitor.sh

Mine checks every 5 minutes to free up resources.

Stephan3555 commented 4 years ago

@jihoonson

I don't have the logs anymore, but the error occurs regularly now. (I will post the logs as soon as it appears again.)

Maybe it's not related, but we figured out that our Postgres database has a very large number of inactive tasks.

image

It seems that the old tasks are no longer getting cleaned from the database (I think this still worked in our Druid versions before 0.16.1).

The result of this is that the Druid UI stopped responding for the "Tasks" tab, or whenever the UI fetched something from the database, and the overlord quit with a heap space error every time the "Tasks" tab was opened:

image

Maybe this is related to the Transaction aborts as well.

I even granted the overlord 5 GB of heap (before it was 1 GB), but it did not help. The only thing that helped was deleting all of the inactive tasks from the database.
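
For reference, the cleanup can be done with plain SQL against the metadata store, roughly like this (a sketch assuming a Postgres metadata store and the default druid_tasks table; verify the schema and back up the table before deleting anything):

# count completed (inactive) task rows
psql -U druid -d druid -c "SELECT COUNT(*) FROM druid_tasks WHERE active = false;"
# delete inactive task rows created before a cutoff date
psql -U druid -d druid -c "DELETE FROM druid_tasks WHERE active = false AND created_date < '2020-01-01';"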

terry19850829 commented 4 years ago

I have the same issue with version 0.16.0-incubating.

overlord logs:

....
2020-03-18 08:22:46,796 INFO [KafkaSupervisor-crm_bd_capacity_ds] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='crm_bd_capacity_ds', generationTime=2020-03-18T00:22:46.796Z, payload=KafkaSupervisorReportPayload{dataSource='crm_bd_capacity_ds', topic='crm.capacity.event', partitions=3, replicas=1, durationSeconds=3600, active=[{id='index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm', startTime=2020-03-17T23:23:33.411Z, remainingSeconds=46}], publishing=[], suspended=false, healthy=true, state=RUNNING, detailedState=RUNNING, recentErrors=[]}}
2020-03-18 08:23:16,796 INFO [KafkaSupervisor-crm_bd_capacity_ds] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='crm_bd_capacity_ds', generationTime=2020-03-18T00:23:16.796Z, payload=KafkaSupervisorReportPayload{dataSource='crm_bd_capacity_ds', topic='crm.capacity.event', partitions=3, replicas=1, durationSeconds=3600, active=[{id='index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm', startTime=2020-03-17T23:23:33.411Z, remainingSeconds=16}], publishing=[], suspended=false, healthy=true, state=RUNNING, detailedState=RUNNING, recentErrors=[]}}
2020-03-18 08:23:34,477 INFO [IndexTaskClient-crm_bd_capacity_ds-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Task [index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] paused successfully
2020-03-18 08:23:34,515 INFO [KafkaSupervisor-crm_bd_capacity_ds] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='crm_bd_capacity_ds', generationTime=2020-03-18T00:23:34.515Z, payload=KafkaSupervisorReportPayload{dataSource='crm_bd_capacity_ds', topic='crm.capacity.event', partitions=3, replicas=1, durationSeconds=3600, active=[], publishing=[{id='index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm', startTime=2020-03-17T23:23:33.411Z, remainingSeconds=1799}], suspended=false, healthy=true, state=RUNNING, detailedState=RUNNING, recentErrors=[]}}
2020-03-18 08:23:34,872 INFO [qtp343722304-89] org.apache.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm]: SegmentTransactionalInsertAction{segmentsToBeOverwritten=null, segments=[DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-17T23:00:00.000Z_2020-03-18T00:00:00.000Z_2020-03-17T23:26:39.118Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200317T230000.000Z_20200318T000000.000Z/2020-03-17T23_26_39.118Z/0_41746007-f3d8-4bbb-ab51-c899fbced760_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=4309}, DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-18T00:00:00.000Z_2020-03-18T01:00:00.000Z_2020-03-18T00:08:10.800Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200318T000000.000Z_20200318T010000.000Z/2020-03-18T00_08_10.800Z/0_6cbb5e49-dac9-4c15-ae7f-63ca8deb7d49_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=5019}], startMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='crm.capacity.event', partitionSequenceNumberMap={0=2487405, 1=2487408, 2=2487404}, exclusivePartitions=[]}}, endMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamEndSequenceNumbers{stream='crm.capacity.event', partitionSequenceNumberMap={0=2487410, 1=2487413, 2=2487408}}}}
2020-03-18 08:23:34,900 INFO [qtp343722304-89] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2020-03-18T00:23:34.900Z","service":"druid/overlord","host":"druid-master-001:8090","version":"0.16.0-incubating","metric":"segment/txn/failure","value":1,"dataSource":"crm_bd_capacity_ds","taskId":"index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm","taskType":"index_kafka"}
2020-03-18 08:23:34,900 INFO [qtp343722304-89] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2020-03-18T00:23:34.900Z","service":"druid/overlord","host":"druid-master-001:8090","version":"0.16.0-incubating","metric":"task/action/run/time","value":28,"dataSource":"crm_bd_capacity_ds","taskId":"index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm","taskType":"index_kafka"}
2020-03-18 08:23:34,910 INFO [qtp343722304-93] org.apache.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm]: SegmentListUsedAction{dataSource='crm_bd_capacity_ds', intervals=[2020-03-17T23:00:00.000Z/2020-03-18T01:00:00.000Z]}
2020-03-18 08:23:34,915 INFO [qtp343722304-93] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2020-03-18T00:23:34.914Z","service":"druid/overlord","host":"druid-master-001:8090","version":"0.16.0-incubating","metric":"task/action/run/time","value":4,"dataSource":"crm_bd_capacity_ds","taskId":"index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm","taskType":"index_kafka"}
2020-03-18 08:23:35,718 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-data-004:8091] wrote FAILED status for task [index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] on [TaskLocation{host='druid-data-004', port=8100, tlsPort=-1}]
2020-03-18 08:23:35,718 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-data-004:8091] completed task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] with status[FAILED]
2020-03-18 08:23:35,718 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm
2020-03-18 08:23:35,718 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] because: [notified status change from task]
2020-03-18 08:23:35,718 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] on worker[druid-data-004:8091]
2020-03-18 08:23:35,724 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] from activeTasks
2020-03-18 08:23:35,724 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_crm_bd_capacity_ds', dataSource='crm_bd_capacity_ds', interval=2020-03-17T23:00:00.000Z/2020-03-18T00:00:00.000Z, version='2020-03-17T23:26:39.118Z', priority=75, revoked=false}]
2020-03-18 08:23:35,737 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_crm_bd_capacity_ds', dataSource='crm_bd_capacity_ds', interval=2020-03-18T00:00:00.000Z/2020-03-18T01:00:00.000Z, version='2020-03-18T00:08:10.800Z', priority=75, revoked=false}]
2020-03-18 08:23:35,756 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm to status: TaskStatus{id=index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm, status=FAILED, duration=3606169, errorMsg=java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE: Failed to publish se...}
2020-03-18 08:23:35,766 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Task done: AbstractTask{id='index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm', groupId='index_kafka_crm_bd_capacity_ds', taskResource=TaskResource{availabilityGroup='index_kafka_crm_bd_capacity_ds_374d70c186b714c', requiredCapacity=1}, dataSource='crm_bd_capacity_ds', context={forceTimeChunkLock=true, checkpoints={"0":{"0":2487405,"1":2487408,"2":2487404}}, IS_INCREMENTAL_HANDOFF_SUPPORTED=true}}
2020-03-18 08:23:35,766 INFO [Curator-PathChildrenCache-1] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2020-03-18T00:23:35.766Z","service":"druid/overlord","host":"druid-master-001:8090","version":"0.16.0-incubating","metric":"task/run/time","value":3606169,"dataSource":"crm_bd_capacity_ds","taskId":"index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm","taskStatus":"FAILED","taskType":"index_kafka"}
2020-03-18 08:23:35,766 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Task FAILED: AbstractTask{id='index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm', groupId='index_kafka_crm_bd_capacity_ds', taskResource=TaskResource{availabilityGroup='index_kafka_crm_bd_capacity_ds_374d70c186b714c', requiredCapacity=1}, dataSource='crm_bd_capacity_ds', context={forceTimeChunkLock=true, checkpoints={"0":{"0":2487405,"1":2487408,"2":2487404}}, IS_INCREMENTAL_HANDOFF_SUPPORTED=true}} (3606169 run duration)
2020-03-18 08:23:35,767 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] status changed to [FAILED].
2020-03-18 08:23:35,767 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] went bye bye.

This error doesn't happen very often.

terry19850829 commented 4 years ago

task error log:

2020-03-18T00:23:34,863 INFO [publish-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm]: SegmentTransactionalInsertAction{segmentsToBeOverwritten=null, segments=[DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-17T23:00:00.000Z_2020-03-18T00:00:00.000Z_2020-03-17T23:26:39.118Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200317T230000.000Z_20200318T000000.000Z/2020-03-17T23_26_39.118Z/0_41746007-f3d8-4bbb-ab51-c899fbced760_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=4309}, DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-18T00:00:00.000Z_2020-03-18T01:00:00.000Z_2020-03-18T00:08:10.800Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200318T000000.000Z_20200318T010000.000Z/2020-03-18T00_08_10.800Z/0_6cbb5e49-dac9-4c15-ae7f-63ca8deb7d49_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=5019}], startMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='crm.capacity.event', partitionSequenceNumberMap={0=2487405, 1=2487408, 2=2487404}, exclusivePartitions=[]}}, endMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamEndSequenceNumbers{stream='crm.capacity.event', partitionSequenceNumberMap={0=2487410, 1=2487413, 2=2487408}}}}
2020-03-18T00:23:34,868 INFO [publish-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] to overlord: [SegmentTransactionalInsertAction{segmentsToBeOverwritten=null, segments=[DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-17T23:00:00.000Z_2020-03-18T00:00:00.000Z_2020-03-17T23:26:39.118Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200317T230000.000Z_20200318T000000.000Z/2020-03-17T23_26_39.118Z/0_41746007-f3d8-4bbb-ab51-c899fbced760_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=4309}, DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-18T00:00:00.000Z_2020-03-18T01:00:00.000Z_2020-03-18T00:08:10.800Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200318T000000.000Z_20200318T010000.000Z/2020-03-18T00_08_10.800Z/0_6cbb5e49-dac9-4c15-ae7f-63ca8deb7d49_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=5019}], startMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='crm.capacity.event', partitionSequenceNumberMap={0=2487405, 1=2487408, 2=2487404}, exclusivePartitions=[]}}, endMetadata=KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamEndSequenceNumbers{stream='crm.capacity.event', partitionSequenceNumberMap={0=2487410, 1=2487413, 2=2487408}}}}].
2020-03-18T00:23:34,905 INFO [publish-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm]: SegmentListUsedAction{dataSource='crm_bd_capacity_ds', intervals=[2020-03-17T23:00:00.000Z/2020-03-18T01:00:00.000Z]}
2020-03-18T00:23:34,906 INFO [publish-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm] to overlord: [SegmentListUsedAction{dataSource='crm_bd_capacity_ds', intervals=[2020-03-17T23:00:00.000Z/2020-03-18T01:00:00.000Z]}].
2020-03-18T00:23:34,918 INFO [publish-0] org.apache.druid.storage.hdfs.HdfsDataSegmentKiller - Killing segment[crm_bd_capacity_ds_2020-03-17T23:00:00.000Z_2020-03-18T00:00:00.000Z_2020-03-17T23:26:39.118Z] mapped to path[hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200317T230000.000Z_20200318T000000.000Z/2020-03-17T23_26_39.118Z/0_41746007-f3d8-4bbb-ab51-c899fbced760_index.zip]
2020-03-18T00:23:34,948 INFO [publish-0] org.apache.druid.storage.hdfs.HdfsDataSegmentKiller - Killing segment[crm_bd_capacity_ds_2020-03-18T00:00:00.000Z_2020-03-18T01:00:00.000Z_2020-03-18T00:08:10.800Z] mapped to path[hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200318T000000.000Z_20200318T010000.000Z/2020-03-18T00_08_10.800Z/0_6cbb5e49-dac9-4c15-ae7f-63ca8deb7d49_index.zip]
2020-03-18T00:23:34,972 WARN [publish-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Failed publish, not removing segments: [DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-17T23:00:00.000Z_2020-03-18T00:00:00.000Z_2020-03-17T23:26:39.118Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200317T230000.000Z_20200318T000000.000Z/2020-03-17T23_26_39.118Z/0_41746007-f3d8-4bbb-ab51-c899fbced760_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=4309}, DataSegment{binaryVersion=9, id=crm_bd_capacity_ds_2020-03-18T00:00:00.000Z_2020-03-18T01:00:00.000Z_2020-03-18T00:08:10.800Z, loadSpec={type=>hdfs, path=>hdfs://mycluster/druid_one_for_all/segments/crm_bd_capacity_ds/20200318T000000.000Z_20200318T010000.000Z/2020-03-18T00_08_10.800Z/0_6cbb5e49-dac9-4c15-ae7f-63ca8deb7d49_index.zip}, dimensions=[level1_code, level2_code, level3_code, level4_code, level5_code, user_id], metrics=[cnt, merchant_create_cnt, store_create_cnt, audit_pass_cnt, realname_pass_cnt, terminal_bind_cnt, level5_average], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=5019}]
org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
    at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:605) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_242]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
2020-03-18T00:23:34,974 ERROR [publish-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Error while publishing segments for sequenceNumber[SequenceMetadata{sequenceId=0, sequenceName='index_kafka_crm_bd_capacity_ds_374d70c186b714c_0', assignments=[], startOffsets={0=2487405, 1=2487408, 2=2487404}, exclusiveStartPartitions=[], endOffsets={0=2487410, 1=2487413, 2=2487408}, sentinel=false, checkpointed=true}]
org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].
    at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:605) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_242]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
2020-03-18T00:23:34,978 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down immediately...
2020-03-18T00:23:34,980 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[crm_bd_capacity_ds_2020-03-17T23:00:00.000Z_2020-03-18T00:00:00.000Z_2020-03-17T23:26:39.118Z] at path[/druid_olap/segments/druid-data-004:8100/druid-data-004:8100_indexer-executor__default_tier_2020-03-17T23:26:39.205Z_8163eaba53dd40318594fafa58ba16d90]
2020-03-18T00:23:34,980 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[crm_bd_capacity_ds_2020-03-18T00:00:00.000Z_2020-03-18T01:00:00.000Z_2020-03-18T00:08:10.800Z] at path[/druid_olap/segments/druid-data-004:8100/druid-data-004:8100_indexer-executor__default_tier_2020-03-17T23:26:39.205Z_8163eaba53dd40318594fafa58ba16d90]
2020-03-18T00:23:34,980 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - unannouncing [/druid_olap/segments/druid-data-004:8100/druid-data-004:8100_indexer-executor__default_tier_2020-03-17T23:26:39.205Z_8163eaba53dd40318594fafa58ba16d90]
2020-03-18T00:23:34,992 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_kafka_crm_bd_capacity_ds_374d70c186b714c_ocaijcmm]
2020-03-18T00:23:34,992 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannouncing [DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middleManager', host='druid-data-004', bindOnHost=false, port=-1, plaintextPort=8100, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='PEON', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=0, type=indexer-executor, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}].
2020-03-18T00:23:34,992 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - unannouncing [/druid_olap/internal-discovery/PEON/druid-data-004:8100]
2020-03-18T00:23:34,997 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced [DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middleManager', host='druid-data-004', bindOnHost=false, port=-1, plaintextPort=8100, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='PEON', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=0, type=indexer-executor, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}].
2020-03-18T00:23:34,997 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - unannouncing [/druid_olap/announcements/druid-data-004:8100]
2020-03-18T00:23:35,002 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
im2kul commented 4 years ago

@jihoonson We are also facing the same issue with Druid 0.16.1-incubating version. Is this something that can be prioritized and released as a patch ASAP?

@qianmoQ & @Stephan3555 Thank you for the workaround. We will try it out and report back if this error goes away.

smit-aterlo commented 3 years ago

I have an issue that I think is similar to this one: #10672. It occurs after almost 4-5 hours. Can someone please provide a solution?

Thank you

misterbykl commented 3 years ago

@jihoonson @jon-wei

We are experiencing the same issue with version 0.20.0.

Is the data eventually reingested by a later task?

Yes, but it fails again. It retries 5 times and occasionally fails; sometimes it succeeds on the 4th try. ("segmentGranularity": "HOUR")

BTW, even if you saw this error, the new task will consume the same data from Kafka and create the same set of segments. So as long as your Kafka retention is long enough, you should be able to ingest all data.

Our retention time for the related topic is 1 hour. What would be the ideal retention in this case?

That kind of situation should not happen too often, how often did you observe that kind of task failure?

It happens almost every hour during peak time. No task fails during night time.

scimas commented 3 years ago

I encountered the same problem with 0.20.0. The cause seems to have been some inconsistency in the metadata. Fortunately this happened in testing and was solvable by a hard reset of the supervisor.

2021-04-15T20:05:08,573 ERROR [qtp398777435-101] org.apache.druid.metadata.IndexerSQLMetadataStorageCoordinator - Not updating metadata, existing state[KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamEndSequenceNumbers{stream='redacted-topic-name', partitionSequenceNumberMap={0=57933}}}] in metadata store doesn't match to the new start state[KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='redacted-topic-name', partitionSequenceNumberMap={0=0}, exclusivePartitions=[]}}].

This comment led me to check the coordinator-overlord log, where I found this information. It would be helpful if at least the error message in the task log could be improved, though; the current message doesn't point in any helpful direction for further troubleshooting.
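
For anyone looking for the concrete call, a hard reset can be issued through the supervisor API on the overlord, e.g. (a sketch; host, port and supervisor id are placeholders, and note that a hard reset clears the stored offsets and may skip or re-read data):

curl -X POST http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor/my_supervisor/reset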

marioramaglia commented 3 years ago

Confirmed with 0.20.2. It also seems related to the fact that the ingested Kafka topic has partitions; recreating the topic without partitions seems to fix the issue. Please help, thanks.

pjain1 commented 2 years ago

One of the causes of this is resubmitting the supervisor to consume from a different topic without changing the supervisor name. In that case you see the following message in the overlord logs:

Not updating metadata, existing state[KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamEndSequenceNumbers{stream='STREAM1', partitionSequenceNumberMap={}] in metadata store doesn't match to the new start state[KafkaDataSourceMetadata{SeekableStreamStartSequenceNumbers=SeekableStreamStartSequenceNumbers{stream='STREAM2', partitionSequenceNumberMap={}, exclusivePartitions=[]}}]

Druid maintains the end offsets of a topic for the last published segments in the druid_dataSource table and, for consistency, checks that they match the start offsets of the segments currently being published. It uses the datasource name (which is the same as the supervisor name) as the key for storing this metadata. So when you change the topic name, the current start offsets of the task will obviously not match the stored end offsets, and the task will fail.

Solution -

  1. If there is no existing datasource with the same name as the new topic/stream, just terminate the existing supervisor and submit a new supervisor whose datasource name matches the topic name.
  2. If you want to keep the datasource name (this is a hack): terminate the currently running supervisor, delete the row for the corresponding dataSource in the druid_dataSource table, and resubmit the supervisor (see the sketch at the end of this comment). Word of caution: since we are directly editing the metadata store, do this at your own risk, as it may cause data consistency issues for your Druid datasource.

The general advice is to keep the datasource name the same as the topic name, and if you change the topic, create a new supervisor with the changed name.
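
A sketch of option 2 above, assuming a Postgres metadata store and the default Druid table names (terminate the supervisor first and back up the row before deleting):

# inspect the stored end offsets for the datasource
psql -U druid -d druid -c "SELECT dataSource, commit_metadata_payload FROM druid_dataSource WHERE dataSource = 'my_datasource';"
# remove the stored offsets so the resubmitted supervisor starts from a clean state
psql -U druid -d druid -c "DELETE FROM druid_dataSource WHERE dataSource = 'my_datasource';"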

zxf216 commented 9 months ago

We have identified two scenarios where this error occurs. Scenario 1: the task has a large amount of data, which takes a long time to publish; if the supervisor changes during publishing, the real-time task fails. Scenario 2: the first few tasks have no data, so the datasource's start information in the metadata database is never updated; when data arrives, the task fails the start check during publishing.

https://www.processon.com/view/link/657019a9d995a67e9a3bea89