jihoonson opened this issue 6 years ago (status: open)
A `Failed` status doesn't make sense for killed tasks. Their last status should be `stopped` or `killed`.
Yes, please do this. We trigger email alerts for failed tasks, and this distinction would let us make those alerts much less noisy.
Recently I noticed that succeeded tasks are sometimes marked as failed because the supervisor killed them. The sequence was:

1) One of the tasks in the same taskGroup completed successfully, so the supervisor sent a stop request to the other tasks in the group.
2) One of those tasks didn't respond because of a channel disconnection, even though it had actually succeeded.
3) The supervisor called `TaskQueue.shutdown()`, which simply marks the task as failed.

This doesn't produce incorrect results, but it can make Kafka ingestion a bit slower, because the supervisor spawns new tasks to reprocess the same data that the task marked as failed had already processed.
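To make the sequence concrete, here is a rough sketch of that kill path. The class, field, and `stop()` names are hypothetical stand-ins, not Druid's actual API; only `TaskQueue.shutdown()` corresponds to the real call above:

```java
// Hypothetical sketch of the supervisor's kill path described above; only
// TaskQueue.shutdown() corresponds to a real Druid call.
class SupervisorKillPathSketch {
  interface TaskClient { void stop(String taskId) throws Exception; }
  interface TaskQueue { void shutdown(String taskId); }

  private final TaskClient taskClient;
  private final TaskQueue taskQueue;

  SupervisorKillPathSketch(TaskClient taskClient, TaskQueue taskQueue) {
    this.taskClient = taskClient;
    this.taskQueue = taskQueue;
  }

  void stopRemainingTasks(Iterable<String> taskIds) {
    for (String taskId : taskIds) {
      try {
        // Steps 1-2: another task in the group finished, so ask this one to
        // stop. A channel disconnection makes this throw even if the task
        // actually succeeded.
        taskClient.stop(taskId);
      } catch (Exception e) {
        // Step 3: fall back to a hard shutdown, which records the task as
        // failed unconditionally, clobbering a possible success.
        taskQueue.shutdown(taskId);
      }
    }
  }
}
```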
I think there are three issues to be fixed.
1) `IndexTaskClient` didn't retry on channel disconnection even though it's supposed to. The stack trace pointed at https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/common/IndexTaskClient.java#L384. A retry sketch follows this list.
2) The supervisor should check the last task status before killing tasks. In the supervisor, task status is cached in memory and updated periodically in `updateTaskStatus()`, while killing tasks can happen at any time, so the task status in the cache can be out of sync with the status in the metadata store. See the status-check sketch below.
3) A `Failed` status doesn't make sense for killed tasks. Their last status should be `stopped` or `killed`; see the state sketch below.
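For (1), the expected behavior is roughly as follows. This is a minimal sketch and not the actual `IndexTaskClient.submitRequest()` logic; treating every `IOException` as retryable and the backoff numbers are simplifying assumptions:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Minimal retry-on-disconnect sketch for (1); not the real submitRequest().
class RetrySketch {
  static <T> T submitWithRetry(Callable<T> call, int maxRetries) throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        // A dropped channel should be retried rather than surfaced to the
        // supervisor as a task failure.
        if (attempt >= maxRetries) {
          throw e;
        }
        long backoff = Math.min(30_000L, 1000L << Math.min(attempt, 5));
        Thread.sleep(backoff); // capped exponential backoff
      }
    }
  }
}
```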
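For (2), the supervisor could re-read the authoritative status from the metadata store immediately before killing, instead of trusting its periodically refreshed cache. A sketch with the storage and queue interfaces stubbed out (the real Druid types differ in detail):

```java
import java.util.Optional;

// Sketch of the pre-kill status check proposed in (2). TaskStorage and
// TaskQueue are stubbed here; the real Druid interfaces differ in detail.
class PreKillCheckSketch {
  enum TaskState { RUNNING, SUCCESS, FAILED }

  interface TaskStorage { Optional<TaskState> getStatus(String taskId); }
  interface TaskQueue { void shutdown(String taskId); }

  private final TaskStorage taskStorage;
  private final TaskQueue taskQueue;

  PreKillCheckSketch(TaskStorage taskStorage, TaskQueue taskQueue) {
    this.taskStorage = taskStorage;
    this.taskQueue = taskQueue;
  }

  void shutdownUnlessAlreadyComplete(String taskId) {
    // Consult the metadata store (the source of truth), not the in-memory
    // cache that updateTaskStatus() refreshes only periodically.
    Optional<TaskState> status = taskStorage.getStatus(taskId);
    if (status.isPresent() && status.get() != TaskState.RUNNING) {
      // Already terminal (possibly SUCCESS); marking it failed now would be
      // wrong and would trigger needless reprocessing.
      return;
    }
    taskQueue.shutdown(taskId);
  }
}
```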
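For (3), the distinction could be expressed as an extra terminal state. A sketch, assuming the current task states are `RUNNING`, `SUCCESS`, and `FAILED`; `KILLED` is the hypothetical addition:

```java
// Sketch of the state split proposed in (3); KILLED is hypothetical.
public enum TaskStateSketch {
  RUNNING,
  SUCCESS,
  FAILED,  // the task itself errored
  KILLED;  // hypothetical: the task was intentionally shut down

  public boolean isComplete() {
    return this != RUNNING;
  }

  public boolean isFailure() {
    // Alerting on genuine failures only: a killed task no longer counts.
    return this == FAILED;
  }
}
```

With that split, email alerts like the ones mentioned above could filter on `isFailure()` and ignore supervisor-initiated kills.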