Supervisor marks succeeded replicas as failed too aggressively

jihoonson commented 5 years ago

Affected Version

All versions since 0.9.1.

Description

The seekableSupervisor does the following when a replica succeeds:

1. Check the status of all other replicas from taskStorage.
2. Stop all replicas that have not finished yet.
   - For tasks of unknown status, the supervisor kills them.
   - If the stop request fails for some tasks, the supervisor kills them.

However, there is a race in this algorithm because task status is not updated in real time. Instead, the supervisor updates it per runNotice. As a result, the supervisor can kill tasks that have already finished successfully if their status has not been updated yet. These tasks are then marked as failed even though the task logs show they succeeded, which is very confusing.
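To make the race concrete, here is a minimal sketch of the steps above in Python. It is a model for illustration only, not Druid code: `handle_replica_success`, `stored_status`, `stop_task` and `kill_task` are hypothetical names standing in for the supervisor logic, the task storage view, and the stop/kill requests.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


def handle_replica_success(succeeded_id, stored_status, stop_task, kill_task):
    """Hypothetical model of the steps above; returns the ids that were killed."""
    killed = []
    for task_id, status in stored_status.items():
        if task_id == succeeded_id or status in (Status.SUCCESS, Status.FAILED):
            continue  # already known to be finished
        # Here the stored status is either unknown (None) or RUNNING.
        if status is None or not stop_task(task_id):
            kill_task(task_id)
            killed.append(task_id)
    return killed


# The stored view is stale: replica_2 has really finished, but still reads RUNNING.
stored = {"replica_1": Status.SUCCESS, "replica_2": Status.RUNNING}
actually_finished = {"replica_1", "replica_2"}
final_status = {}


def stop_task(task_id):
    # Assumption: a task that has already finished no longer answers the stop request.
    return task_id not in actually_finished


def kill_task(task_id):
    # Assumption: a killed task ends up recorded as FAILED.
    final_status[task_id] = Status.FAILED


killed = handle_replica_success("replica_1", stored, stop_task, kill_task)
```

In this model `replica_2` ends up in `killed` and is recorded as `FAILED`, even though it had already finished its work.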

One way to work around this problem is to check task status more eagerly. However, this would only make the issue happen less often. I think we will eventually need the following changes:

- Update task status immediately when the status change is notified to the overlord.
- Add a new task status for canceled tasks.
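One possible reading of these two changes, sketched in Python. Everything here is an assumption for illustration: the `CANCELED` state does not exist today, `StatusBoard` is a hypothetical stand-in for wherever the overlord records status, and the rule that a terminal state is never overwritten is my own addition.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    CANCELED = "CANCELED"  # hypothetical new state for tasks stopped on purpose


TERMINAL = {TaskState.SUCCESS, TaskState.FAILED, TaskState.CANCELED}


class StatusBoard:
    """Hypothetical status record that is updated as soon as a change is reported."""

    def __init__(self):
        self._states = {}

    def notify(self, task_id, state):
        # Assumption: once a task reaches a terminal state, later updates are ignored.
        current = self._states.get(task_id)
        if current in TERMINAL:
            return current
        self._states[task_id] = state
        return state

    def cancel(self, task_id):
        # Canceling a finished task leaves its status alone; a running task becomes CANCELED.
        return self.notify(task_id, TaskState.CANCELED)


board = StatusBoard()
board.notify("replica_2", TaskState.RUNNING)
board.notify("replica_2", TaskState.SUCCESS)  # recorded immediately
board.notify("replica_3", TaskState.RUNNING)

finished = board.cancel("replica_2")  # stays SUCCESS
stopped = board.cancel("replica_3")   # becomes CANCELED, not FAILED
```

With both changes, a replica that already succeeded keeps its status, and a replica that was stopped deliberately is distinguishable from one that actually failed.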

I'm seeing this problem very frequently in our cluster, so I'm marking it as a release blocker for 0.15.0.

gianm commented 5 years ago

Did something change recently to make this more frequent? I'm asking because it seems strange to me that "all versions since 0.9.1" could be affected, yet it is now happening so often that it should be a release blocker. This didn't happen incredibly often in the past, AFAIK.

jihoonson commented 5 years ago

I have the same feeling, but I'm not 100% sure what made this more frequent. Maybe https://github.com/apache/incubator-druid/pull/7234 is somewhat related. And unannouncePropagationDelay is what makes this happen every time.

jihoonson commented 5 years ago

I looked into this more deeply and found that this is what's happening:

1. The task finished and unregistered its chatHandler.
2. But the process running the task was not terminated immediately. For example, unannouncePropagationDelay can keep the process from terminating for a while.
3. The task status is updated only after the task process terminates (ForkingTaskRunner).
4. So, while the task process was waiting to terminate, the task status in the metadata store was still RUNNING even though the task itself had already finished.
5. The supervisor killed the task because the shutdown request returned the error "Can't find chatHandler".
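The sequence above can be modeled with a small Python sketch. This is not Druid code: `TaskProcess` and `supervisor_stop` are hypothetical names, and the assumption that a killed task is recorded as FAILED follows the behavior described in this issue.

```python
class TaskProcess:
    """Hypothetical model of a task running in a forked process."""

    def __init__(self):
        self.chat_handler_registered = True
        self.process_alive = True
        self.stored_status = "RUNNING"

    def finish_work(self):
        # The task finishes and unregisters its chat handler, but the process
        # lingers (for example during unannouncePropagationDelay).
        self.chat_handler_registered = False

    def process_exit(self):
        # The stored status is only written once the process has terminated.
        if self.process_alive:
            self.process_alive = False
            self.stored_status = "SUCCESS"

    def request_shutdown(self):
        if not self.chat_handler_registered:
            raise LookupError("Can't find chatHandler")

    def kill(self):
        if self.process_alive:
            self.process_alive = False
            self.stored_status = "FAILED"


def supervisor_stop(task):
    """Stop a replica; fall back to kill if the shutdown request fails."""
    if task.stored_status != "RUNNING":
        return task.stored_status
    try:
        task.request_shutdown()
    except LookupError:
        task.kill()
    return task.stored_status


# Race: the supervisor acts while the finished task's process is still alive.
racing = TaskProcess()
racing.finish_work()
racing_result = supervisor_stop(racing)

# No race: the process exits before the supervisor looks at the task.
clean = TaskProcess()
clean.finish_work()
clean.process_exit()
clean_result = supervisor_stop(clean)
```

In this model `racing_result` is `"FAILED"` and `clean_result` is `"SUCCESS"`. The only difference between the two runs is whether the process had exited before the supervisor sent the shutdown request.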

jihoonson commented 5 years ago

I'm removing this from 0.15.0 since this issue has existed for a while, which means it's not a regression. Instead, I will mention this problem in the release notes.

apache / druid

Supervisor marks succeeded replicas as failed too aggressively #7828
