apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

k8s-based-ingestion: Wait for task lifecycles to enter RUNNING state before returning from KubernetesTaskRunner.start #17446

Closed georgew5656 closed 2 weeks ago

georgew5656 commented 3 weeks ago

Description

It's possible for the KubernetesTaskRunner.start method to return before the threads in the exec pool running KubernetesPeonLifecycle.join have finished gathering information about Kubernetes jobs. This can be a problem because other services on the overlord (such as supervisors) expect the KubernetesTaskRunner to have information about tasks (for example, each task's location) once start has returned.

This diff adds a wait in the start() method that attempts to wait for all discovered tasks to enter the RUNNING state. This state indicates that the KubernetesTaskRunner has all the information it needs about a task.
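
As a rough illustration of the idea (not the code in this diff), the wait can be sketched with a CountDownLatch that each restore thread counts down once its task is RUNNING. Everything here is a hypothetical stand-in: SketchTaskRunner, its State enum, lookupLocation, the pool size, and the timeout parameter are assumptions for the example and are not Druid's actual classes or signatures.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a task runner; NOT Druid's KubernetesTaskRunner.
class SketchTaskRunner {
    enum State { PENDING, RUNNING }

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final Map<String, String> locations = new ConcurrentHashMap<>();
    private final ExecutorService exec = Executors.newFixedThreadPool(4);

    // Restores the discovered tasks on the exec pool, then blocks until every
    // task has reached RUNNING or the timeout elapses. Returns true only if
    // all tasks reached RUNNING in time.
    boolean start(List<String> discoveredTaskIds, long timeoutMillis) {
        CountDownLatch running = new CountDownLatch(discoveredTaskIds.size());
        for (String taskId : discoveredTaskIds) {
            states.put(taskId, State.PENDING);
            exec.submit(() -> {
                // Assumed analogue of the per-task join work: gather the
                // task's details first, then mark it RUNNING.
                locations.put(taskId, lookupLocation(taskId));
                states.put(taskId, State.RUNNING);
                // Counted down only on success; a failed lookup leaves the
                // latch open, so start() returns false after the timeout.
                running.countDown();
            });
        }
        try {
            return running.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    State getState(String taskId) {
        return states.get(taskId);
    }

    String getLocation(String taskId) {
        return locations.get(taskId);
    }

    void stop() {
        exec.shutdownNow();
    }

    // Simulates a slow lookup of where a task is running (made-up format).
    private static String lookupLocation(String taskId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "pod-" + taskId + ":8100";
    }
}
```

In this sketch, a caller that invokes start(List.of("task-a", "task-b"), 5000) and gets true back can immediately read each task's location, which is the guarantee supervisors rely on. Bounding the wait with a timeout is a design choice of the sketch so that a stuck task cannot block startup indefinitely.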

Release note

Bugfix that helps mitigate unexpected behavior when running k8s-based ingestion during overlord restarts.

georgew5656 commented 2 weeks ago

i don't think the failing unit test is related so i'm going to go ahead and merge this