TL;DR I think the behavior in the reproduce steps is explainable, but we may want to add a flag to make this work as intended. It would help to have more information to debug this (Pod info on a SPOT instance).
OK, I've had some time to explore this in depth, a few things:
* Flyte sets a finalizer on subtask Pods to stop k8s from garbage collecting completed Pods before Flyte has the ability to detect them. But per the documentation, SPOT instances will attempt to delete a Pod gracefully (with the delete API) and then wait at least 2 minutes before killing it, regardless of finalizers / etc.
* If a Pod still has the `flyte/array` finalizer set, then when the autoscaler attempts to transition it to a SPOT instance it cannot delete the Pod because it has a finalizer, so it is stuck in the `terminating` state. This is a corner case that we should probably handle. I think we could add an option, like `permit-external-deletions` or something, that allows non-terminal Pods to be deleted. There are other conceivable situations where this would be useful.
* I also noticed `false` for the `interruptible` label. Is the task being mapped set up as interruptible? I'm not sure this relates here, but setting up an interruptible task is meant to inject tolerations / etc. to allow execution on SPOT instances.

@hamersaw FYI that Vijay is my intern (and his project is to try to use map tasks as a replacement for some of our Spark tasks). I'll try to get you a little more information about the exact sequence of events in order to determine why the `flyte/array` finalizer isn't being removed properly from the subtask pods before the spot instance terminates.
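For context on the finalizer mechanics discussed above: a delete call against a Pod that carries a finalizer only sets `deletionTimestamp`; the Pod stays in `Terminating` until the finalizer list is emptied by whichever controller owns it. Below is a minimal client-go sketch of clearing the `flyte/array` finalizer. It is illustrative only, not Flyte's actual code; the kubeconfig setup and function name are assumptions, and the pod name is just the example from the logs in this thread.

```go
// Illustrative only -- not Flyte's implementation. Removes the flyte/array
// finalizer from a Pod so an in-flight deletion can complete.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const arrayFinalizer = "flyte/array"

func clearFinalizer(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Keep every finalizer except ours; once the list is empty the API server
	// finishes the pending deletion.
	kept := make([]string, 0, len(pod.Finalizers))
	for _, f := range pod.Finalizers {
		if f != arrayFinalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	// A real controller would typically use a patch or retry on conflict here.
	_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := clearFinalizer(context.Background(), client, "dev", "av7mzxptf95tfqdd97f7-n2-0-173"); err != nil {
		fmt.Println("failed to clear finalizer:", err)
	}
}
```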
FYI that we don't actually use the `interruptible` label, as we have a bunch of logic to add annotations and tolerations to task pods in order to map them to a specific ASG (we have several different ASGs for various capacity / cost reasons). The big ASG where we run Spark tasks (and now map subtasks) is indeed comprised mostly of spot (GPU) instances.
As you described above, I would think that the issue is that the K8s array plugin controller refuses to process `delete` API events sent from the `spot-termination-handler` pod running on the node (when it detects that the node has been selected for spot termination in two minutes), and needs to be augmented with logic to look for an annotation like `permit-external-deletions`, as you suggested.
Posting a link to the subtask finalizer (for myself so that I can find it again): https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/array/k8s/subtask.go#L216
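To make the proposal above concrete, here is a rough sketch of the kind of check being discussed. The types, names, and package are hypothetical, not the actual flyteplugins/FlytePropeller code; the config field simply mirrors the proposed `permit-external-deletion` flag.

```go
// Hypothetical sketch: when a subtask Pod has been asked to delete by something
// outside Flyte (e.g. a spot-termination-handler), only treat it as a failure
// eligible for abort/finalize if the operator opted in via config.
package subtaskgate

import corev1 "k8s.io/api/core/v1"

// Config mirrors the proposed flag; struct and field names are illustrative.
type Config struct {
	PermitExternalDeletion bool
}

// ExternallyDeleted reports whether a delete was issued for a Pod that has not
// reached a terminal phase (Succeeded/Failed).
func ExternallyDeleted(pod *corev1.Pod) bool {
	terminal := pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
	return pod.DeletionTimestamp != nil && !terminal
}

// ShouldAbort decides whether the plugin should run its abort/finalize path,
// which would remove the flyte/array finalizer and let the subtask be retried.
func ShouldAbort(cfg Config, pod *corev1.Pod) bool {
	return cfg.PermitExternalDeletion && ExternallyDeleted(pod)
}
```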
@convexquad @vijaysaravana thanks for the clarification! I think this all sounds reasonable. I'll get started on adding a flag to FlytePropeller along the lines of `permit-external-deletion` that exposes the described functionality.
> Posting a link to the subtask finalizer (for myself so that I can find it again): https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/array/k8s/subtask.go#L216
Should note that this code is a near copy-paste of the plugin_manager in FlytePropeller. We have an open issue to refactor this. I'm noting it because it means that map task subtasks and regular k8s container and pod plugins execute almost identically; we will have to update both locations to ensure that external deletes are recognized.
I checked the sequence of events in my K8s audit logs and they do happen as we suspected:

* The node is marked as `NodeNotReady` and the pod ready status is set to false.
* Our pods have a `NoExecute` toleration set for 300 seconds. When that time period expires, I see `taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="dev/av7mzxptf95tfqdd97f7-n2-0-173"` try to delete the pod. I think this deletion fails due to the `flyte/array` finalizer, but I am not completely sure.
* About a minute later the cluster seems to notice that the pod has no node and tries to GC it. I see `Found orphaned Pod dev/av7mzxptf95tfqdd97f7-n2-0-173 assigned to the Node ip-10-165-218-77.us-west-2.compute.internal. Deleting.` in the event logs.
* Then I see `PodGC is force deleting Pod: dev/av7mzxptf95tfqdd97f7-n2-0-173` and `Forced deletion of orphaned Pod dev/av7mzxptf95tfqdd97f7-n2-0-173 succeeded` about twice an hour, repeating every hour; however the deletion is not actually successful (almost certainly due to the finalizer). The pod remains in `Terminating` status in my cluster.

@hamersaw does Propeller already have an informer that notices pod deletion events for task (and subtask) pods? I am curious as to what logic we can add to Propeller to make this work. I guess the way it would work would be something like: when the informer sees a pod deletion request, check if the `permit-external-deletion` flag (or perhaps pod annotation) is set. If so, actively call the finalizer routines for the pod (and remove the finalizer so the next pod deletion attempt will be successful). If the task has retries, will this enable Flyte to notice that the (sub)task pod has effectively failed and should be retried?
FYI not trying to bribe you, but if we can successfully simplify some of our ETL pipelines with Flyte map tasks (in place of Spark tasks), we are planning to write a nice blog about it. I think it would gather a lot of attention, as many people find Spark difficult to use (of course sometimes you absolutely need Spark, basically when you need shuffle->reduce operations, but we do have many use cases that would be much simpler for users with map tasks).
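For reference, the 300-second window mentioned in the sequence above comes from a `NoExecute` toleration on the Pod: once the node carries the not-ready taint for longer than `tolerationSeconds`, the taint manager starts deleting the Pod. A minimal sketch of such a toleration follows; the values are illustrative, assuming the standard `node.kubernetes.io/not-ready` taint rather than the exact pod template used here.

```go
// Illustrative sketch of a NoExecute toleration with a 300 second grace window.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func notReadyToleration() corev1.Toleration {
	seconds := int64(300) // assumed value; matches the 300s window described above
	return corev1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
}

func main() {
	fmt.Printf("%+v\n", notReadyToleration())
}
```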
@convexquad
This is great to hear, happy to validate that this is the actual problem.
Yes, in FlytePropeller we have an informer which enqueues the parent workflow when a Pod is updated. Currently, deletes are ignored, but this logic should change for our case.
Then when processing the node we already have a check for deletion. Rather than ignoring this, we should check for the `permit-external-deletion` flag and proceed to abort / finalize as you mentioned. The finalizers will be removed as part of that process.
As I mentioned, the plugin_manager is basically copy-pasted into the subtask processing for map tasks, so we'll have to make sure this logic is consistent across both (or better yet, remove the duplication).
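A rough sketch of that informer wiring, with the proposed delete handling included, is shown below. It is illustrative only; the handler and enqueue function names are made up and this is not FlytePropeller's actual code.

```go
// Illustrative informer wiring: enqueue the owning workflow on Pod updates, and
// (the proposed change) also on deletes, so an externally deleted subtask Pod
// can be aborted/finalized instead of being ignored.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	enqueueOwnerWorkflow := func(pod *corev1.Pod) {
		// Hypothetical: in the real controller this would push the parent
		// workflow's key onto the work queue so the node gets re-evaluated.
		fmt.Println("enqueue workflow for pod", pod.Namespace+"/"+pod.Name)
	}

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if pod, ok := newObj.(*corev1.Pod); ok {
				enqueueOwnerWorkflow(pod)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// Proposed behavior: stop ignoring deletes so the abort/finalize
			// path (gated by permit-external-deletion) can run and clear the
			// finalizer. Tombstone objects are simply skipped in this sketch.
			if pod, ok := obj.(*corev1.Pod); ok {
				enqueueOwnerWorkflow(pod)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever; real code would run workers off the queue
}
```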
@convexquad @vijaysaravana looks like this may be much simpler than anticipated. PRs above to support external deletion of map task subtasks. I'll work on getting these pushed through quickly.
@hamersaw thanks for the update, it is good to hear that this may be simple. Also thanks for your explanation about how the Flyte informer works.
One last request from me: if the `permit-external-deletion` flag is set and a subtask pod is indeed deleted by an external process (e.g. a `spot-termination-handler`), could you be sure to verify that if the subtask declares `retries`, Flyte does indeed retry the subtask (and only the subtask, not the entire map task)?
It is super important for subtask retries to work correctly, since the whole idea is to replace large map-only Spark tasks with Flyte map tasks that run in our mixed on-demand + spot GPU ASGs, so that subtasks get correctly retried when there is either node maintenance or spot preemption.
Great point. I will double check that everything is working correctly for this specific case, but we did change the map task implementation so that subtasks are retried individually with this PR. So everything should be working as you described.
@hamersaw we updated our Propeller and I do believe your fix has resolved the issue.
@convexquad great to hear! Thanks for being so thoroughly descriptive and responsive, it really helps resolve these issues quickly!
Describe the bug
When map tasks are run on SPOT instances and the node dies, the pod stays in a "Terminating" state forever (it never actually finishes terminating and erroring out). This causes the subtask to be considered "running" forever (instead of retrying). This seems to work correctly for other types of Flyte tasks.
Expected behavior
If a SPOT instance is not available, the subtask should terminate gracefully and retry for the specified number of retries.
Additional context to reproduce
Run map tasks on SPOT instances; when the node is terminated, the subtask pod gets stuck in the `terminating` state.

Screenshots
Pods list:
'Terminating' pod information: