flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] Map tasks running on SPOT instances - pods stuck in `terminating` state forever #2701

Closed vijaysaravana closed 2 years ago

vijaysaravana commented 2 years ago

Describe the bug

When map tasks are run on SPOT instances and the node dies, the pod stays in a "Terminating" state forever (it never actually finishes terminating or errors out). This causes Flyte to treat the subtask as "running" forever instead of retrying it. Other types of Flyte tasks seem to handle this correctly.

Expected behavior

If a SPOT instance is reclaimed or otherwise becomes unavailable, the subtask should terminate gracefully and be retried up to the specified number of retries.

Additional context to reproduce

  1. Run a heavy map task with ~100 subtasks (using the K8s array plugin) that requires GPU instances on AWS.
  2. Configure the AWS Auto Scaling group to use on-demand instances initially.
  3. Once some map subtasks have SUCCEEDED, switch the Auto Scaling group from on-demand to SPOT instances in the console.
  4. You will find some pods stuck in the `Terminating` state.

Screenshots

Pods list:

vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pods -n dev | grep a8mwjq5z94p55fxhk9zl
a8mwjq5z94p55fxhk9zl-n2-0-0              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-1              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-11             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-13             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-19             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-2              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-24             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-27             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-3              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-30             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-31             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-34             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-36             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-37             0/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-4              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-43             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-49             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-5              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-50             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-53             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-54             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-56             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-57             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-6              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-7              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-72             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-76             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-8              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-82             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-86             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-87             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-9              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-91             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-92             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-94             0/1     Completed          0          9h

'Terminating' pod information:

vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pod a8mwjq5z94p55fxhk9zl-n2-0-27 -n dev -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    flyte.lyft.com/deployment: flyte-l5
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2022-07-19T05:47:08Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-07-19T05:59:45Z"
  finalizers:
  - flyte/array
  labels:
    domain: dev
    execution-id: a8mwjq5z94p55fxhk9zl
    interruptible: "false"
    manager: avora
    node-id: n2
    owner-email: mtoledo
    owner-name: mtoledo
    platform: flyte
    project: avdelorean
    shard-key: "21"
    task-name: src-backend-delorean-delorean-map-base-mapper-run-map-task-0
    team: compute-infra
    workflow-name: src-planning-lib-prediction-metrics-prediction-metrics-processo
  name: a8mwjq5z94p55fxhk9zl-n2-0-27
  namespace: dev
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a8mwjq5z94p55fxhk9zl
    uid: 0d0a8d7c-f935-437c-b339-d003c7643827
  resourceVersion: "9478565284"
  uid: 78bb2022-c7d6-4f47-9832-d12656cbdb2c
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: l5.lyft.com/pool
            operator: In
            values:
            - eks-pdx-pool-gpu
  containers:
  - args:
    - pyflyte-map-execute
    - --inputs
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/inputs.pb
    - --output-prefix
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/0
    - --raw-output-data-prefix
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0
    - --checkpoint-path
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - src.backend.delorean.delorean_map_base
    - task-name
    - run_map_task
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: avdelorean:dev:src.planning.lib.prediction.metrics.prediction_metrics_processor_wfe_map_class.PredictionMetricsProcessorMapWorkflowPerfTest
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: a8mwjq5z94p55fxhk9zl
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: dev
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_TASK_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: FLYTE_INTERNAL_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: KUBERNETES_REQUEST_TIMEOUT
      value: "100000"
    - name: L5_BASE_DOMAIN
      value: l5.woven-planet.tech
    - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
      value: "20"
    - name: AWS_METADATA_SERVICE_TIMEOUT
      value: "5"
    - name: FLYTE_STATSD_HOST
      value: flyte-telegraf.infrastructure
    - name: KUBERNETES_CLUSTER_NAME
      value: pdx
    - name: FLYTE_K8S_ARRAY_INDEX
      value: "27"
    - name: BATCH_JOB_ARRAY_INDEX_VAR_NAME
      value: FLYTE_K8S_ARRAY_INDEX
    - name: L5_DATACENTER
      value: pdx
    - name: L5_ENVIRONMENT
      value: pdx
    - name: RUNTIME_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: RUNTIME_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: RUNTIME_NODE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: L5_NAMESPACE
      value: dev
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imagePullPolicy: IfNotPresent
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    resources:
      limits:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c459j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-162-107-6.us-west-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: flyte-scheduler
  securityContext:
    fsGroup: 65534
  serviceAccount: avdelorean-dev
  serviceAccountName: avdelorean-dev
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: lyft.com/gpu
    operator: Equal
    value: dedicated
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  volumes:
  - name: kube-api-access-c459j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:54:40Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:37Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://e07e4f0cec8265cd15c57833782c305e5581c9d463a97af8307ce3c40bd2c324
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imageID: docker-pullable://ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test@sha256:172ecf248838b1ec88e520528f0125451043769fb31c26d0bfc55057c98afabf
    lastState: {}
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-19T05:51:37Z"
  hostIP: 10.162.107.6
  phase: Running
  podIP: 10.162.72.241
  podIPs:
  - ip: 10.162.72.241
  qosClass: Guaranteed
  startTime: "2022-07-19T05:51:35Z" 


welcome[bot] commented 2 years ago

Thank you for opening your first issue here! 🛠

hamersaw commented 2 years ago

TL;DR: I think the reproduce steps are explainable, but we may want to add a flag to make this work as intended. It would help to have more information to debug this (Pod info from a SPOT instance).

OK, I've had some time to explore this in depth, a few things:

convexquad commented 2 years ago

@hamersaw FYI, Vijay is my intern (his project is to try to use map tasks as a replacement for some of our Spark tasks). I'll try to get you a bit more information about the exact sequence of events in order to determine why the flyte/array finalizer isn't being removed properly from the subtask pods before the spot instance terminates.

FYI, we don't actually use the interruptible label; we have a bunch of logic that adds annotations and tolerations to task pods in order to map them to a specific ASG (we have several different ASGs for various capacity / cost reasons). The big ASG where we run Spark tasks (and now map subtasks) is indeed comprised mostly of spot (GPU) instances.

As you described above, I would think the issue is that the K8s array plugin controller refuses to process the delete API events sent by the spot-termination-handler pod running on the node (which fires when it detects that the node has been selected for spot termination in two minutes), and that it needs to be augmented with logic to look for an annotation like permit-external-deletions, as you suggested.
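
To make that concrete, the kind of check I have in mind would look roughly like the sketch below. This is purely illustrative (not actual Flyte code): the annotation key and the phase names are placeholders.

```go
// Illustrative sketch only (not the flyteplugins implementation): if a pod has
// a deletionTimestamp set by an external actor (spot-termination handler,
// PodGC, kubectl delete) and external deletions are permitted, report the
// subtask as a retryable failure instead of continuing to report it as running.
package example

import corev1 "k8s.io/api/core/v1"

// Hypothetical annotation key borrowed from the discussion above.
const permitExternalDeletionAnnotation = "permit-external-deletions"

type subtaskPhase string

const (
	phaseRunning          subtaskPhase = "running"
	phaseRetryableFailure subtaskPhase = "retryable-failure"
)

func phaseForPod(pod *corev1.Pod) subtaskPhase {
	// A non-nil DeletionTimestamp means something asked the API server to
	// delete the pod; today the array plugin effectively ignores this, so the
	// subtask keeps being reported as running.
	if pod.DeletionTimestamp != nil && pod.Annotations[permitExternalDeletionAnnotation] == "true" {
		return phaseRetryableFailure
	}
	return phaseRunning
}
```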

convexquad commented 2 years ago

Posting a link to the subtask finalizer (for myself so that I can find it again): https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/array/k8s/subtask.go#L216
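
For context, "clearing" that finalizer boils down to something like the following client-go sketch. This is a simplified illustration rather than the linked flyteplugins code; the helper name and client wiring are assumptions.

```go
// Simplified illustration of what "clearing the flyte/array finalizer" amounts
// to: fetch the pod, drop the finalizer from its metadata, and update it so
// that the pending deletion can finally complete.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const arrayFinalizer = "flyte/array"

func clearArrayFinalizer(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Rebuild the finalizer list without the flyte/array entry.
	kept := make([]string, 0, len(pod.Finalizers))
	for _, f := range pod.Finalizers {
		if f != arrayFinalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept

	// Once no finalizer blocks it, the earlier delete request (from the taint
	// manager or PodGC) completes and the pod finally leaves Terminating.
	_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```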

hamersaw commented 2 years ago

@convexquad @vijaysaravana thanks for the clarification! I think this all sounds reasonable. I'll get started on adding a flag to FlytePropeller along the lines of permit-external-deletion that exposes the described functionality.
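
To sketch what such a flag might look like (purely hypothetical; the final name and where it lives in the propeller config may well differ):

```go
// Hypothetical config shape only; the real FlytePropeller flag may be named
// and wired differently. The idea is a single boolean that the k8s plugins
// consult when deciding how to treat externally deleted pods.
package example

type K8sPluginConfig struct {
	// PermitExternalDeletion, when enabled, lets externally deleted task and
	// subtask pods (spot termination, PodGC, manual kubectl delete) be treated
	// as retryable failures instead of being left in a phantom running state.
	PermitExternalDeletion bool `json:"permit-external-deletion"`
}
```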

hamersaw commented 2 years ago

> Posting a link to the subtask finalizer (for myself so that I can find it again): https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/array/k8s/subtask.go#L216

I should note that this code is a near copy-paste of the plugin_manager in FlytePropeller. We have an open issue to refactor this. I'm noting it because it means that map task subtasks and regular K8s container and pod plugins execute almost identically, so we will have to update both locations to ensure that external deletes are recognized.

convexquad commented 2 years ago

I checked the sequence of events in my K8s audit logs and they do happen as we suspected:

* The node is marked as `NodeNotReady` and the pod ready status is set to false

* Our pods have a `NoExecute` toleration set for 300 seconds. When that time period expires, I see `taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="dev/av7mzxptf95tfqdd97f7-n2-0-173"` try to delete the pod. I think this deletion fails due to the `flyte/array` finalizer, but I am not completely sure.

* About a minute later the cluster seems to notice that the pod has no node and tries to GC it. I see `Found orphaned Pod dev/av7mzxptf95tfqdd97f7-n2-0-173 assigned to the Node ip-10-165-218-77.us-west-2.compute.internal. Deleting.` in the event logs.

* Then I see `PodGC is force deleting Pod: dev/av7mzxptf95tfqdd97f7-n2-0-173` and `Forced deletion of orphaned Pod dev/av7mzxptf95tfqdd97f7-n2-0-173 succeeded` about twice an hour repeating every hour, however the deletion is not actually successful (almost certainly due to the finalizer). The pod remains in `Terminating` status in my cluster.

@hamersaw does Propeller already have an informer that notices pod deletion events for task (and subtask) pods? I am curious what logic we can add to Propeller to make this work. I imagine it would work something like this: when the informer sees a pod deletion request, check whether the permit-external-deletion flag (or perhaps a pod annotation) is set. If so, actively call the finalizer routines for the pod (and remove the finalizer so the next pod deletion attempt succeeds). If the task has retries, will this enable Flyte to notice that the (sub)task pod has effectively failed and should be retried?

convexquad commented 2 years ago

FYI, not trying to bribe you, but if we can successfully simplify some of our ETL pipelines with Flyte map tasks (in place of Spark tasks), we are planning to write a nice blog post about it. I think it would attract a lot of attention, since many, many people find Spark difficult to use (of course sometimes you absolutely need Spark, basically when you need shuffle->reduce operations, but we have many use cases that would be much simpler for users with map tasks).

hamersaw commented 2 years ago

@convexquad

> I checked the sequence of events in my K8s audit logs and they do happen as we suspected:
>
> * The node is marked as `NodeNotReady` and the pod ready status is set to false
>
> * Our pods have a `NoExecute` toleration set for 300 seconds. When that time period expires, I see `taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="dev/av7mzxptf95tfqdd97f7-n2-0-173"` try to delete the pod. I think this deletion fails due to the `flyte/array` finalizer, but I am not completely sure.
>
> * About a minute later the cluster seems to notice that the pod has no node and tries to GC it. I see `Found orphaned Pod dev/av7mzxptf95tfqdd97f7-n2-0-173 assigned to the Node ip-10-165-218-77.us-west-2.compute.internal. Deleting.` in the event logs.
>
> * Then I see `PodGC is force deleting Pod: dev/av7mzxptf95tfqdd97f7-n2-0-173` and `Forced deletion of orphaned Pod dev/av7mzxptf95tfqdd97f7-n2-0-173 succeeded` about twice an hour repeating every hour, however the deletion is not actually successful (almost certainly due to the finalizer). The pod remains in `Terminating` status in my cluster.

This is great to hear, happy to validate that this is the actual problem.

> @hamersaw does Propeller already have an informer that notices pod deletion events for task (and subtask) pods? I am curious what logic we can add to Propeller to make this work. I imagine it would work something like this: when the informer sees a pod deletion request, check whether the permit-external-deletion flag (or perhaps a pod annotation) is set. If so, actively call the finalizer routines for the pod (and remove the finalizer so the next pod deletion attempt succeeds). If the task has retries, will this enable Flyte to notice that the (sub)task pod has effectively failed and should be retried?

Yes, in FlytePropeller we have an informer which enqueues the parent workflow when a Pod is updated. Currently, deletes are ignored, but this logic should change in our case.

Then when processing the node we already have a check for deletion. Rather than ignoring this, we should check for the permit-external-deletion flag and proceed to abort / finalize as you mentioned. The finalizers will be removed as part of that process.
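
In rough terms, the informer-side change would look something like the sketch below. This is not the actual FlytePropeller code; the handler wiring and the enqueue helper are illustrative assumptions.

```go
// Sketch of the informer wiring described above: pod updates already enqueue
// the owning workflow; the change is to stop ignoring deletes when external
// deletion is permitted, so the next evaluation can abort/finalize the subtask
// and count the attempt. Names and the enqueue helper are illustrative.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

func registerPodHandlers(
	factory informers.SharedInformerFactory,
	permitExternalDeletion bool,
	enqueueOwningWorkflow func(pod *corev1.Pod),
) {
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			// Existing behavior: any pod update re-enqueues the parent
			// workflow so propeller re-evaluates the node.
			if pod, ok := newObj.(*corev1.Pod); ok {
				enqueueOwningWorkflow(pod)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// Proposed behavior: instead of ignoring deletes, re-enqueue the
			// workflow so the node handler can see the pod is gone and mark
			// the subtask as a retryable failure.
			if !permitExternalDeletion {
				return
			}
			if pod, ok := obj.(*corev1.Pod); ok {
				enqueueOwningWorkflow(pod)
			}
		},
	})
}
```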

As I mentioned, the plugin_manager is basically copy-pasted into the subtask processing for map tasks, so we'll have to make sure this logic is consistent across both (or, better yet, remove the duplication).

hamersaw commented 2 years ago

@convexquad @vijaysaravana looks like this may be much simpler than anticipated. The PRs linked above add support for external deletion of map task subtasks. I'll work on getting them pushed through quickly.

convexquad commented 2 years ago

@hamersaw thanks for the update; it is good to hear that this may be simple. Also, thanks for your explanation of how the Flyte informer works.

One last request from me: if the permit-external-deletion flag is set and a subtask pod is indeed deleted by an external process (e.g. a spot-termination-handler), could you verify that, if the subtask declares retries, Flyte does indeed retry the subtask (and only that subtask, not the entire map task)?

It is super important for subtask retries to work correctly, since the whole idea is to replace large map-only Spark tasks with Flyte map tasks that run in our mixed on-demand + spot GPU ASGs, so that subtasks get correctly retried when there is either node maintenance or spot preemption.

hamersaw commented 2 years ago

> @hamersaw thanks for the update; it is good to hear that this may be simple. Also, thanks for your explanation of how the Flyte informer works.
>
> One last request from me: if the permit-external-deletion flag is set and a subtask pod is indeed deleted by an external process (e.g. a spot-termination-handler), could you verify that, if the subtask declares retries, Flyte does indeed retry the subtask (and only that subtask, not the entire map task)?
>
> It is super important for subtask retries to work correctly, since the whole idea is to replace large map-only Spark tasks with Flyte map tasks that run in our mixed on-demand + spot GPU ASGs, so that subtasks get correctly retried when there is either node maintenance or spot preemption.

Great point. I will double check that everything is working correctly for this specific case, but we did change the map task implementation so that subtasks are retried individually with this PR. So everything should be working as you described.

convexquad commented 2 years ago

@hamersaw we updated our Propeller and I do believe your fix has resolved the issue.

hamersaw commented 2 years ago

@convexquad great to hear! Thanks for being so thoroughly descriptive and responsive; it really helps resolve these issues quickly!