PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
17.41k stars · 1.64k forks

Agent not detecting flow crash when EC2 spot instance revoked #9246

Closed · paulinjo closed this issue 1 year ago

paulinjo commented 1 year ago

Bug summary

We run Prefect on an EKS cluster made primarily of EC2 spot instances. After receiving a BidEvictedEvent, the aws-node-termination-handler will drain the node gracefully, killing any Prefect job pods which may be running on it.

Even though the Prefect agent raises an error that the job container cannot be found, Prefect Cloud leaves the flow run in a Running state instead of marking it as Crashed.

The flow run is using a Kubernetes infrastructure block.

Reproduction

N/A

Logs

[
    {
        "@timestamp": "2023-04-17 12:03:34.902",
        "@message": {
            "az": "us-east-1b",
            "ec2_instance_id": "i-0bfbe6ed3e71b9d24",
            "log": "2023-04-17T12:03:34.902154444Z stdout F 2023/04/17 12:03:34 INF Adding new event to the event store event={\"AutoScalingGroupName\":\"\",\"Description\":\"Spot ITN received. Instance will be interrupted at 2023-04-17T12:05:31Z \\n\",\"EndTime\":\"0001-01-01T00:00:00Z\",\"EventID\":\"spot-itn-aca1aaae362f8bf5b28dcf1b0912c5ea65982e0dd63e647c80a2f78678d55334\",\"InProgress\":false,\"InstanceID\":\"\",\"IsManaged\":false,\"Kind\":\"SPOT_ITN\",\"Monitor\":\"SPOT_ITN_MONITOR\",\"NodeLabels\":null,\"NodeName\":\"ip-10-160-20-154.ec2.internal\",\"NodeProcessed\":false,\"Pods\":null,\"ProviderID\":\"\",\"StartTime\":\"2023-04-17T12:05:31Z\",\"State\":\"\"}"
        },
        "@logStream": "/fluentbit-default",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/dataplane"
    },
    {
        "@timestamp": "2023-04-17 12:03:36.168",
        "@message": {
            "az": "us-east-1b",
            "ec2_instance_id": "i-0bfbe6ed3e71b9d24",
            "hostname": "ip-10-160-20-154.ec2.internal",
            "message": "I0417 12:03:36.167940    4497 kuberuntime_container.go:702] \"Killing container with a grace period\" pod=\"prefect-orion/electric-pigeon-hmbp6-69q59\" podUID=66249ddb-ac0a-4d30-bde3-33e4e5cf2bb4 containerName=\"prefect-job\" containerID=\"containerd://a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0\" gracePeriod=30",
            "systemd_unit": "kubelet.service"
        },
        "@logStream": "kubelet.service-ip-10-160-20-154.ec2.internal",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/dataplane"
    },
    {
        "@timestamp": "2023-04-17 12:04:05.782",
        "@message": {
            "kubernetes": {
                "container_hash": "650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury@sha256:5426e855ad378c8c7be0cfd6c2cabe850a3b4879b5118f6f8ed791d8b539c62d",
                "container_image": "650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury:data-prefect.prefect-runtime",
                "container_name": "prefect-job",
                "docker_id": "a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0",
                "host": "ip-10-160-20-154.ec2.internal",
                "labels": {
                    "controller-uid": "00456970-bc69-4346-ba7a-b7a5529004bb",
                    "job-name": "electric-pigeon-hmbp6"
                },
                "namespace_name": "prefect-orion",
                "pod_id": "66249ddb-ac0a-4d30-bde3-33e4e5cf2bb4",
                "pod_name": "electric-pigeon-hmbp6-69q59"
            },
            "log": "2023-04-17T12:04:05.782652673Z stderr F 12:04:05.781 | INFO    | Task run 'extract_signups_and_revisions-149' - Finished in state Completed()"
        },
        "@logStream": "electric-pigeon-hmbp6-69q59_prefect-orion_prefect-job-a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/application"
    },
    {
        "@timestamp": "2023-04-17 12:04:25.712",
        "@message": {
            "kubernetes": {
                "container_hash": "docker.io/prefecthq/prefect@sha256:e9f83df992b718a1f1a03c2567f3dbba120e6ef70ae9ba62efcdbbc0ef1a37d3",
                "container_image": "docker.io/prefecthq/prefect:2.9.0-python3.9",
                "container_name": "agent",
                "docker_id": "528412c4de7d466e22dd71f54072d6e02be599c8606fd6289eb82f3c9c5c1365",
                "host": "ip-10-160-10-242.ec2.internal",
                "labels": {
                    "app.kubernetes.io/instance": "prefect-agent-orion",
                    "app.kubernetes.io/name": "prefect-orion-agent",
                    "pod-template-hash": "8c7b86d78"
                },
                "namespace_name": "prefect-orion",
                "pod_id": "b805558a-85fd-429a-a013-2867847b1b30",
                "pod_name": "prefect-agent-orion-prefect-orion-agent-8c7b86d78-th67x"
            },
            "log": "2023-04-17T12:04:25.712695276Z stdout F rpc error: code = NotFound desc = an error occurred when try to find container \"a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0\": not found"
        },
        "@logStream": "prefect-agent-orion-prefect-orion-agent-8c7b86d78-th67x_prefect-orion_agent-528412c4de7d466e22dd71f54072d6e02be599c8606fd6289eb82f3c9c5c1365",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/application"
    }
]

Versions

2.10.4

Additional context

No response

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. To keep this issue open, remove the stale label or comment.

paulinjo commented 1 year ago

Have there been any updates which would have addressed this?

zanieb commented 1 year ago

Can you provide a reproduction that does not rely on spot instance eviction? We will need to be able to test changes to resolve this. Ideally the example would not require AWS.

A possible solution is to report flow runs as CRASHED if the infrastructure cannot be found to report a status.
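To make the proposed behavior concrete, here is a minimal sketch (hypothetical names, not Prefect's actual implementation) of the decision logic: if the infrastructure running the flow can no longer be found, report the run as Crashed rather than leaving it Running indefinitely.

```python
# Hypothetical sketch of the proposed behavior; FlowRunState and
# resolve_final_state are illustrative names, not Prefect APIs.
from enum import Enum
from typing import Callable, Optional


class FlowRunState(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    CRASHED = "Crashed"


def resolve_final_state(
    infrastructure_exists: Callable[[], bool],
    exit_code: Optional[int],
) -> FlowRunState:
    """Decide the terminal state for a monitored flow run.

    If the infrastructure (e.g. the Kubernetes job pod) cannot be
    found to report a status, report the run as CRASHED instead of
    leaving it in a Running state.
    """
    if not infrastructure_exists():
        return FlowRunState.CRASHED
    if exit_code == 0:
        return FlowRunState.COMPLETED
    return FlowRunState.CRASHED


# Example: the container was evicted with the spot node, so lookups fail.
state = resolve_final_state(infrastructure_exists=lambda: False, exit_code=None)
print(state.value)  # Crashed
```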

paulinjo commented 1 year ago

Unfortunately I cannot provide a reproduction outside of spot instance eviction.

All of our EKS clusters use exclusively spot instances for ETL jobs to cut back on cost, so this is entirely representative of our workloads.

We also had another instance of this last night which proved to be very disruptive, since some external systems rely on accurate flow run state.

zangell44 commented 1 year ago

@paulinjo I think the handling added for STOPPED jobs and missing containers in this PR https://github.com/PrefectHQ/prefect/pull/10125 should resolve the issue you're seeing.

After the release today, could you try upgrading your agent and runtime environment to 2.10.21?

paulinjo commented 1 year ago

Sure thing.

zangell44 commented 1 year ago

Despite #10125, there are still reports of flow runs not being marked as Crashed correctly when spot instances are revoked. We are continuing to investigate.

zhen0 commented 1 year ago

Possibly connected to #10141

zangell44 commented 1 year ago

After testing internally, we believe the remaining issue after #10125 is that the pod status does not contain termination information, resulting in this error:

An error occurred while monitoring flow run '3baff259-cfac-4685-a4b0-2fc33504685e'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 530, in run
    status_code = await run_sync_in_worker_thread(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 857, in _watch_job
    return first_container_status.state.terminated.exit_code
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'exit_code'

The fix in https://github.com/PrefectHQ/prefect-kubernetes/pull/85 should resolve the issue and we will backport to Prefect Agents too.
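The failure mode can be illustrated with a small sketch (simple stand-in dataclasses, not the real kubernetes client models or the actual fix in prefect-kubernetes): when a pod is evicted, `container_status.state.terminated` can be `None`, so reading `.exit_code` directly raises the `AttributeError` above. A guard for the missing termination record lets the worker report a non-zero code instead of crashing the monitor loop.

```python
# Illustrative stand-ins mimicking the shape of Kubernetes container
# status objects; the guard in get_exit_code shows the kind of check
# needed, not the actual prefect-kubernetes code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Terminated:
    exit_code: int = 0


@dataclass
class ContainerState:
    terminated: Optional[Terminated] = None


@dataclass
class ContainerStatus:
    state: ContainerState


def get_exit_code(status: ContainerStatus) -> int:
    """Return the container exit code, or -1 if termination info is missing."""
    terminated = status.state.terminated
    if terminated is None:
        # Pod was evicted before termination details were recorded;
        # report a non-zero code so the flow run can be marked Crashed.
        return -1
    return terminated.exit_code


# Evicted pod: no termination record available.
print(get_exit_code(ContainerStatus(state=ContainerState())))  # prints -1
```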

zangell44 commented 1 year ago

This issue should be resolved with the release of Prefect 2.11.4 today. Please let us know if you still experience issues!