PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Issue during deployment run failure - 'Could not determine exit code for X' #15259

Open k-wolski opened 1 month ago

k-wolski commented 1 month ago

Bug summary

Not every time, but regularly for failed deployment runs, I see an error log like this:

Could not determine exit code for '{{deployment-name}}'.Exit code will be reported as -1.First container status info did not report an exit code.First container info: {'allocated_resources': None,
 'container_id': 'containerd://93e067da284cae987c07a019cb3713ed54cbaff36e7b3b950830c480aef40ff5',
 'image': '{{ image }}',
 'image_id': '{{ image }}',
 'last_state': {'running': None, 'terminated': None, 'waiting': None},
 'name': 'prefect-job',
 'ready': True,
 'resources': None,
 'restart_count': 0,
 'started': True,
 'state': {'running': {'started_at': datetime.datetime(2024, 9, 4, 15, 1, 18, tzinfo=tzlocal())},
           'terminated': None,
           'waiting': None},
 'volume_mounts': None}.
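
A rough sketch of the kind of check that produces this message (an assumption based on the status dict above, not the worker's actual code): the first container status is inspected for a terminated state, and if none is present the exit code falls back to -1. Here pod stands for a V1Pod object returned by the Kubernetes API.

def first_container_exit_code(pod):
    # The worker looks at the first container's status on the job's pod.
    statuses = pod.status.container_statuses or []
    if statuses and statuses[0].state and statuses[0].state.terminated:
        return statuses[0].state.terminated.exit_code
    # In the log above, state.running is set and state.terminated is None,
    # so there is no exit code to read and -1 is reported instead.
    return -1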

It never happened when we were running on a Prefect agent; it started appearing after the migration.

prefect-worker is running on Kubernetes (both k3s and Azure AKS) with a similar setup, and these logs appear in both environments.

base-job-template (anonymized):

{
  "variables": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string",
        "title": "Name",
        "default": "{{name}}",
        "description": "Name given to infrastructure created by a worker."
      },
      "image": {
        "type": "string",
        "title": "Flow image",
        "default": "{{image}}",
        "description": "Image name and tag that will execute flows, to be provided from deployment"
      },
      "namespace": {
        "type": "string",
        "title": "Namespace",
        "default": "prefect",
        "description": "Namespace name where jobs will be scheduled"
      },
      "resources_request_cpu": {
        "type": "string",
        "title": "CPU request",
        "default": "500m",
        "description": "CPU request for a job scheduled by prefect"
      },
      "resources_limit_memory": {
        "type": "string",
        "title": "Memory limit",
        "default": "32Gi",
        "description": "Memory limit for a job scheduled by prefect"
      },
      "resources_request_memory": {
        "type": "string",
        "title": "Memory request",
        "default": "4Gi",
        "description": "Memory request for a job scheduled by prefect"
      },
      "job_watch_timeout_seconds": {
        "type": "string",
        "title": "Job watch timeout [s]",
        "default": "18000",
        "description": "Job watch timeout in seconds"
      },
      "pod_watch_timeout_seconds": {
        "type": "string",
        "title": "Pod watch timeout [s]",
        "default": "3600",
        "description": "Pod watch timeout in seconds"
      }
    }
  },
  "job_configuration": {
    "env": "",
    "name": "{{ name }}",
    "labels": "",
    "command": "",
    "namespace": "{{ namespace }}",
    "job_manifest": {
      "kind": "Job",
      "spec": {
        "template": {
          "spec": {
            "volumes": [
              {
                "name": "a-storage",
                "emptyDir": {}
              },
              {
                "name": "s3-storage",
                "emptyDir": {}
              },
              {
                "name": "prefect-home",
                "emptyDir": {}
              },
              {
                "name": "prodx",
                "emptyDir": {}
              },
              {
                "name": "prefect-worker-base-job-template-dbt-profile",
                "configMap": {
                  "name": "prefect-worker-base-job-template-dbt-profile"
                }
              }
            ],
            "containers": [
              {
                "env": [
                  {
                    "name": "DBT_PROFILES_DIR",
                    "value": "/home/nesso/dbt_profile"
                  },
                  {
                    "name": "PREFECT_HOME",
                    "value": "/home/nesso/prefect_home"
                  }
                ],
                "name": "prefect-job",
                "image": "{{ image }}",
                "envFrom": [
                  {
                    "configMapRef": {
                      "name": "prefect-worker-base-job-template-vars"
                    }
                  },
                  {
                    "secretRef": {
                      "name": "prefect-worker-base-job-template-secrets"
                    }
                  }
                ],
                "resources": {
                  "limits": {
                    "memory": "{{ resources_limit_memory }}"
                  },
                  "requests": {
                    "cpu": "{{ resources_request_cpu }}",
                    "memory": "{{ resources_request_memory }}"
                  }
                },
                "volumeMounts": [
                  {
                    "name": "a-storage",
                    "mountPath": "/tmp"
                  },
                  {
                    "name": "s3-storage",
                    "mountPath": "{{path}}"
                  },
                  {
                    "name": "prefect-home",
                    "mountPath": "/home/nesso/prefect_home"
                  },
                  {
                    "name": "prodx",
                    "mountPath": "/home/nesso/prodX"
                  },
                  {
                    "name": "prefect-worker-base-job-template-dbt-profile",
                    "mountPath": "/home/nesso/dbt_profile"
                  }
                ],
                "securityContext": {
                  "runAsUser": 1000,
                  "runAsGroup": 1000,
                  "capabilities": {
                    "drop": [
                      "ALL"
                    ]
                  },
                  "runAsNonRoot": true,
                  "readOnlyRootFilesystem": true,
                  "allowPrivilegeEscalation": false
                }
              }
            ],
            "completions": 1,
            "parallelism": 1,
            "tolerations": [],
            "restartPolicy": "Never",
            "securityContext": {
              "runAsUser": 1000,
              "runAsGroup": 1000,
              "capabilities": {
                "drop": [
                  "ALL"
                ]
              },
              "runAsNonRoot": true,
              "readOnlyRootFilesystem": true,
              "allowPrivilegeEscalation": false
            },
            "imagePullSecrets": [
              {
                "name": "prefect-worker-reg-creds"
              }
            ],
            "automountServiceAccountToken": false
          },
          "metadata": {
            "labels": {
              "chart": "prefect-worker",
              "business-owner": "",
              "technical-owner": "",
              "app.kubernetes.io/name": "prefect-worker"
            }
          }
        },
        "backoffLimit": 0,
        "ttlSecondsAfterFinished": 120
      },
      "metadata": {
        "labels": {
          "chart": "prefect-worker",
          "business-owner": "",
          "technical-owner": "",
          "app.kubernetes.io/name": "prefect-worker"
        },
        "namespace": "{{ namespace }}"
      },
      "apiVersion": "batch/v1"
    },
    "stream_output": true,
    "job_watch_timeout_seconds": "{{ job_watch_timeout_seconds }}",
    "pod_watch_timeout_seconds": "{{ pod_watch_timeout_seconds }}"
  }
}

In case any additional information might help, please let me know what is needed. The problem is that it's not a deterministic issue: for the same failure it appears only on some occasions, and it's hard to say what is causing it. For example, when the issue appeared for a failed run, I retried it from the Prefect Cloud UI, and on the second run it didn't happen even though the run failed in exactly the same way.

Version info (prefect version output)

Version:             2.20.4
API version:         0.8.4
Python version:      3.10.14
Git commit:          3b951c35
Built:               Wed, Aug 28, 2024 9:09 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         cloud

Additional context

No response

desertaxle commented 1 month ago

Thanks for the bug report @k-wolski! Could you please share what version of prefect-kubernetes you are using to run your worker?

k-wolski commented 1 month ago

Thank you @desertaxle. prefect-kubernetes is 0.4.2, and I see that 0.5.0 was just released; if you think an upgrade can help, I can try it.

In addition, I found also deployment that was successful, but still there was one error at the end:

Error during task execution: 
NoneType: None

And it's probably this code: https://github.com/PrefectHQ/prefect/blob/main/src/integrations/prefect-kubernetes/prefect_kubernetes/worker.py#L1009
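
For reference, the NoneType: None line is what Python's logging module renders when a traceback is requested (exc_info) but no exception is currently being handled. A minimal reproduction of that output, not the actual worker call site:

import logging

logging.basicConfig()
logger = logging.getLogger("demo")

# No exception is active here, so the "traceback" rendered by exc_info=True
# is literally "NoneType: None".
logger.error("Error during task execution: ", exc_info=True)

# Output:
# ERROR:demo:Error during task execution:
# NoneType: None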

desertaxle commented 1 month ago

prefect-kubernetes 0.5.0 is compatible with version 3 of prefect, but not with 2.x versions. It looks like we're having trouble determining the final state of the Kubernetes job. Are you able to see the final Kubernetes job state via kubectl?

k-wolski commented 1 month ago

I'm not sure what exactly you want me to check, but it's working fine in most cases. Here is the output from Lens for one of the jobs:

[screenshot: Lens output showing the job as Completed]

Similarly, with kubectl get pods I see that the pod is Completed:

k get pods                                      
NAME                              READY   STATUS      RESTARTS   AGE
deployment-name-jz2qt-qg58h       0/1     Completed   0          62s

Also, with kubectl describe pod I see that the pod is in a Completed state:

Containers:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 10 Sep 2024 16:06:25 +0200
      Finished:     Tue, 10 Sep 2024 16:06:48 +0200

But as I mentioned, it's not an error I see every time, so some specific condition must be triggering it, and it's hard to say what that is from my perspective.
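
For completeness, the same state can also be read programmatically; this is a sketch using the synchronous kubernetes client (the worker itself uses kubernetes_asyncio), with the job name and namespace taken from the example output above and therefore illustrative:

from kubernetes import client, config

config.load_kube_config()

batch = client.BatchV1Api()
core = client.CoreV1Api()

# Final Job status: how many pods succeeded/failed.
job = batch.read_namespaced_job_status("deployment-name-jz2qt", "prefect")
print("succeeded:", job.status.succeeded, "failed:", job.status.failed)

# Per-container exit codes on the Job's pods.
pods = core.list_namespaced_pod("prefect", label_selector="job-name=deployment-name-jz2qt")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        terminated = cs.state.terminated
        print(cs.name, "exit code:", terminated.exit_code if terminated else None)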

lucylu-coveo commented 1 month ago

We saw the same issue. Prefect version:

Version:             2.20.7
API version:         0.8.4
Python version:      3.9.20

For the worker, we use the docker image prefecthq/prefect:2.20.7-python3.9-kubernetes.

We started having this issue in Prefect 2.20; before that we were using Prefect 2.18.

k-wolski commented 1 month ago

@desertaxle, do you have any ideas on how to handle it? It's causing a lot of confusion for my team, as they see errors in the logs and come to me asking what they should do.

lucylu-coveo commented 1 month ago

I believe it's related to this issue: https://github.com/PrefectHQ/prefect/issues/14954#issuecomment-2310980939. If so, is it possible to release a fix for Prefect 2? It seems this issue has been fixed in Prefect 3.

desertaxle commented 1 month ago

@lucylu-coveo it's very possible this is related to the issue you linked. I'll port the fix that keeps the connection to Kubernetes alive to our Prefect 2 compatible version of prefect-kubernetes, and we'll see if the issue persists.

desertaxle commented 1 month ago

Upon further inspection, the TCP keep-alive fix was released with prefect-kubernetes==0.4.3. I think this is a separate issue and will continue investigating a fix.

desertaxle commented 1 month ago

I have a potential fix for this issue in #15478. If anyone experiencing this issue can validate the fix, that would be very helpful! You can install the version of prefect-kubernetes with the potential fix with

pip install "git+https://github.com/PrefectHQ/prefect.git@fix-early-job-watch-exit#egg=prefect_kubernetes&subdirectory=src/integrations/prefect-kubernetes"

Otherwise, I'll release a new version of prefect-kubernetes and you can check the fix via a normal pip install.

lucylu-coveo commented 1 month ago

We use prefecthq/prefect:2.20.7-python3.9-kubernetes for the worker. Is there a way to know which version of prefect-kubernetes it's using?

desertaxle commented 1 month ago

@lucylu-coveo that image is packaged with prefect-kubernetes==0.4.3. You can check which version is installed with this command:

docker run prefecthq/prefect:2.20.7-python3.9-kubernetes python -c 'import prefect_kubernetes; print(prefect_kubernetes.__version__)'

If you're using the Docker image to run your Kubernetes worker, testing a dev version will be tough. I'll release a new version of prefect-kubernetes and update the version included in the 2.20.8 images for easier testing.

desertaxle commented 1 month ago

The 2.20.8 version of the prefect images have been updated to include a new version of prefect-kubernetes. Please give it a try and let me know if the issue has been fixed!

lucylu-coveo commented 1 month ago

@desertaxle We tried the image prefecthq/prefect:2.20.8-python3.9-kubernetes, and the issue came back.

desertaxle commented 1 month ago

Thanks for giving it a try @lucylu-coveo! Can you share logs from your worker when the failure occurred?

lucylu-coveo commented 1 month ago

Here's the log from the worker (with some details anonymized). I inspected the logs from other flows experiencing this issue; from what I observed, the error always seems to appear 5 minutes after the Kubernetes Job is created. Are there any timeout configs we missed?

01:59:57 AM  [INFO] Worker 'KubernetesWorker {{ worker_id }}' submitting flow run '{{ flow_run_id }}'

02:00:00 AM  [INFO] Creating Kubernetes job...

02:00:02 AM  [INFO] Job '{{ job_name }}': Starting watch for pod start...

02:00:03 AM  [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.

02:00:03 AM  [INFO] Job '{{ job_name }}': Pod has status 'Pending'.

02:00:03 AM  [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.

02:00:03 AM  [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.

02:00:04 AM  [INFO] Completed submission of flow run '{{ flow_run_id }}'

02:00:05 AM  [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.

02:00:05 AM  [INFO] Job '{{ job_name }}': Pod has status 'Running'.

02:05:06 AM  [ERROR] Error during task execution: 
                     NoneType: None

02:05:06 AM  [ERROR] Could not determine exit code for '{{ job_name }}'.Exit code will be reported as -1.First container status info did not report an exit code.First container info:  {'allocated_resources': None,
                'allocated_resources_status': None,
                'container_id': '{{ container_id }}',
                'image': '{{ image}}',
                'image_id': '{{ image}}',
                'last_state': {'running': None, 'terminated': None, 'waiting': None},
                'name': 'prefect-job',
                'ready': True,
                'resources': None,
                'restart_count': 0,
                'started': True,
                'state': {'running': {'started_at': datetime.datetime(2024, 9, 26, 6, 0, 3, tzinfo=tzlocal())},
                        'terminated': None,
                        'waiting': None},
                'user': None,
                'volume_mounts': None}.

02:05:07 AM [INFO]  Reported flow run '{{ flow_run_id }}' as crashed: Flow run infrastructure exited with non-zero status code -1.

desertaxle commented 1 month ago

Thanks for that extra info @lucylu-coveo! It is odd that this always happens after 5 minutes. Do you have job_watch_timeout_seconds set in your work pool?

lucylu-coveo commented 1 month ago

@desertaxle our job_watch_timeout_seconds is 86400 seconds (= 1 day)

desertaxle commented 1 month ago

@lucylu-coveo in https://github.com/PrefectHQ/prefect/pull/15525, I've updated the worker to not mark the flow run as crashed when the container is still running. It doesn't address the root cause, but it will prevent flow runs from being errantly marked as crashed. I've also improved the error logging so that we can get more insight into what's causing the failure. I'll let you know when the changes are released so you can try them out!
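
Conceptually, the change described above amounts to something like this sketch (illustrative only, not the actual patch in #15525):

def exit_code_or_keep_watching(container_status):
    state = container_status.state
    if state and state.running:
        # The container is still running: too early to infer an exit code,
        # so keep watching instead of reporting the flow run as crashed.
        return None
    if state and state.terminated:
        return state.terminated.exit_code
    # No usable status at all: fall back to -1 as before.
    return -1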

desertaxle commented 1 month ago

A new version of the Kubernetes worker that aims to resolve this issue has been released as prefect_kubernetes==0.4.5. You can get this version via the prefecthq/prefect:2.20.9 images with the -kubernetes suffix. Please try this new version and confirm whether it resolves the issue.

k-wolski commented 1 month ago

@desertaxle, I'm not sure if it's related, but even before upgrading to 2.20.9 (which I did ~1h ago), during the night we had a weird issue for the first time:

[screenshot: flow run with both tasks successful, but the flow run not marked as Completed]

As you can see, both tasks were successful, and after 12s the flow run should have been marked as Completed. But Prefect wasn't able to mark it, and it waited for the 5h timeout we have set; then it was killed with a State changed due to long running. Time out. error.

Again, it's not happening for all flows, only for some of them, and I don't see any pattern yet, but I thought it was already worth informing you about it.

chrisguidry commented 1 month ago

Hi @k-wolski, I was debugging this with @desertaxle and we realized this was a misfire of your automation, which saw the flow run's Completed event before it saw the Running event. I believe this is related to a Prefect Cloud configuration change I made Tuesday that would have made this case slightly more likely. I'll be adjusting that parameter to prevent these kinds of issues from cropping up. Sorry about that!

k-wolski commented 1 month ago

Thank you for your response @chrisguidry. It's great to see that you figured out the root cause on your end, and it also explains why it appeared only for some runs.

Going back to the main topic of this issue, I haven't found any issues with the latest version, but I'm also not actively working on that layer, so it doesn't mean it has fully disappeared. I would wait for @lucylu-coveo to confirm whether it's working fine, and then we can probably close this.

AdamMartineau-Coveo commented 1 month ago

Hello :wave: @k-wolski I'm a colleague of Lucy! We'll try the new version and see if it solves the issue :)

mikeoconnor0308 commented 1 month ago

We experience this issue too - I'll update our workers and report back for another data point.

We often, but not exclusively, see this with pods that are slow to start, for example if they need to download a big container image or wait for a node to scale up.

mikeoconnor0308 commented 3 weeks ago

OK, I think our issue is in a similar area of the code but not the same; I've opened a new issue: https://github.com/PrefectHQ/prefect/issues/15622

I suspect the root cause here may be similar though: kubernetes_asyncio does not seem to respect timeouts.

https://github.com/tomplus/kubernetes_asyncio/issues/308
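
A minimal sketch of the watch pattern in question, assuming a kubernetes_asyncio watch on the job's pods; the names and the timeout value are illustrative. If the stream ends before the pod terminates (as the linked issue suggests can happen), the caller is left with a still-running container and no exit code, which matches the logs above:

import asyncio
from kubernetes_asyncio import client, config, watch

async def watch_job_pods(namespace: str, job_name: str) -> None:
    await config.load_kube_config()
    async with client.ApiClient() as api:
        v1 = client.CoreV1Api(api)
        w = watch.Watch()
        async for event in w.stream(
            v1.list_namespaced_pod,
            namespace=namespace,
            label_selector=f"job-name={job_name}",
            timeout_seconds=3600,  # requested watch duration
        ):
            pod = event["object"]
            print(event["type"], pod.metadata.name, pod.status.phase)
        # If execution reaches this point while the pod is still Running (for
        # example because an underlying HTTP timeout fired first), the job
        # looks "finished" to the caller even though it is not.

asyncio.run(watch_job_pods("prefect", "example-job"))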

soamicharan commented 2 weeks ago

I debugged the issue; there are two possible issues causing this