k-wolski opened this issue 1 month ago
Thanks for the bug report @k-wolski! Could you please share what version of `prefect-kubernetes` you are using to run your worker?
Thank you @desertaxle. `prefect-kubernetes` is 0.4.2, and I see that 0.5.0 was just released; if you think an upgrade can help, I can try it.
In addition, I also found a deployment run that was successful but still had one error at the end:
```
Error during task execution:
NoneType: None
```
It's probably coming from this code: https://github.com/PrefectHQ/prefect/blob/main/src/integrations/prefect-kubernetes/prefect_kubernetes/worker.py#L1009
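For reference, `NoneType: None` is what Python's logging module prints when an exception-style log call is made while no exception is actually being handled, so there is no traceback to format. A minimal sketch (not the actual worker code) that reproduces the same output:
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

# logger.exception() (or any log call with exc_info=True) made outside an except
# block has no active exception to format, so the "traceback" renders as "NoneType: None".
logger.exception("Error during task execution:")
```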
`prefect-kubernetes` 0.5.0 is compatible with version 3 of `prefect`, but not 2.x versions. It looks like we're having trouble determining the final state of the Kubernetes job. Are you able to see the final Kubernetes job state via `kubectl`?
I'm not sure what exactly you want me to check, but it's working fine in most cases. Here is the output from Lens for one of the jobs:
Similarly, with `kubectl get pods` I see that it's Completed:
```
k get pods
NAME                          READY   STATUS      RESTARTS   AGE
deployment-name-jz2qt-qg58h   0/1     Completed   0          62s
```
Also, with `kubectl describe pod` I see that the pod is in the Completed state:
```
Containers:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 10 Sep 2024 16:06:25 +0200
      Finished:     Tue, 10 Sep 2024 16:06:48 +0200
```
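If it's useful to compare with what the worker sees, the Job's own status (rather than the pod's) can also be read with the same `kubernetes_asyncio` client the worker uses. A rough sketch, with the job name and namespace as placeholders:
```python
import asyncio
from kubernetes_asyncio import client, config

async def main():
    # Local kubeconfig; a worker running inside the cluster would use in-cluster config instead.
    await config.load_kube_config()
    async with client.ApiClient() as api:
        batch = client.BatchV1Api(api)
        # Placeholder job name/namespace; substitute the job shown by `kubectl get jobs`.
        job = await batch.read_namespaced_job("deployment-name-jz2qt", "default")
        print("succeeded: ", job.status.succeeded)
        print("failed:    ", job.status.failed)
        print("conditions:", job.status.conditions)

asyncio.run(main())
```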
But as I mentioned, it's not an error I see every time, so some specific condition must be triggering it, and it's hard to say what that is from my perspective.
We saw the same issue. Prefect version:
```
Version:        2.20.7
API version:    0.8.4
Python version: 3.9.20
```
For the worker, we use the Docker image `prefecthq/prefect:2.20.7-python3.9-kubernetes`.
We started having this issue in Prefect 2.20; before that we were using Prefect 2.18.
@desertaxle, do you have any ideas on how to handle this? It's causing a lot of confusion for my team, as they see errors in the logs and come to me asking what they should do.
I believe it's related to this issue: https://github.com/PrefectHQ/prefect/issues/14954#issuecomment-2310980939. If so, is it possible to release a fix for Prefect 2? It seems this issue has been fixed in Prefect 3.
@lucylu-coveo it's very possible this is related to the issue you linked. I'll port the fix that keeps the connection to Kubernetes alive to our Prefect 2 compatible version of `prefect-kubernetes`, and we'll see if the issue persists.
Upon further inspection, the TCP keep-alive fix was released with `prefect-kubernetes==0.4.3`. I think this is a separate issue and will continue investigating a fix.
I have a potential fix for this issue in #15478. If anyone experiencing this issue can validate the fix, that would be very helpful! You can install the version of `prefect-kubernetes` with the potential fix with:
```
pip install "git+https://github.com/PrefectHQ/prefect.git@fix-early-job-watch-exit#egg=prefect_kubernetes&subdirectory=src/integrations/prefect-kubernetes"
```
Otherwise, I'll release a new version of `prefect-kubernetes` and you can check the fix via a normal pip install.
We use `prefecthq/prefect:2.20.7-python3.9-kubernetes` for the worker. Is there a way to know which version of `prefect-kubernetes` it's using?
@lucylu-coveo that image is packaged with `prefect-kubernetes==0.4.3`. You can check which version is installed with this command:
```
docker run prefecthq/prefect:2.20.7-python3.9-kubernetes python -c 'import prefect_kubernetes; print(prefect_kubernetes.__version__)'
```
If you're using the Docker image to run your Kubernetes worker, testing a dev version will be tough. I'll release a new version of `prefect-kubernetes` and update the version included in the `2.20.8` images for easier testing.
The `2.20.8` prefect images have been updated to include a new version of `prefect-kubernetes`. Please give it a try and let me know if the issue has been fixed!
@desertaxle We tried the image `prefecthq/prefect:2.20.8-python3.9-kubernetes`, and the issue came back.
Thanks for giving it a try @lucylu-coveo! Can you share logs from your worker when the failure occurred?
Here's the log from the worker (some details anonymized). I inspected the logs from other flows experiencing this issue; from what I observed, the error always seems to appear 5 minutes after the Kubernetes Job is created. Are there any timeout configs we missed?
```
01:59:57 AM [INFO] Worker 'KubernetesWorker {{ worker_id }}' submitting flow run '{{ flow_run_id }}'
02:00:00 AM [INFO] Creating Kubernetes job...
02:00:02 AM [INFO] Job '{{ job_name }}': Starting watch for pod start...
02:00:03 AM [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.
02:00:03 AM [INFO] Job '{{ job_name }}': Pod has status 'Pending'.
02:00:03 AM [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.
02:00:03 AM [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.
02:00:04 AM [INFO] Completed submission of flow run '{{ flow_run_id }}'
02:00:05 AM [INFO] Job '{{ job_name }}': Pod '{{ pod_name }}' has started.
02:00:05 AM [INFO] Job '{{ job_name }}': Pod has status 'Running'.
02:05:06 AM [ERROR] Error during task execution:
NoneType: None
02:05:06 AM [ERROR] Could not determine exit code for '{{ job_name }}'.Exit code will be reported as -1.First container status info did not report an exit code.First container info: {'allocated_resources': None,
 'allocated_resources_status': None,
 'container_id': '{{ container_id }}',
 'image': '{{ image }}',
 'image_id': '{{ image }}',
 'last_state': {'running': None, 'terminated': None, 'waiting': None},
 'name': 'prefect-job',
 'ready': True,
 'resources': None,
 'restart_count': 0,
 'started': True,
 'state': {'running': {'started_at': datetime.datetime(2024, 9, 26, 6, 0, 3, tzinfo=tzlocal())},
           'terminated': None,
           'waiting': None},
 'user': None,
 'volume_mounts': None}.
02:05:07 AM [INFO] Reported flow run '{{ flow_run_id }}' as crashed: Flow run infrastructure exited with non-zero status code -1.
```
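For context, the "Could not determine exit code" message above is consistent with the first container status still reporting `state.running` (and no `state.terminated`) at the moment the worker gives up. Roughly, the lookup that comes up empty would look like this (a simplified sketch, not the actual worker code):
```python
from kubernetes_asyncio.client import V1Pod

def infer_exit_code(pod: V1Pod) -> int:
    """Simplified sketch of why a still-running container ends up reported as -1."""
    first = pod.status.container_statuses[0]
    terminated = first.state.terminated  # None while 'running' is still set, as in the log above
    if terminated is None or terminated.exit_code is None:
        return -1  # matches "Exit code will be reported as -1"
    return terminated.exit_code
```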
Thanks for that extra info @lucylu-coveo! It is odd that this always happens after 5 minutes. Do you have `job_watch_timeout_seconds` set in your work pool?
@desertaxle our job_watch_timeout_seconds is 86400 seconds (= 1 day)
@lucylu-coveo I've updated the worker to not mark the flow run as crashed when the container is still running in https://github.com/PrefectHQ/prefect/pull/15525. It doesn't address the root cause but will prevent flow runs from being errantly marked as crashed. I've also improved the error logging so that we can get more insight into what's causing the failure. I'll let you know when the changes are released so you can try them out!
A new version of the Kubernetes worker that aims to resolve this issue has been released in `prefect_kubernetes==0.4.5`. This version is included in the `prefecthq/prefect:2.20.9` images with the `-kubernetes` suffix. Please try this new version to confirm whether it resolves the issue.
@desertaxle, I'm not sure if it's related, but even before upgrading to 2.20.9 (which I did ~1h ago), during the night we had a weird issue for the first time:
As you can see, both tasks were successful, and after 12s the flow run should have been marked as Completed. But Prefect wasn't able to mark it; it waited for the 5h timeout we have set and was then killed with a `State changed due to long running. Time out.` error.
It's not happening for all flows, only for some of them, and I don't see any pattern yet, but maybe it's already worth informing you about it.
Hi @k-wolski, I was debugging this with @desertaxle and we realized this was a misfire of your automation, which saw the flow run's `Completed` event before it saw the `Running` event. I believe this is related to a Prefect Cloud configuration change I made Tuesday that would have made this case slightly more likely. I'll be adjusting that parameter to prevent these kinds of issues from cropping up. Sorry about that!
Thank you for your response @chrisguidry. It's great to see that you figured out the root cause on your end, and it also explains why it appeared only for some runs.
Going back to the main topic of this issue, I haven't found any issues with the latest version, but I'm also not actively working on that layer, so that doesn't mean it has fully disappeared. I would wait for @lucylu-coveo to confirm whether it's working fine, and then we can probably close it.
Hello :wave: @k-wolski I'm a colleague of Lucy! We'll try the new version and see if it solves the issue :)
We experience this issue too; I'll update our workers and report back for another data point.
We often, but not exclusively, see this with pods that are slow to start, for example if they need to download a big container image or scale up a node.
OK, I think our issue is in a similar area of the code but not the same; I've opened a new issue: https://github.com/PrefectHQ/prefect/issues/15622
I suspect the root cause here may be similar though: `kubernetes_asyncio` does not seem to respect timeouts.
I debugged the issue; there are two possible causes:
1. `prefect-kubernetes` uses the `kubernetes_asyncio` library instead of the official Kubernetes client, and there is a bug in `kubernetes_asyncio` where any stream or watch request throws `asyncio.TimeoutError` after 5 minutes. The bug fix MR has already been raised: https://github.com/tomplus/kubernetes_asyncio/pull/337
2. A `Connection reset by peer` exception occurs.
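To illustrate the first point: a watch consumer can treat the ~5-minute `asyncio.TimeoutError` as a cue to re-establish the watch from the last seen resourceVersion rather than as a failure. A rough sketch (not the worker's actual code; names and namespace are placeholders):
```python
import asyncio
from kubernetes_asyncio import client, config, watch

async def watch_pods(namespace: str = "default") -> None:
    await config.load_kube_config()
    async with client.ApiClient() as api:
        v1 = client.CoreV1Api(api)
        resource_version = None
        while True:
            try:
                async for event in watch.Watch().stream(
                    v1.list_namespaced_pod,
                    namespace=namespace,
                    resource_version=resource_version,
                ):
                    # Remember where we are so a reconnect resumes instead of replaying history.
                    resource_version = event["object"].metadata.resource_version
                    print(event["type"], event["object"].metadata.name)
            except asyncio.TimeoutError:
                # The client-side timeout fires after ~5 minutes; re-establish the watch
                # instead of treating it as a job failure.
                continue

asyncio.run(watch_pods())
```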
Bug summary
Not every time, but regularly for failed deployment runs, I see an error log like this:
It never happened when we were running on a Prefect agent; it started appearing after the migration to the worker.
The prefect-worker is running on Kubernetes (both k3s and Azure AKS) with a similar setup, and these logs appear in both environments.
base-job-template (anonymized):
In case any additional information might help, please let me know what is needed. The problem is that the issue is not deterministic: for the same failure it appears only on some occasions, and it's hard to say what is causing it. For example, when the issue appeared for a failed run, I retried it from the Prefect Cloud UI, and on the second run it didn't happen even though the run failed in exactly the same way.
Version info (`prefect version` output)
Additional context
No response