This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.
Have there been any updates which would have addressed this?
Can you provide a reproduction that does not rely on spot instance eviction? We will need to be able to test changes to resolve this. Ideally the example would not require AWS.
A possible solution is to report flow runs as CRASHED if the infrastructure cannot be found to report a status.
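For illustration, a minimal sketch of that approach using Prefect 2's client API (the flow run ID is a placeholder, and this is not the agent's actual implementation):

```python
from prefect.client.orchestration import get_client
from prefect.states import Crashed

async def mark_run_crashed(flow_run_id: str) -> None:
    """Sketch: force a flow run to CRASHED when its infrastructure
    (e.g. the Kubernetes job pod) can no longer be found."""
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id=flow_run_id,
            state=Crashed(message="Infrastructure not found; assuming the pod was evicted."),
            force=True,  # override orchestration rules for a stuck RUNNING run
        )
```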
Unfortunately I cannot provide a reproduction outside of spot instance eviction.
All of our EKS clusters use exclusively spot instances for ETL jobs to cut back on cost, so this is entirely representative of our workloads.
We also had another instance of this last night which proved to be very disruptive, since some external systems rely on accurate flow run state.
@paulinjo I think the handling added for STOPPED jobs and missing containers in this PR https://github.com/PrefectHQ/prefect/pull/10125 should resolve the issue you're seeing.
After the release today, could you try upgrading your agent and runtime environment to 2.10.21?
Sure thing.
Despite #10125, there are still reports of flow runs not being marked as Crashed correctly when spot instances are revoked. We are continuing to investigate.
Possibly connected to #10141.
After testing internally, we think the remaining issue after #10125 is that the pod status lacks termination information, resulting in this error:
An error occurred while monitoring flow run '3baff259-cfac-4685-a4b0-2fc33504685e'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 530, in run
    status_code = await run_sync_in_worker_thread(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 857, in _watch_job
    return first_container_status.state.terminated.exit_code
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'exit_code'
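The root cause is that `first_container_status.state.terminated` can be `None` when a pod is evicted before Kubernetes records termination details. A guard along these lines avoids the AttributeError; this is only a sketch under those assumptions, and the hypothetical helper `_get_exit_code` is not the actual change in the PR below:

```python
from typing import Optional

from kubernetes.client import V1Pod

def _get_exit_code(pod: V1Pod) -> Optional[int]:
    """Sketch: read the first container's exit code, tolerating pods
    evicted before termination info was recorded."""
    container_statuses = (pod.status.container_statuses or []) if pod.status else []
    if not container_statuses:
        return None
    state = container_statuses[0].state
    terminated = state.terminated if state else None
    if terminated is None:
        # Spot eviction can remove the pod without a terminated state;
        # report "unknown" instead of raising AttributeError.
        return None
    return terminated.exit_code
```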
The fix in https://github.com/PrefectHQ/prefect-kubernetes/pull/85 should resolve the issue, and we will backport it to Prefect agents as well.
This issue should be resolved with the release of Prefect 2.11.4 today. Please let us know if you still experience issues!
First check
Bug summary
We run Prefect on an EKS cluster made up primarily of EC2 spot instances. After receiving a BidEvictedEvent, the aws-node-termination-handler drains the node gracefully, killing any Prefect job pods that may be running on it. Even though the Prefect agent raises an error that the job container cannot be found, Prefect Cloud will leave the flow run in a Running state instead of marking it as Crashed.
The flow run is using a Kubernetes infrastructure block.
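For context, the block in use looks roughly like the following (a minimal sketch; the image, namespace, and block name here are placeholders, not our actual configuration):

```python
from prefect.infrastructure import KubernetesJob

# Illustrative block definition; values are placeholders
k8s_job = KubernetesJob(
    image="prefecthq/prefect:2-python3.11",
    namespace="prefect",
)
k8s_job.save("etl-spot-job", overwrite=True)
```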
Reproduction
Logs
Versions
Additional context
No response