Open tatiana opened 4 days ago
I released a new version of the Ray provider that swaps the two lines and adds extra logging so we can see what the original exception is:
https://github.com/astronomer/astro-provider-ray/releases/tag/v0.3.0a6
https://pypi.org/project/astro-provider-ray/0.3.0a6/
After upgrading, the customer hit this issue in their Astro deployment:
example-dag-ray-provider-process-data-with-ray-p16as5bx
*** No logs found on s3 for ti=<TaskInstance: example_dag_ray_provider.process_data_with_ray manual__2024-10-10T03:05:51.997591+00:00 [running]>
*** Attempting to fetch logs from pod example-dag-ray-provider-process-data-with-ray-p16as5bx through kube API
*** Reading from k8s pod logs failed: ('Cannot find pod for ti %s', <TaskInstance: example_dag_ray_provider.process_data_with_ray manual__2024-10-10T03:05:51.997591+00:00 [running]>)
And this is from the scheduler logs:
[2024-10-09T21:19:46.001-0700] {kubernetes_executor_utils.py:98} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/astronomer/kubernetes/executors/kubernetes_executor_utils.py", line 85, in run
self.resource_version = self._run(
^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/astronomer/kubernetes/executors/kubernetes_executor_utils.py", line 153, in _run
for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 195, in stream
raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 332505703 (332546384)
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.11/site-packages/astronomer/kubernetes/executors/kubernetes_executor_utils.py", line 85, in run
self.resource_version = self._run(
^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/astronomer/kubernetes/executors/kubernetes_executor_utils.py", line 153, in _run
for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 195, in stream
raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 332505703 (332546384)
The CRE team will continue support in https://astronomer.zendesk.com/agent/tickets/65669/
Context
When attempting to run the following DAG in Astro within a GKE cluster, the task freezes.
Issue
It seems that the Ray SubmitRayJob operator does a lot of work inside a single block that catches a generic Python Exception. The original error is hidden because it is logged only after the attempt to delete the cluster, and that deletion can itself fail:
https://github.com/astronomer/astro-provider-ray/blob/a900d439d02e1b59f980d5a2275f70ef0a05be93/ray_provider/operators/ray.py#L264-L312
By swapping lines 311 and 312, we (and the customer) will be able to see the original problem.
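To illustrate why the order matters, here is a minimal, self-contained sketch of the pattern. It is not the provider's actual code: `submit_job`, `delete_cluster`, and both `execute_*` functions are hypothetical stand-ins, assuming only what is described above (a generic `except Exception` block where cleanup can raise before the original error is logged).

```python
import logging

logger = logging.getLogger("ray_sketch")


def submit_job():
    """Hypothetical job submission that fails with the error we care about."""
    raise ValueError("original failure: job submission rejected")


def delete_cluster():
    """Hypothetical cleanup step that can itself fail."""
    raise RuntimeError("cleanup failed: cluster not reachable")


def execute_hiding_error():
    # Cleanup runs first; if it raises, the logging line is never reached,
    # so the original error never shows up in the task log.
    try:
        submit_job()
    except Exception as e:
        delete_cluster()
        logger.error("Job failed: %s", e)
        raise


def execute_showing_error():
    # Log first, then clean up: the original error is recorded even when
    # the cleanup step raises its own exception.
    try:
        submit_job()
    except Exception as e:
        logger.error("Job failed: %s", e)
        delete_cluster()
        raise
```

In both variants the caller ends up with the cleanup exception, but only the second one writes the original failure to the log before cleanup gets a chance to fail.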