bitnami / charts

Bitnami Helm Charts
https://bitnami.com
Other
8.97k stars 9.2k forks source link

SparkKubernetesOperator task fails even if the job is successful on airflow #29896

Open naveenbussari opened 1 week ago

naveenbussari commented 1 week ago

Name and Version

bitnami/airflow:2.6.0

What architecture are you using?

amd64

What steps will reproduce the bug?

  1. Running airflow2.6.0 on EKS using Kubernetes executor
  2. Using remote s3 for logging
  3. Deploy a dag with a task using sparkKubernetesOperator
  4. Trigger the job

What is the expected behavior?

The airflow pod creates a spark application that creates a driver pod. The spark-driver goes through the following stages pending->ContainerCreating->Running->Completed. On the airflow UI, the task should fail if the driver pod is in an Error state (or similar states that refer the pod as failed)

What do you see instead?

The spark job runs fine. Executor pods are created and spark application is successfully completed. But on the airflow UI, the task is marked as failed. The logs suggest that the driver pod is in 'ContainerCreating' state. But the driver pod state changed from Pending->ContainerCreating->Running within a few seconds (around 20).

I am attaching the logs.

route-deviations-map-match-deviations-z123vrv9
*** Found logs in s3:
***   * s3://k8s-airflow-cluster-logs/maps-data/prod/dag_id=route-deviations/run_id=manual__2024-10-14T07:47:05+00:00/task_id=map-match-deviations/attempt=1.log
[2024-10-14, 11:53:30 UTC] {taskinstance.py:1125} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: route-deviations.map-match-deviations manual__2024-10-14T07:47:05+00:00 [queued]>
[2024-10-14, 11:53:30 UTC] {taskinstance.py:1125} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: route-deviations.map-match-deviations manual__2024-10-14T07:47:05+00:00 [queued]>
[2024-10-14, 11:53:30 UTC] {taskinstance.py:1331} INFO - Starting attempt 1 of 1
[2024-10-14, 11:53:30 UTC] {taskinstance.py:1350} INFO - Executing <Task(SparkKubernetesOperator): map-match-deviations> on 2024-10-14 07:47:05+00:00
[2024-10-14, 11:53:30 UTC] {standard_task_runner.py:57} INFO - Started process 22 to run task
[2024-10-14, 11:53:30 UTC] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'route-deviations', 'map-match-deviations', 'manual__2024-10-14T07:47:05+00:00', '--job-id', '723', '--raw', '--subdir', 'DAGS_FOLDER/git_maps/route-deviations.py', '--cfg-path', '/tmp/tmpahk0bnip']
[2024-10-14, 11:53:30 UTC] {standard_task_runner.py:85} INFO - Job 723: Subtask map-match-deviations
[2024-10-14, 11:53:30 UTC] {task_command.py:410} INFO - Running <TaskInstance: route-deviations.map-match-deviations manual__2024-10-14T07:47:05+00:00 [running]> on host route-deviations-map-match-deviations-z123vrv9
[2024-10-14, 11:53:30 UTC] {pod_generator.py:529} WARNING - Model file  does not exist
[2024-10-14, 11:53:31 UTC] {taskinstance.py:1568} INFO - Exporting env vars: AIRFLOW_CTX_DAG_EMAIL='joshua.d@xxxxx.com' AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='route-deviations' AIRFLOW_CTX_TASK_ID='map-match-deviations' AIRFLOW_CTX_EXECUTION_DATE='2024-10-14T07:47:05+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2024-10-14T07:47:05+00:00'
[2024-10-14, 11:53:31 UTC] {base.py:73} INFO - Using connection ID 'kubernetes_default' for task execution.
[2024-10-14, 11:53:51 UTC] {spark_kubernetes.py:119} INFO - 
 printing the driver pod status : Pending
[2024-10-14, 11:53:51 UTC] {taskinstance.py:1847} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py", line 120, in execute
    for line in pod_log_stream:
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 163, in stream
    resp = func(*args, **kwargs)
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 23747, in read_namespaced_pod_log
    return self.read_namespaced_pod_log_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 23866, in read_namespaced_pod_log_with_http_info
    return self.api_client.call_api(
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '82f836dc-b272-4242-be3b-0f148708f027', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 14 Oct 2024 11:53:51 GMT', 'Content-Length': '234'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \\"spark-kubernetes-driver\\" in pod \\"map-match-deviations-723-driver\\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}\n'
[2024-10-14, 11:53:51 UTC] {taskinstance.py:1368} INFO - Marking task as FAILED. dag_id=route-deviations, task_id=map-match-deviations, execution_date=20241014T074705, start_date=20241014T115330, end_date=20241014T115351
[2024-10-14, 11:53:51 UTC] {logging_mixin.py:149} WARNING - /opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/utils/email.py:152 RemovedInAirflow3Warning: Fetching SMTP credentials from configuration variables will be deprecated in a future release. Please set credentials using a connection instead.
[2024-10-14, 11:53:51 UTC] {email.py:268} INFO - Email alerting: attempt 1
[2024-10-14, 11:53:51 UTC] {email.py:280} INFO - Sent an alert email to ['joshua.d@xxxxx.com']
[2024-10-14, 11:53:51 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 723 for task map-match-deviations ((400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '82f836dc-b272-4242-be3b-0f148708f027', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 14 Oct 2024 11:53:51 GMT', 'Content-Length': '234'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \\"spark-kubernetes-driver\\" in pod \\"map-match-deviations-723-driver\\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}\n'
; 22)
[2024-10-14, 11:53:51 UTC] {local_task_job_runner.py:232} INFO - Task exited with return code 1

Additional information

I tried using the configs but it isn't solved.

extraEnvVars:
- name: AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC
      value: "120"
    - name: AIRFLOW__WEBSERVER__LOG_FETCH_DELAY_SEC
      value: "60"
carrodher commented 1 week ago

Hi, the issue may not be directly related to the Bitnami container image/Helm chart, but rather to how the application is being utilized, configured in your specific environment, or tied to a particular scenario that is not easy to reproduce on our side.

If you think that's not the case and want to contribute a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

Suppose you have any questions about the application, customizing its content, or technology and infrastructure usage. In that case, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.

naveenbussari commented 1 week ago

The issue here seems to be that the webserver doesn't wait long enough before a pod (spark driver pod in this case) changes it's phase from ContainerCreating to Running.