apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

SparkSubmitOperator does not mark task as success after Spark job completes #39524

Closed · dannyeuu closed this issue 6 months ago

dannyeuu commented 6 months ago

Apache Airflow Provider(s)

apache-spark

Versions of Apache Airflow Providers

apache-airflow-providers-apache-hive==6.1.2
apache-airflow-providers-apache-spark==4.1.1
apache-airflow-providers-cncf-kubernetes==7.3.0
apache-airflow-providers-common-sql==1.6.0

Apache Airflow version

2.6.3

Operating System

Debian GNU/Linux 11 (bullseye)

Deployment

Official Apache Airflow Helm Chart

Deployment details

Dockerfile built FROM the apache/airflow:2.6.3-python3.10 image, deployed with the official Airflow Helm chart.

What happened

We run lots of jobs every day, and every day one or two tasks hit this: Spark completes the job, but the Airflow task stays in the running state with no new logs until we manually set its state to success.
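For context, here is a minimal sketch of the kind of DAG that triggers this. The DAG id, task id, connection id, image, and application path below are placeholders, not our real configuration:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Placeholder names; the real DAG is cluster_ira_ciss_daily_3irmaos (see the logs below).
with DAG(
    dag_id="spark_submit_hang_repro",
    start_date=datetime(2024, 5, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    load_task = SparkSubmitOperator(
        task_id="load_client_product",
        conn_id="spark_k8s",  # Spark connection whose master points at k8s://...
        application="local:///opt/app/etl_job.py",  # path inside the driver image
        name="etl-ira-ciss",
        conf={
            "spark.kubernetes.container.image": "xxxx.com/xxxxxx/pyspark:v2_patch13",
            "spark.kubernetes.namespace": "spark",
        },
        verbose=True,  # spark-submit output is what shows up in the task logs below
    )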

For a task showing the wrong behavior, the logs stop like this:

[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:13 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - phase: Running
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:14 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - phase: Succeeded
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Succeeded)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Container final statuses:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application reverent-mcclintock with submission ID spark:reverent-mcclintock-2cf0078f5caa813e-driver finished

What you think should happen instead

Expected behavior: a normal task ends with taskinstance.py marking the task as SUCCESS:

[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:24:04Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:07Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:08 INFO LoggingPodStatusWatcherImpl: Application elegant-rubin with submission ID spark:elegant-rubin-91cab58f5caaaab2-driver finished
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Shutdown hook called
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-175bae5e-77d3-45b1-aa3c-cc88bdd8a462
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-70c6860c-1925-4858-be1f-23e9c6d5723a
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-236d1121-2075-4b6e-bb54-2e3e66decd6a
[2024-05-09, 06:25:09 -03] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=cluster_ira_ciss_daily_3irmaos, task_id=load.load_competitor_store_3irmaos, execution_date=20240508T070000, start_date=20240509T092347, end_date=20240509T092509
[2024-05-09, 06:25:09 -03] {local_task_job_runner.py:225} INFO - Task exited with return code 0
[2024-05-09, 06:25:09 -03] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check

How to reproduce

It's totally random; it happens in different DAGs with different tasks.

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 6 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

RNHTTR commented 6 months ago

Can you see if this is occurring on the latest version of the spark provider package (4.8.0)?
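If it helps, one quick way to confirm which provider version the scheduler and workers are actually running (a minimal sketch, assuming you can open a Python shell inside the Airflow containers):

# Run inside the Airflow container; prints the installed provider version.
from importlib.metadata import version

print(version("apache-airflow-providers-apache-spark"))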

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] commented 6 months ago

This issue has been closed because it has not received response from the issue author.