apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

SparkSubmitHook requires yarn binary #41324

Closed EugeneMSOff closed 1 month ago

EugeneMSOff commented 3 months ago

Apache Airflow Provider(s)

apache-spark

Versions of Apache Airflow Providers

4.9.0

Apache Airflow version

2.9.3

Operating System

Debian GNU/Linux 12 (bookworm)

Deployment

Docker-Compose

Deployment details

No response

What happened

I am trying to use spark-submit with the --master yarn --deploy-mode cluster parameters, and I want the application on the cluster to be killed when I terminate SparkSubmitOperator.

SparkSubmitHook has an on_kill() method, which runs the yarn application -kill command: https://github.com/apache/airflow/blob/45658a8963761ce8a565b481156c847e493fce67/airflow/providers/apache/spark/hooks/spark_submit.py#L709 But it doesn't work, because there is no such binary (yarn) in the PATH. My Airflow instance runs on a host without a Hadoop installation.
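For context, the failing kill path boils down to shelling out to the YARN CLI. A minimal sketch of the command the hook effectively runs (the helper name here is illustrative, not the hook's actual API):

```python
import subprocess

def yarn_kill_cmd(application_id: str) -> list[str]:
    # The command SparkSubmitHook.on_kill() effectively executes; it fails
    # when the Hadoop `yarn` binary is not available on the PATH.
    return ["yarn", "application", "-kill", application_id]

# Example (requires a Hadoop installation that provides `yarn`):
# subprocess.run(yarn_kill_cmd("application_1723456789000_0001"), check=True)
```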

In the hook's docstring, only spark-submit is mentioned as required; there is no word about yarn: https://github.com/apache/airflow/blob/45658a8963761ce8a565b481156c847e493fce67/airflow/providers/apache/spark/hooks/spark_submit.py#L42

What you think should happen instead

As an option, change the state of the application via the REST API: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
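A hedged sketch of that alternative, using the documented `PUT /ws/v1/cluster/apps/{appid}/state` endpoint of the ResourceManager REST API (the ResourceManager address and application id below are placeholders, and no authentication handling is shown):

```python
import json
from urllib import request

def build_yarn_kill_request(rm_base_url: str, application_id: str) -> request.Request:
    """Build the PUT request asking the ResourceManager to move an app to KILLED."""
    url = f"{rm_base_url}/ws/v1/cluster/apps/{application_id}/state"
    body = json.dumps({"state": "KILLED"}).encode("utf-8")
    req = request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    return req

# Sending it requires a reachable ResourceManager, e.g.:
# request.urlopen(build_yarn_kill_request("http://resourcemanager:8088",
#                                         "application_1723456789000_0001"))
```

This would let on_kill() work without any Hadoop binaries on the Airflow host, at the cost of needing network access to the ResourceManager.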

How to reproduce

Use the Spark connection type with deploy mode "cluster" and host "yarn". Run SparkSubmitOperator and mark its state as "failed".
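One way to set up such a connection is through Airflow's `AIRFLOW_CONN_*` environment-variable mechanism; the URI below is an assumed sketch encoding host `yarn` with the `deploy-mode` extra (the connection name is illustrative):

```shell
# Spark connection pointing at a YARN master with cluster deploy mode,
# picked up by Airflow as connection id "spark_default".
export AIRFLOW_CONN_SPARK_DEFAULT='spark://yarn?deploy-mode=cluster'
```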

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 3 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

sunank200 commented 2 months ago

@EugeneMSOff Is the YARN binary available in the PATH of your Airflow instance? It seems that the YARN binary may not be available in your current environment, which could be the reason you're unable to kill Spark applications on the YARN cluster using the on_kill() method. Would you like to attempt installing the YARN binary in your Docker setup and try again?

If you wish, you can create a PR to clarify that when YARN is used as the cluster manager (i.e., when --master yarn is specified), the YARN binary must be available in the PATH of the Airflow instance. This is necessary for operations such as killing YARN applications via the on_kill() method in SparkSubmitHook.

Alternatively, we could consider adding a new feature, using the YARN ResourceManager REST API to manage the application state (e.g., killing), if the YARN binary is unavailable in on_kill() method.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

EugeneMSOff commented 2 months ago

@sunank200 first of all, thanks for your attention.

As I said, no such binary is in the PATH of my Airflow instance. The yarn tool is provided by a Hadoop installation; just copying the binary to the Airflow instance wouldn't be enough.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] commented 1 month ago

This issue has been closed because it has not received response from the issue author.