Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
@EugeneMSOff Is the YARN binary available on the PATH of your Airflow instance? It seems that the yarn binary may not be available in your current environment, which could be the reason you're unable to kill Spark applications on the YARN cluster using the on_kill() method. Would you like to try installing the YARN binary in your Docker setup and trying again?
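For reference, a quick way to check this from inside the Airflow environment (a minimal sketch; shutil.which mirrors what a which yarn shell lookup would do):

```python
import shutil

# Returns the absolute path of the `yarn` executable if it is on PATH,
# or None if it cannot be found -- the situation described in this issue.
yarn_path = shutil.which("yarn")
print(f"yarn binary: {yarn_path or 'not found on PATH'}")
```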
If you wish, you can create a PR to clarify that when YARN is used as the cluster manager (i.e., when --master yarn is specified), the yarn binary must be available on the PATH of the Airflow instance. This is necessary for operations such as killing YARN applications via the on_kill() method in SparkSubmitHook.
Alternatively, we could consider adding a new feature: if the yarn binary is unavailable, use the YARN ResourceManager REST API in the on_kill() method to manage the application state (e.g., killing it).
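A rough sketch of what that feature could look like, assuming an unsecured cluster (the ResourceManager address and application id below are placeholders; a real implementation would also need to handle Kerberos/SPNEGO authentication and discover the RM address):

```python
import requests

def kill_yarn_app(rm_address: str, application_id: str) -> None:
    """Ask the YARN ResourceManager to kill an application via its REST API.

    Uses the documented Cluster Application State API:
    PUT /ws/v1/cluster/apps/{appid}/state with body {"state": "KILLED"}.
    """
    url = f"{rm_address}/ws/v1/cluster/apps/{application_id}/state"
    response = requests.put(url, json={"state": "KILLED"}, timeout=10)
    response.raise_for_status()

# Hypothetical values for illustration only.
kill_yarn_app("http://resourcemanager:8088", "application_1234567890123_0001")
```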
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
@sunank200 first of all, thanks for your attention.
As I said, that binary is not on the PATH of my Airflow instance. The yarn tool is provided by the Hadoop installation. If I just copy it to the Airflow instance, it won't be enough, since the script depends on the rest of the Hadoop distribution and its cluster configuration.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow Provider(s)
apache-spark
Versions of Apache Airflow Providers
4.9.0
Apache Airflow version
2.9.3
Operating System
Debian GNU/Linux 12 (bookworm)
Deployment
Docker-Compose
Deployment details
No response
What happened
I'm trying to use spark-submit with the --master yarn --deploy-mode cluster parameters, and I want to kill the application on the cluster when I terminate the SparkSubmitOperator.
SparkSubmitHook has an on_kill() method, which runs a yarn application -kill command: https://github.com/apache/airflow/blob/45658a8963761ce8a565b481156c847e493fce67/airflow/providers/apache/spark/hooks/spark_submit.py#L709
But it doesn't work, because there is no such binary (yarn) in the PATH. My Airflow instance runs on a host without a Hadoop installation. In the hook docstring, only spark-submit is mentioned as required; there is no word about yarn: https://github.com/apache/airflow/blob/45658a8963761ce8a565b481156c847e493fce67/airflow/providers/apache/spark/hooks/spark_submit.py#L42
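To illustrate the failure mode in isolation (a simplified stand-in for what the hook effectively runs, not its exact code; the application id is a placeholder):

```python
import subprocess

# Simplified stand-in for the kill command SparkSubmitHook.on_kill() runs.
# On a host without a Hadoop installation this raises FileNotFoundError,
# because the `yarn` executable cannot be resolved on PATH.
try:
    subprocess.run(
        ["yarn", "application", "-kill", "application_1234567890123_0001"],
        check=True,
    )
except FileNotFoundError as err:
    print(f"kill failed: {err}")
```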
What you think should happen instead
As an option, change the state of the application via the REST API: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
How to reproduce
Use the Spark connection type with deploy mode cluster and host yarn in the connection fields. Run a SparkSubmitOperator task and mark its state as "failed".
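A minimal DAG along these lines triggers it (the connection id and application path are placeholders; the connection is assumed to be configured as above):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_yarn_kill_repro",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    # Assumes a Spark connection (here `spark_yarn`) with host `yarn` and
    # deploy mode `cluster`. Marking the running task as "failed" triggers
    # on_kill(), which shells out to the `yarn` binary.
    submit = SparkSubmitOperator(
        task_id="submit_job",
        conn_id="spark_yarn",
        application="/path/to/app.py",
    )
```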
Anything else
No response
Are you willing to submit PR?
Code of Conduct