apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Improvement][common] get application id in SHELL scripts #4025

Closed gabrywu closed 6 days ago

gabrywu commented 3 years ago

Describe the question

For now, if we run a YARN job from a SHELL script, we find the application IDs by scanning the logs with the regex 'application_\d+_\d+'. I think this is ugly and has performance issues. So I suggest that we register an aspect when the 'yarn jar' command is executed: we can weave a join point into org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication, where we can get the submitted application ID and the tracking URL and write them to a local file.
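For reference, the log-scanning approach described above amounts to something like the following sketch. It is not DolphinScheduler's actual code: the class and method names are hypothetical, and the log lines in main are made-up sample data. It shows why the cost grows with log size, since every line of task output has to be matched against the pattern.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class YarnAppIdLogScanner {
    // YARN application IDs have the form application_<clusterTimestamp>_<sequence>.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

    /** Scans every log line and returns the distinct application IDs in first-seen order. */
    static Set<String> findAppIds(List<String> logLines) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = APP_ID.matcher(line);
            while (m.find()) {
                ids.add(m.group());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real Hadoop output.
        List<String> log = List.of(
                "task output mentioning application_1600000000000_0001",
                "an unrelated line without any ID",
                "tracking page for application_1600000000000_0001 is ready");
        System.out.println(findAppIds(log));
    }
}
```

The scan can only see IDs that the job happens to print, which is the fragility the aspect-based proposal tries to avoid.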

What are the current deficiencies and the benefits of improvement

Which version of DolphinScheduler:

Describe alternatives you've considered

Add the following two variables to the global environment:

export YARN_CLIENT_OPTS="-javaagent:/pathto/aspectjweaver-1.9.6.jar"
export YARN_USER_CLASSPATH=/pathto/Aop2YarnClient-1.0-SNAPSHOT.jar

Then, when an application is submitted to the YARN cluster, the aspect in Aop2YarnClient-1.0-SNAPSHOT.jar is registered, and we can get the submitted application ID and the tracking URL.

This is an example; I just print the application ID to the console. [image]

Here is the sample code: [image]

The solution is suitable for Hive, Spark, Flink, and other tools that run on a YARN cluster. A test with "hive -e 'hive sql'" passed.

CalvinKirs commented 3 years ago

I think this is a good idea

gabrywu commented 3 years ago

This is a public repo which can achieve this function, https://github.com/gabrywu/Aop2YarnClient

xiejiajun commented 3 years ago

It will not be able to fetch the applicationId when the SQL is submitted through HiveServer2. Should we consider storing the appId information in public storage? @gabrywu

gabrywu commented 3 years ago

Do you have any good ideas to resolve it? @xiejiajun

xiejiajun commented 3 years ago

I thought about writing the appId to public storage such as MySQL, but that would introduce additional third-party service configuration such as a JdbcUrl, so we still need to think about it carefully.

gabrywu commented 3 years ago

Yes, which is why the example project just writes it to a local file.
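A minimal sketch of such a local-file hand-off might look like the code below. It is not taken from Aop2YarnClient: the class name, the method names, and the tab-separated "appId, trackingUrl" line format are all assumptions made for illustration, and the IDs and URLs in main are placeholder values.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class AppIdFileSink {
    /** Appends one "appId<TAB>trackingUrl" line; this format is an assumption of the sketch. */
    static void record(Path file, String appId, String trackingUrl) throws IOException {
        String line = appId + "\t" + trackingUrl + System.lineSeparator();
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Reads back the application IDs (first column) recorded so far. */
    static List<String> readAppIds(Path file) throws IOException {
        List<String> ids = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (!line.isEmpty()) {
                ids.add(line.split("\t", 2)[0]);
            }
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("appids", ".tsv");
        // Placeholder values; in the proposal these would come from the woven advice.
        record(file, "application_1600000000000_0001", "http://rm.example:8088/proxy/application_1600000000000_0001/");
        record(file, "application_1600000000000_0002", "http://rm.example:8088/proxy/application_1600000000000_0002/");
        System.out.println(readAppIds(file));
        Files.delete(file);
    }
}
```

A file like this is only visible on the host where the submitting JVM runs. That is presumably the limitation behind the HiveServer2 case raised above, where the job is submitted from the HiveServer2 process rather than from the shell task.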

ruanwenjun commented 1 year ago

@caishunfeng