[Improvement][common] get application id in SHELL scripts

gabrywu commented 3 years ago

Describe the question

Currently, if we execute a YARN job in a SHELL script, we find the application IDs by scanning the logs with the regex 'application_\d+_\d+'. I think this is ugly and has performance issues. I suggest that we register an aspect when executing the 'yarn jar' command: we can weave a join point into org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication, where we can get the submitted application ID and the tracking URL, and write them to a local file.

What are the current deficiencies and the benefits of improvement

deficiency: requires the aspectjweaver-1.9.6.jar file, which is about 2 MB in size
benefit: no need to scan the whole log with the regex 'application_\d+_\d+', and no need to restrict the YARN client log level to INFO
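
For reference, the log-scanning approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual DolphinScheduler code: the class name AppIdScanner, the method name findApplicationIds, and the sample log lines are all assumptions made for the example. It shows why every line of the log has to pass through the regex, and why the IDs are only found if the YARN client logs them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class AppIdScanner {
    // YARN application IDs look like application_<clusterTimestamp>_<sequence>.
    private static final Pattern APPLICATION_ID = Pattern.compile("application_\\d+_\\d+");

    // Scans the whole log text and returns each distinct ID in order of first appearance.
    static List<String> findApplicationIds(String log) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = APPLICATION_ID.matcher(log);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        // Sample log text, made up for the example.
        String log = String.join("\n",
                "INFO impl.YarnClientImpl: Submitted application application_1600000000000_0001",
                "INFO mapreduce.Job: Running job: job_1600000000000_0001",
                "INFO impl.YarnClientImpl: Submitted application application_1600000000000_0002");
        System.out.println(findApplicationIds(log));
    }
}
```

If the client log level is raised above INFO, lines like the ones in the sample are never written and the scan finds nothing, which is the restriction mentioned in the benefit above.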

Which version of DolphinScheduler:

all versions

Describe alternatives you've considered

Add the following two environment variables to the global environment:

export YARN_CLIENT_OPTS="-javaagent:/pathto/aspectjweaver-1.9.6.jar"
export YARN_USER_CLASSPATH=/pathto/Aop2YarnClient-1.0-SNAPSHOT.jar

Then, when applications are submitted to the YARN cluster, the aspect in Aop2YarnClient-1.0-SNAPSHOT.jar is registered, and we can get the submitted application ID and the tracking URL.

This is an example; I just output the application ID to the console.

Here is the sample code

The solution is suitable for Hive, Spark, Flink, and other tools that run on the YARN cluster. A test with 'hive -e 'hive sql'' passed.

CalvinKirs commented 3 years ago

I think this is a good idea

gabrywu commented 3 years ago

This is a public repo that implements this function: https://github.com/gabrywu/Aop2YarnClient

xiejiajun commented 3 years ago

This will not be able to fetch the applicationId when the SQL is submitted through HiveServer2. Should we consider storing the appId information in public storage? @gabrywu

gabrywu commented 3 years ago

Do you have any good ideas to resolve it? @xiejiajun

xiejiajun commented 3 years ago

I thought about writing the appId to public storage such as MySQL, but that would introduce additional third-party service configuration such as a JdbcUrl, so we still need to think about it carefully.

gabrywu commented 3 years ago

Yes, so the example project just writes it to a local file.
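
To make the local-file idea concrete, here is a rough sketch of how an application ID and tracking URL could be appended to a file and read back by the scheduler. It is not taken from the Aop2YarnClient project: the class name AppIdFile, the tab-separated line format, the file name, and the example URL are all assumptions for illustration. The aspect itself (the part that intercepts submitApplication) is left out, since it needs AspectJ and Hadoop on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class AppIdFile {
    // Assumed record format: one "<applicationId>\t<trackingUrl>" line per submission.
    static void record(Path file, String applicationId, String trackingUrl) {
        String line = applicationId + "\t" + trackingUrl + System.lineSeparator();
        try {
            Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back the application IDs (first column) in the order they were recorded.
    static List<String> readApplicationIds(Path file) {
        List<String> ids = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (!line.isEmpty()) {
                    ids.add(line.split("\t", 2)[0]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "appids-" + System.nanoTime() + ".txt");
        // Placeholder tracking URL, not a real ResourceManager address.
        record(file, "application_1600000000000_0001",
                "http://rm.example.invalid:8088/proxy/application_1600000000000_0001/");
        System.out.println(readApplicationIds(file));
        file.toFile().delete();
    }
}
```

A file like this only works when the submitting process runs on the same host as the worker that reads it, which is why it does not cover the HiveServer2 case raised above.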

ruanwenjun commented 1 year ago

@caishunfeng

apache / dolphinscheduler

[Improvement][common] get application id in SHELL scripts #4025