DTStack / Taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
https://dtstack.github.io/Taier/
Apache License 2.0

[Bug] Spark jar task failed to run #1114

Open Narcasserun opened 1 year ago

Narcasserun commented 1 year ago

What happened

After configuring the cluster components, I ran a Spark jar task and it failed to run.

What you expected to happen

image

How to reproduce

I ran the Spark Pi example task with a parameter of 10 or 100. The ApplicationMaster treated the parameter as the driver host (it tries to connect to the driver at 10:0):

Application application_1693541457708_0007 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1693541457708_0007_000001 exited with exitCode: 10
Failing this attempt.Diagnostics: [2023-09-01 15:56:35.718]Exception from container-launch.
Container id: container_e130_1693541457708_0007_01_000001
Exit code: 10
[2023-09-01 15:56:35.719]Container exited with a non-zero exit code 10. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
etrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:33 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:34 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:35 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:35 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:35 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:35 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:35 ERROR yarn.ApplicationMaster: Failed to connect to driver at 10:0, retrying ...
23/09/01 15:56:35 ERROR yarn.ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:579)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:434)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:256)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:766)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:764)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:787)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
23/09/01 15:56:35 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
23/09/01 15:56:35 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
23/09/01 15:56:35 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://lcc-ambari-server01:8020/user/admin/.sparkStaging/application_1693541457708_0007
23/09/01 15:56:35 INFO util.ShutdownHookManager: Shutdown hook called
For more detailed output, check the application tracking page: http://lcc-ambari-server01:8188/applicationhistory/app/application_1693541457708_0007 Then click on links to logs of each attempt.
. Failing the application.

Anything else

No response

Version

master

Narcasserun commented 1 year ago

@vainhope @mortalYoung

vainhope commented 1 year ago

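From the log, the Spark task's ApplicationMaster cannot connect to the driver, so the task fails. Could you check whether there is a network connectivity problem?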
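
One rough way to check connectivity is to attempt a plain TCP connection from a NodeManager host to the driver's address. The sketch below is only an illustration: the helper name can_connect and the host and port in the example are hypothetical, and the real values would come from the driver's own log or from spark.driver.host and spark.driver.port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical driver address; replace with the real host and port.
    print(can_connect("driver-host.example", 40123))
```

A False result only says that this host could not open the connection. It does not distinguish a firewall from a wrong address, so the address the ApplicationMaster actually tried is still worth checking in the log.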

Narcasserun commented 1 year ago

It takes the input parameter of the Spark jar task on Taier as the driver's host, and 0 as the port. I tried different Spark jar tasks and they all hit the same problem. @vainhope
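
To illustrate how a bare task argument can end up as "10:0", here is a minimal sketch of a host:port parser that falls back to port 0 when the string has no colon. As far as I can tell this mirrors what Spark 2.x does in Utils.parseHostPort, but that is an assumption on my side, and parse_host_port is my own name, not Spark or Taier code.

```python
from typing import Tuple


def parse_host_port(host_port: str) -> Tuple[str, int]:
    """Split 'host:port'; if there is no colon, treat the whole string as the host and use port 0."""
    idx = host_port.rfind(":")
    if idx == -1:
        # No port given: a bare argument such as "10" becomes host "10", port 0.
        return host_port, 0
    return host_port[:idx].strip(), int(host_port[idx + 1:].strip())


if __name__ == "__main__":
    # "10" and "100" are the task parameters from this issue; the last value is a made-up driver address.
    for arg in ("10", "100", "192.168.1.5:40123"):
        host, port = parse_host_port(arg)
        print(f"{arg!r} -> driver at {host}:{port}")
```

If the ApplicationMaster really reads its first user argument this way, the task parameter is being passed in the position where the driver address is expected. That would point at how the arguments are assembled on submission, not at the network.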