aalkilani / spark-kafka-cassandra-applying-lambda-architecture


Error running batch job with spark-submit #6

Closed kitgary closed 7 years ago

kitgary commented 8 years ago

Hi,

I failed to run the batch job on yarn in demo "Saving to HDFS and Executing on YARN", here's the error log.

16/12/01 11:56:16 INFO yarn.Client: client token: N/A diagnostics: Application application_1480592911767_0001 failed 2 times due to AM Container for appattempt_1480592911767_0001_000002 exited with exitCode: 1 For more detailed output, check application tracking page:http://lambda-pluralsight:8088/cluster/app/application_1480592911767_0001Then, click on links to logs of each attempt.

Diagnostics: Exception from container-launch.

Container id: container_1480592911767_0001_02_000001

Exit code: 1

Stack trace: ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1

Failing this attempt. Failing the application.

ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1480593276668 final status: FAILED tracking URL: http://lambda-pluralsight:8088/cluster/app/application_1480592911767_0001 user: vagrant

16/12/01 11:56:16 WARN yarn.Client: Failed to cleanup staging dir .sparkStaging/application_1480592911767_0001

java.net.ConnectException: Call From lambda-pluralsight/127.0.0.1 to lambda-pluralsight:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) at org.apache.hadoop.ipc.Client.call(Client.java:1480) at org.apache.hadoop.ipc.Client.call(Client.java:1407) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy13.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2113) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424) at org.apache.spark.deploy.yarn.Client.cleanupStagingDir(Client.scala:167) at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:977) at org.apache.spark.deploy.yarn.Client.run(Client.scala:1031) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529) at org.apache.hadoop.ipc.Client.call(Client.java:1446) ... 31 more

Exception in thread "main" org.apache.spark.SparkException: Application application_1480592911767_0001 finished with failed status at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 16/12/01 11:56:17 INFO util.ShutdownHookManager: Shutdown hook called 16/12/01 11:56:17 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9fe8fe45-851a-41ad-8409-3daf17e08a5d

And the container log shows:

java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.

Thanks Gary

aalkilani commented 7 years ago

Can you please make sure you're running the latest version of the image and the git scripts? Both have changed significantly in the last few weeks, so it makes sense to just start there.

Check which version of the vagrant image you have:

vagrant box list

Do you have 0.0.6 or something else?

If you have something older than 0.0.5, then let's get you on the latest box. Here's what to do (it will require some network bandwidth to download the new image). Note that this will permanently destroy your current box, so be sure to back up or take a copy of anything on it first.

Get the latest from git; run from within the spark-kafka-cassandra-applying-lambda-architecture directory:

git pull origin master

Destroy the old box; run from the spark-kafka-cassandra-applying-lambda-architecture/vagrant directory:

vagrant destroy

Get the latest box version; run from the spark-kafka-cassandra-applying-lambda-architecture/vagrant directory:

vagrant box update

Start the new vagrant box:

vagrant up
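Taken together, the steps above can be captured in one place. Here's a sketch, wrapped in a shell function so nothing runs until you invoke it deliberately. The path assumes the repo was cloned under its default name, and `vagrant destroy -f` skips the confirmation prompt, so review before running:

```shell
# Sketch of the recovery sequence described above, wrapped in a function so it
# can be reviewed before running. Assumes the repo was cloned with its default
# name into the current directory.
rebuild_lambda_vm() {
  cd spark-kafka-cassandra-applying-lambda-architecture || return 1

  # Get the latest scripts from git
  git pull origin master || return 1

  cd vagrant || return 1

  # Destroy the old box -- irreversible, back up anything you need first
  vagrant destroy -f

  # Fetch the latest box version, then start it
  vagrant box update
  vagrant up
}

# Invoke explicitly when ready:
# rebuild_lambda_vm
```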

Thanks

kitgary commented 7 years ago

Thanks! I got it working!

I checked that I already had the latest box (0.0.6), but after destroying the old box and starting a new one, everything worked fine. It's kind of weird.

Thanks again.

azzam-krya commented 7 years ago

Hi,

I failed to run the batch job on YARN. I'm already using box v0.0.6, and the error log is:

```
16/12/06 06:02:51 INFO yarn.Client: Application report for application_1480993486657_0008 (state: FAILED) 16/12/06 06:02:51 INFO yarn.Client: client token: N/A diagnostics: Application application_1480993486657_0008 failed 2 times due to AM Container for appattempt_1480993486657_0008_000002 exited with exitCode: 10 For more detailed output, check application tracking page:http://lambda-pluralsight:8088/cluster/app/application_1480993486657_0008Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. Container id: container_1480993486657_0008_02_000001 Exit code: 10 Stack trace: ExitCodeException exitCode=10: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 10 Failing this attempt. Failing the application. ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1481004161092 final status: FAILED tracking URL: http://lambda-pluralsight:8088/cluster/app/application_1480993486657_0008 user: vagrant Exception in thread "main" org.apache.spark.SparkException: Application application_1480993486657_0008 finished with failed status at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 16/12/06 06:02:51 INFO util.ShutdownHookManager: Shutdown hook called 16/12/06 06:02:51 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-08705c9f-0994-4da8-8b48-82eb9282313f
```

aalkilani commented 7 years ago

There isn't enough information in the logs provided here to nail down the cause of the problem. Would you mind checking the fixes.sh file under the vagrant directory? In fixes.sh, look for a section called: # spark-defaults

If you don't see it, then you simply need to update the project from git and do a vagrant reload --provision like so:

git pull origin master
vagrant reload --provision

That should take care of it. The problem this fixes is that the Spark defaults were too high for the very limited resources the VM is working with. That section adds some defaults that should work for everyone; it does so by editing the file /pluralsight/spark/conf/spark-defaults.conf.
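For context, the kind of entries fixes.sh ends up writing into /pluralsight/spark/conf/spark-defaults.conf looks roughly like the sketch below. The specific values here are illustrative assumptions, not the repo's actual numbers; check fixes.sh for the real settings.

```properties
# Illustrative example only -- see fixes.sh for the values the course actually uses.
# Conservative sizes for a memory-constrained single-node VM.
spark.driver.memory       512m
spark.executor.memory     512m
spark.yarn.am.memory      512m
spark.executor.cores      1
```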

aalkilani commented 7 years ago

Closing this issue as the original poster has this resolved now. @azzam-krya, if you're still having problems, ensure you're running the code as the root user. To get root, run the following:

sudo su -

If you have further problems, please open another ticket and kindly provide the following:

  1. Browse to http://lambda-pluralsight:8088/
  2. Click on the "History" link of the application that failed in the rightmost column under Tracking UI
  3. Click on Logs link under Logs column
  4. Click on the stderr link (first link)
  5. That will show a page with a message at the top: "Showing ... bytes. Click here for full log". Go ahead and click the "here" link
  6. Copy all the logs from there and provide them as input for the ticket.

azzam-krya commented 7 years ago

It worked after I restarted the VM. Thanks!

robbie70 commented 6 years ago

I am having a similar problem to this one and raised a ticket for it today, as I didn't see this issue previously.

https://github.com/aalkilani/spark-kafka-cassandra-applying-lambda-architecture/issues/27