lanking520 opened 3 years ago
I'd really appreciate it if you could take a look at this, @alexarchambault. I'm currently working on using Almond on Spark GPU clusters. Almond seems to be a decent tool for helping users start learning Scala.
I've spent some time debugging this. I haven't gotten it to work, but I do see what looks to be at least part of the problem. If you don't set the SPARK_HOME environment variable, then https://github.com/alexarchambault/ammonite-spark/blob/v0.10.1/modules/core/src/main/scala/org/apache/spark/sql/ammonitesparkinternals/SparkDependencies.scala#L127 adds the spark-stubs dependency, which includes the class for fetching classes from the driver. However, https://github.com/alexarchambault/ammonite-spark/blob/v0.10.1/modules/core/src/main/scala/org/apache/spark/sql/ammonitesparkinternals/AmmoniteSparkSessionBuilder.scala#L262 filters out that dependency, so it's never sent to the executors. If I interpret this correctly, this is a bug.
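To illustrate the shape of the bug, here is a hedged sketch (this is not the actual ammonite-spark code, just an illustration of how a filter meant to exclude jars the executors already have can also drop the spark-stubs jar they still need):

```scala
// Hypothetical jar list; the first entry stands in for the spark-stubs
// dependency that SparkDependencies adds when SPARK_HOME is unset.
val sessionJars = Seq(
  "/home/user/.ivy2/jars/spark-stubs_30_2.12-0.10.1.jar",
  "/home/user/.ivy2/jars/some-notebook-dependency.jar"
)

// Stand-in for the predicate at AmmoniteSparkSessionBuilder.scala#L262:
// anything that looks like a Spark artifact is assumed to be provided
// by the Spark distribution on the executors.
def providedBySparkDistribution(jar: String): Boolean =
  jar.contains("/spark-")

// spark-stubs matches the predicate, so it is never shipped to the
// executors, and they fall back to the stock ExecutorClassLoader.
val jarsSentToExecutors = sessionJars.filterNot(providedBySparkDistribution)
println(jarsSentToExecutors) // List(/home/user/.ivy2/jars/some-notebook-dependency.jar)
```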
I do not believe fixing that problem will suffice. I do not know how Spark runs executors on YARN, but in standalone mode the executor process command looks like

```
java -cp $SPARK_HOME/conf:$SPARK_HOME/jars/*:$HIVE_CONF_DIR ...
```

You can find that in the executor's stderr log.
Spark tries to load ExecutorClassLoader from its class loader, but I don't know the search path for that class loader. I know that if you specify a value for the spark.executor.extraClassPath Spark configuration property, it will be prepended to the classpath argument when starting an executor. So the executor command becomes

```
java -cp whatever:I:put:in:spark.executor.extraClassPath:$SPARK_HOME/conf:$SPARK_HOME/jars/*:$HIVE_CONF_DIR ...
```

I tried adding the spark-stubs_3.0 jar file to this classpath, along with ammonite-spark and almond-spark. However, the executor still loaded the standard ExecutorClassLoader.
At this point I need to stop to make progress on other efforts, but I wanted to share my findings in the hope that someone else will make further progress.
@mallman Thanks for your support and testing. I guess standalone mode is not well supported. In particular, it should not be looking for HDFS in the first place; in the above configuration there is no HDFS setup.
@lanking520 I found a workaround. Get spark-stubs_30_2.12-0.10.1.jar from https://search.maven.org/artifact/sh.almond/spark-stubs_30_2.12/0.10.1/jar. Put it somewhere the executors will be able to load it from their filesystem (i.e. on each worker node's filesystem). In my case I'm using NFS, so I put it there and the executors can read the file from NFS. Set the spark.executor.extraClassPath Spark configuration setting to the executor-filesystem location of spark-stubs_30_2.12-0.10.1.jar. Then follow the standard instructions for constructing a SparkSession from a NotebookSparkSession.builder() and try from there. I believe this should work. LMK
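A minimal sketch of what that could look like in a notebook cell, assuming the stubs jar was copied to /opt/shared/spark-stubs_30_2.12-0.10.1.jar (hypothetical path) on every worker, and a standalone master at spark://master-host:7077 (also hypothetical):

```scala
import org.apache.spark.sql._

// Hypothetical master URL and jar path; substitute your own.
val spark = NotebookSparkSession.builder()
  .master("spark://master-host:7077")
  .config(
    "spark.executor.extraClassPath",
    "/opt/shared/spark-stubs_30_2.12-0.10.1.jar"
  )
  .getOrCreate()
```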
@mallman That's awesome. Can I set this up from the notebook, or does it have to be set up outside, when I launch the executor? Since I set up the worker and master on the same machine, I guess only one path is needed.
@lanking520 I don't know how you distribute the spark-stubs jar file; I don't have an automated way of doing that. However, you can do the rest in your notebook.
@mallman Yeah, got it. I probably need to add this to my spark.conf file before I launch the worker node.
@lanking520 Hi, I'm facing the same issue. What was your solution to resolve it? I'm using Almond 0.13.1, Scala 2.12.15, Spark 3.2.2, and Java 1.8.0. I submitted the job to YARN. The driver was initialized, but the executors always die because of the error above. The errors look like this:
```
23/04/21 18:57:18 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.1.30.xxx:35502) with ID 1, ResourceProfileId 0
23/04/21 18:57:18 INFO BlockManagerMasterEndpoint: Registering block manager kdl-dev-xxx-slave-01.xxx.com.vn:50004 with 366.3 MiB RAM, BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 ERROR YarnScheduler: Lost executor 1 on kdl-dev-xxx-slave-01.xxx.com.vn: Unable to create executor due to null
23/04/21 18:57:19 INFO DAGScheduler: Executor lost: 1 (epoch 0)
23/04/21 18:57:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
23/04/21 18:57:19 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
23/04/21 18:57:19 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
23/04/21 18:57:21 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container from a bad node: container_e108_1680686832261_1550_01_000002 on host: kdl-dev-xxx-slave-01.xxx.com.vn. Exit status: 1. Diagnostics: [2023-04-21 18:57:20.678]Exception from container-launch.
Container id: container_e108_1680686832261_1550_01_000002
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is quanta
main : requested yarn user is xxx
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data/hadoop/yarn/local/nmPrivate/application_1680686832261_1550/container_e108_1680686832261_1550_01_000002/container_e108_1680686832261_1550_01_000002.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
[2023-04-21 18:57:20.680]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
bank.com.vn
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50000. Attempting port 50001.
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50001. Attempting port 50002.
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50002. Attempting port 50003.
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50003. Attempting port 50004.
23/04/21 18:57:19 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50004.
23/04/21 18:57:19 INFO netty.NettyBlockTransferService: Server created on kdl-dev-xxx-slave-01.xxx.com.vn:50004
23/04/21 18:57:19 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/04/21 18:57:19 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO executor.Executor: Using REPL class URI: http://10.7.20.xxx:57490
23/04/21 18:57:19 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
23/04/21 18:57:20 ERROR executor.YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to null
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:169)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:68)
... 17 more
23/04/21 18:57:20 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/04/21 18:57:20 INFO memory.MemoryStore: MemoryStore cleared
23/04/21 18:57:20 INFO storage.BlockManager: BlockManager stopped
23/04/21 18:57:20 INFO util.ShutdownHookManager: Shutdown hook called
```
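For what it's worth, the root cause in that stack trace can be reproduced outside Spark: the stock ExecutorClassLoader hands the REPL class URI (http://10.7.20.xxx:57490 above) to Hadoop's FileSystem API, and Hadoop builds with no FileSystem implementation registered for the "http" scheme reject it. A minimal sketch:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Same shape of call that ExecutorClassLoader.scala:68 makes for the REPL
// class URI. On Hadoop builds without an "http" FileSystem registered, this
// throws java.io.IOException: No FileSystem for scheme: http.
val fs = FileSystem.get(new URI("http://10.7.20.xxx:57490"), new Configuration())
```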
Thank you
I am trying to use Spark 3.0 with a local standalone cluster setup. I simply create 1 master and 1 worker locally. However, the job keeps crashing with the issue above. It seems all jars are uploaded to the Spark remote server and Spark tries to fetch them. You can reproduce this by using the above code (and please create a standalone cluster beforehand).
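The snippet referenced above isn't quoted in this thread; as a hypothetical stand-in (assuming a standalone master at spark://localhost:7077), a minimal notebook cell like this should exercise the same executor-side class fetching:

```scala
import org.apache.spark.sql._

// Hypothetical minimal reproduction: connect to the local standalone
// cluster, then run any job whose closure forces the executors to fetch
// REPL-defined classes from the driver, which is where the failure shows up.
val spark = NotebookSparkSession.builder()
  .master("spark://localhost:7077")
  .getOrCreate()

spark.sparkContext.parallelize(1 to 100).map(_ * 2).sum()
```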
#637 seems to have the same issue.