almond-sh / almond

A Scala kernel for Jupyter
https://almond.sh
BSD 3-Clause "New" or "Revised" License

Spark Standalone Cluster issue: [No FileSystem for scheme: http] #695

Open lanking520 opened 3 years ago

lanking520 commented 3 years ago

I am trying to use Spark 3.0 with a local standalone cluster: I simply create one master and one worker locally. However, the job keeps crashing with this error:

20/11/23 15:55:24 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to null
...
Caused by: java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)

It seems all jars are uploaded to the remote Spark server, and Spark then tries to fetch them.

import $ivy.`org.apache.spark:spark-sql_2.12:3.0.0`
import org.apache.spark.sql._

val spark = {
  NotebookSparkSession.builder()
    .master("spark://localhost:7077")
    .getOrCreate()
}
spark.conf.getAll.foreach(pair => println(pair._1 + ":" + pair._2))
def sc = spark.sparkContext
val rdd = sc.parallelize(1 to 100000000, 100)
val n = rdd.map(_ + 1).sum()

You can reproduce this with the code above (please create a standalone cluster beforehand):

curl -O https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar zxvf spark-3.0.0-bin-hadoop2.7.tgz
mv spark-3.0.0-bin-hadoop2.7/ spark
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_INSTANCES=1
./spark/sbin/start-master.sh
./spark/sbin/start-slave.sh spark://localhost:7077

#637 seems to have the same issue.

lanking520 commented 3 years ago

I would really appreciate it if you could take a look, @alexarchambault. I am currently working on using Almond on Spark GPU clusters. Almond seems to be a decent tool to help users start learning Scala.

mallman commented 3 years ago

I've spent some time debugging this. I haven't gotten it to work, but I do see what looks to be at least part of the problem. If you don't set the SPARK_HOME environment variable, then https://github.com/alexarchambault/ammonite-spark/blob/v0.10.1/modules/core/src/main/scala/org/apache/spark/sql/ammonitesparkinternals/SparkDependencies.scala#L127 adds the spark-stubs dependency, which includes the class for fetching classes from the driver. However, https://github.com/alexarchambault/ammonite-spark/blob/v0.10.1/modules/core/src/main/scala/org/apache/spark/sql/ammonitesparkinternals/AmmoniteSparkSessionBuilder.scala#L262 filters out that dependency, so it's never sent to the executors. If I interpret this correctly, this is a bug.
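To make the filtering problem concrete, here is a simplified, hypothetical sketch (not the actual ammonite-spark code; the jar paths and the predicate are illustrative) of how a "jars already provided by SPARK_HOME" filter can also match and drop spark-stubs:

```scala
// Hypothetical sketch: the predicate meant to drop jars that SPARK_HOME
// already provides also matches spark-stubs, so the stubs jar (which holds
// the executor-side class-fetching code) is never shipped to the executors.
val sessionJars = Seq(
  "/cache/org.apache.spark/spark-sql_2.12-3.0.0.jar",  // provided by SPARK_HOME
  "/cache/sh.almond/spark-stubs_30_2.12-0.10.1.jar",   // needed on executors!
  "/cache/user/my-notebook-code.jar"
)

// Illustrative stand-in for the builder's "is this a Spark jar?" check.
def looksLikeSparkJar(jar: String): Boolean = jar.contains("/spark-")

val sentToExecutors = sessionJars.filterNot(looksLikeSparkJar)
println(sentToExecutors.mkString(","))
```

The overly broad predicate leaves only the user jar, which would explain why the executors cannot find the stubs' filesystem class.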

I do not believe fixing that problem will suffice. I do not know how Spark runs executors on YARN, but in standalone mode the executor process command looks like

java -cp $SPARK_HOME/conf:$SPARK_HOME/jars/*:$HIVE_CONF_DIR ...

You can find that in the executor's stderr log.

Spark tries to load ExecutorClassLoader from its class loader, but I don't know the search path for that class loader. I know that if you specify a value for the spark.executor.extraClassPath spark configuration property it will be prepended to the classpath argument when starting an executor. So the executor command becomes

java -cp whatever:I:put:in:spark.executor.extraClassPath:$SPARK_HOME/conf:$SPARK_HOME/jars/*:$HIVE_CONF_DIR ...

I tried adding the spark-stubs_3.0 jar file to this classpath, along with ammonite-spark and almond-spark. However, the executor still loaded the standard ExecutorClassLoader.

At this point I need to stop to make progress on other efforts, but I wanted to share my findings in the hope that someone else will make further progress.

lanking520 commented 3 years ago

@mallman Thanks for your support and testing. I guess standalone mode is not well supported. In particular, it should not look for HDFS in the first place: the configuration above has no HDFS setup.

mallman commented 3 years ago

@lanking520 I found a workaround:

1. Get spark-stubs_30_2.12-0.10.1.jar from https://search.maven.org/artifact/sh.almond/spark-stubs_30_2.12/0.10.1/jar.
2. Put it somewhere the executors can load it from their filesystem (i.e. on each worker node's filesystem). In my case I put it on NFS, which the executors can read.
3. Set the spark.executor.extraClassPath spark configuration setting to that location on the executor filesystem.
4. Follow the standard instructions for constructing a SparkSession from a NotebookSparkSession.builder() and try from there.

I believe this should work. LMK
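In notebook form, the workaround might look like the following sketch (this only runs inside an almond kernel; the jar path is illustrative and depends on where you placed the stubs jar on the workers):

```scala
import $ivy.`org.apache.spark:spark-sql_2.12:3.0.0`
import org.apache.spark.sql._

// Illustrative path where the spark-stubs jar is visible to every worker
// node (e.g. an NFS mount); adjust to your setup.
val stubsJar = "/shared/jars/spark-stubs_30_2.12-0.10.1.jar"

val spark = NotebookSparkSession.builder()
  .master("spark://localhost:7077")
  .config("spark.executor.extraClassPath", stubsJar)
  .getOrCreate()
```

spark.executor.extraClassPath is prepended to the executor's classpath at launch, so the stubs classes should be found without being fetched over http.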

lanking520 commented 3 years ago

@mallman That's awesome. Can I set this up from the notebook, or does it have to be set up outside, when I launch the executor? Since I set up the worker and master on the same machine, I guess only one path is needed.

mallman commented 3 years ago

@lanking520 I don't know how you distribute the spark-stubs jar file; I don't have an automated way of doing that. However, you can do the rest in your notebook.

lanking520 commented 3 years ago

@mallman Yeah, got it. I probably need to add this to my spark.conf file before I launch the worker node.
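For reference, in a spark-defaults.conf file that would look something like this (the jar path is illustrative and should match wherever the stubs jar lives on the worker's filesystem):

```properties
# Prepended to the executor launch classpath; path must exist on each worker.
spark.executor.extraClassPath  /shared/jars/spark-stubs_30_2.12-0.10.1.jar
```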

AnhQuanTran commented 1 year ago

@lanking520 Hi, I am facing the same issue. What was your solution? I use almond 0.13.1, Scala 2.12.15, Spark 3.2.2, Java 1.8.0. I submitted the job to YARN. The driver was initialized, but the executors always die with the error above. The errors look like this:

23/04/21 18:57:18 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.1.30.xxx:35502) with ID 1,  ResourceProfileId 0
23/04/21 18:57:18 INFO BlockManagerMasterEndpoint: Registering block manager kdl-dev-xxx-slave-01.xxx.com.vn:50004 with 366.3 MiB RAM, BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 ERROR YarnScheduler: Lost executor 1 on kdl-dev-xxx-slave-01.xxx.com.vn: Unable to create executor due to null
23/04/21 18:57:19 INFO DAGScheduler: Executor lost: 1 (epoch 0)
23/04/21 18:57:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
23/04/21 18:57:19 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
23/04/21 18:57:19 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
23/04/21 18:57:21 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container from a bad node: container_e108_1680686832261_1550_01_000002 on host: kdl-dev-xxx-slave-01.xxx.com.vn. Exit status: 1. Diagnostics: [2023-04-21 18:57:20.678]Exception from container-launch.
Container id: container_e108_1680686832261_1550_01_000002
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is quanta
main : requested yarn user is xxx
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data/hadoop/yarn/local/nmPrivate/application_1680686832261_1550/container_e108_1680686832261_1550_01_000002/container_e108_1680686832261_1550_01_000002.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...

[2023-04-21 18:57:20.680]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
bank.com.vn
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50000. Attempting port 50001.
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50001. Attempting port 50002.
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50002. Attempting port 50003.
23/04/21 18:57:19 WARN util.Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 50003. Attempting port 50004.
23/04/21 18:57:19 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50004.
23/04/21 18:57:19 INFO netty.NettyBlockTransferService: Server created on kdl-dev-xxx-slave-01.xxx.com.vn:50004
23/04/21 18:57:19 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/04/21 18:57:19 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(1, kdl-dev-xxx-slave-01.xxx.com.vn, 50004, None)
23/04/21 18:57:19 INFO executor.Executor: Using REPL class URI: http://10.7.20.xxx:57490
23/04/21 18:57:19 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
23/04/21 18:57:20 ERROR executor.YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to null
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:169)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:68)
    ... 17 more
23/04/21 18:57:20 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/04/21 18:57:20 INFO memory.MemoryStore: MemoryStore cleared
23/04/21 18:57:20 INFO storage.BlockManager: BlockManager stopped
23/04/21 18:57:20 INFO util.ShutdownHookManager: Shutdown hook called

Thank you