JoshRosen opened this issue 9 years ago
One cause of this error is network connectivity issues between the master and driver, so maybe we should also add a note to https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/connectivity_issues.md
Just got bitten by an EXECUTOR_MEMORY environment variable that someone else had set to a large value. I was using spark-submit with --executor-memory 3G, but the env var took precedence.
Do you think the explicit argument should take precedence?
Hi @jkleckner,
We've deprecated most environment variables in favor of the newer configuration mechanisms, so system properties and SparkSubmit / SparkConf settings are intended to take precedence over environment variables. Which version of Spark are you using? Do you have a simple reproduction for this issue? If so, do you mind filing a JIRA ticket and linking it here? https://issues.apache.org/jira/browse/SPARK
Sorry, I found that someone else had explicitly programmed environment vars to override config values....
You mean in your own application / user-code, you have code that reads from the environment variable and uses it to set the corresponding SparkConf setting, or something like that?
Yes, in our code someone intentionally made it work that way. Obviously I will be changing that. So, to quote Emily Litella, never mind...
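For anyone who hits the same thing, a minimal sketch of the kind of user code that can cause this (the app name and env-var handling here are hypothetical, not the actual code in question):

```scala
import org.apache.spark.SparkConf

// Hypothetical user code: reading an environment variable and writing it back
// into SparkConf silently overrides whatever --executor-memory was passed to
// spark-submit, which is what bit us here.
val conf = new SparkConf().setAppName("MyApp")
sys.env.get("EXECUTOR_MEMORY").foreach { mem =>
  conf.set("spark.executor.memory", mem) // the env var wins over the CLI flag
}
```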
I have been facing this error for 4 days and no one seems to be able to figure out a fix for it. Could you please suggest something? I reduced my input data size from 1TB to 1GB to 10 simple records and still get the same error, which makes me believe that this error occurs at request time and not at execution time.
@deepujain, if you are using YARN, bring up the Applications page (states NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING) by browsing to the master node on port 9026 (what AWS EMR uses, but it can vary), as in http://127.0.0.1:9026/cluster. Examine the nodes and the queues to see if there is an old zombie application around. If so, kill it with:

yarn application -list
yarn application -kill <jobid>

Some situations can lead to old jobs hanging around and using up resources.
Thanks so much for opening this issue!
I was having issues setting up a Spark-on-Mesos dev environment for the last few days and had made zero headway until I set spark.mesos.coarse to true and then lowered spark.executor.memory below the default 512 value (running on m1.smalls on EC2 here). Couldn't even finish running /bin/run-example SparkPi 10 and was ready to give up until I saw this.
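A rough sketch of those settings expressed in SparkConf, with illustrative values for small EC2 instances (tune to your own cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; adjust executor memory to your instance size.
val conf = new SparkConf()
  .setAppName("SparkPiExample")
  .set("spark.mesos.coarse", "true")     // coarse-grained Mesos mode
  .set("spark.executor.memory", "256m")  // below the 512m default mentioned above
val sc = new SparkContext(conf)
```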
This is excellent. I actually had a zombie Mesos Spark app, killed that, and now I am back in business--well done, guys!
@hokiegeek2 glad you found it.
Recently I found that Spark jobs could hang because exceptions didn't propagate up to an exit, so I added this snippet. Now the testing process doesn't leave bodies strewn across the cluster...

```scala
try {
  Foo.runAnalysis(sc, debug = true)
} catch {
  case e: Exception =>
    println(e)   // surface the failure instead of letting the job hang
    sc.stop()    // release the application's cluster resources
    sys.exit(1)  // exit non-zero so the failure is visible to the caller
}
```
+1
+1
@rodriguezsergio I now have the same issue with Spark on Mesos. When I create a task I get:

16/01/22 15:31:25 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

My driver runs only on the master node, and so does the executor.
I have a similar problem.
When I run code in the spark-shell, it works just fine.
However, similar code written in Eclipse and then deployed to the Spark master fails (no resources are assigned).
I've posted a Stack Overflow question about this.
Thanks
Does it not allow multiple applications to run in parallel? After I exit one, the problem disappears.
> @deepujain, if you are using YARN, bring up the Applications page (states NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING) by browsing to the master node on port 9026 (what AWS EMR uses, but it can vary), as in http://127.0.0.1:9026/cluster. Examine the nodes and the queues to see if there is an old zombie application around. If so, kill it with `yarn application -list` / `yarn application -kill`. Some situations can lead to old jobs hanging around and using up resources.
I'm using YARN and in my case the cluster is idle with all resources free to be assigned...
We use Azkaban to enqueue a long list of processes as EMR Steps (each with one spark-submit job). I launch my queue one day, and the next day when I return to the office I find that some of the jobs have completed and one of them is stuck, having waited 15 hours to receive resources from YARN. There is no other YARN process going on at the moment, and all jobs request the same amount of resources.
Then I kill the queue, relaunch it, and the same job that had been waiting runs without problems...
Any ideas?
You can check your cluster's worker node cores; your application can't exceed that. For example, you have two worker nodes, each with 4 cores, and 2 applications to run, so you can give each application 4 cores. You can set this in the code: `SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan").set("spark.cores.max","4");`
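For reference, roughly the same idea in Scala (the app name and value are illustrative, taken from the example above):

```scala
import org.apache.spark.SparkConf

// Cap each application at 4 cores so that two applications can share
// a cluster of 2 worker nodes x 4 cores.
val sparkConf = new SparkConf()
  .setAppName("JianSheJieDuan")
  .set("spark.cores.max", "4")
```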
Thanks @iwwenbo. When this happens, there are enough memory and cores for the task. We have determined that the problem is triggered by an exception in the worker container that Spark is unable to recover from. This is the stack trace:
```
16/06/16 13:58:53 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@10.0.4.161:36230/user/CoarseGrainedScheduler
java.lang.NullPointerException
at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
at java.lang.String.valueOf(String.java:2994)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```

This error prevents the worker from registering with the driver, and the job stalls. We're running everything on a single node.
@xuedihualu I have the same problem. Have you solved it?
@ggalmazor I have the same problem. It seems just one slave machine out of 4 does the entire job. I don't know where this happens. When I took that particular machine out (shut it down), I couldn't even run the shell (pyspark --master yarn-client). How did you fix this?
@fchgithub we haven't solved it yet. We are currently running a crontab'ed script that detects these failures and forces the termination of the YARN applications.
+1
Run into the same problem, any solutions yet?
@ToniYang Hi, it's just a lack of available memory!
My problem was caused by confusion when starting up Spark. When I start it in master mode, I also need to start at least one slave (by running sbin/start-slave.sh) so that a worker is available to provide the CPU cores and memory resources; otherwise this error appears.
For each worker I assigned 4 CPU cores (by exporting SPARK_WORKER_CORES in conf/spark-env.sh) and 10g of memory (SPARK_WORKER_MEMORY), and everything is OK.
Just for reference.
Similar issue. I have sufficient resources (cores and memory), but the resource manager (YARN) is not able to execute my job. I suspect it is due to the worker not being registered.
Hi, I am kind of facing the same issue. I am deploying prediction.io on a multi-node cluster where training should happen on the worker node. The worker node has been successfully registered with the master.
The following are the logs after starting slaves.sh:
```
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT
18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.
18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM
18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1
18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6
18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081
18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...
18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)
18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077
```
Now the issue:
I have assigned 24 GB of RAM and 8 cores to the worker.
However, when I start the process, the following are the logs I get on the slave machine:
```
18/05/22 06:16:00 INFO Worker: Asked to launch executor app-20180522061600-0001/0 for PredictionIO Training: com.actionml.RecommendationEngine
18/05/22 06:16:00 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:16:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:16:00 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=45049" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.31.5.119:45049" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522061600-0001" "--worker-url" "spark://Worker@172.31.6.235:45057"
18/05/22 06:16:50 INFO Worker: Asked to kill executor app-20180522061600-0001/0
18/05/22 06:16:50 INFO ExecutorRunner: Runner thread for executor app-20180522061600-0001/0 interrupted
18/05/22 06:16:50 INFO ExecutorRunner: Killing process!
18/05/22 06:16:51 INFO Worker: Executor app-20180522061600-0001/0 finished with state KILLED exitStatus 143
18/05/22 06:16:51 INFO Worker: Cleaning up local directories for application app-20180522061600-0001
18/05/22 06:16:51 INFO ExternalShuffleBlockResolver: Application app-20180522061600-0001 removed, cleanupLocalDirs = true
```
Can somebody help me debug the issue? Thanks!
+1
Any update?
I submit 2 Spark jobs on a cluster with two workers, each with 4 CPUs and 14GB of memory.
My config: driver.memory=1GB, executor.memory=8GB, executor.cores=2, executor.instances=1.
It's weird that sometimes the two jobs can run concurrently, but sometimes one job fails with "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory".
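A back-of-the-envelope check of the memory math above, assuming roughly 10% executor memory overhead (the actual overhead is governed by spark.executor.memoryOverhead and may differ):

```scala
// Illustrative arithmetic only.
val workerMemGb        = 14.0
val requestedMemGb     = 8.0
val withOverheadGb     = requestedMemGb * 1.10                 // assumed ~10% overhead
val executorsPerWorker = (workerMemGb / withOverheadGb).toInt  // = 1

// Only one 8 GB executor fits per 14 GB worker, so the two single-executor jobs
// can run concurrently only when their executors land on different workers;
// otherwise the second job waits and eventually logs
// "Initial job has not accepted any resources".
```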
+1
+1
+1
Similar issue with YARN federation. Any ideas?
If you are using CDH, check the job's AM log. Maybe you forgot to configure the Spark shuffle jar and the aux-services setting in yarn-site.xml:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
```
I am running my job in standalone mode with 1 master and 2 slaves, facing the same problem, and the job never completes: https://stackoverflow.com/questions/61738212/spark-job-running-for-long-for-too-small-data/61738941#61738941. Is there any resolution to this? I am using Spark 2.3.0 with Hadoop 2.7.
same issue on v3
We should create an article for "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory", since that seems to be a common issue.