databricks / spark-knowledgebase

Spark Knowledge Base

Create article for "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" #9

Open JoshRosen opened 9 years ago

JoshRosen commented 9 years ago

We should create an article for "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory", since that seems to be a common issue.

JoshRosen commented 9 years ago

One cause of this error is network connectivity issues between the master and driver, so maybe we should also add a note to https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/connectivity_issues.md

jkleckner commented 9 years ago

I just got bitten by an EXECUTOR_MEMORY environment variable that another person had set to a large value. I was using spark-submit with --executor-memory 3G, but the env var took precedence.

Do you think the explicit argument should take precedence?

JoshRosen commented 9 years ago

Hi @jkleckner,

We've deprecated most environment variables in favor of the newer configuration mechanisms, so system properties and SparkSubmit / SparkConf settings are intended to take precedence over environment variables. Which version of Spark are you using? Do you have a simple reproduction for this issue? If so, do you mind filing a JIRA ticket and linking it here? https://issues.apache.org/jira/browse/SPARK
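For illustration, a minimal sketch of the intended precedence (the class name and jar are placeholders, not anything from this thread): the explicit setting below should win over any inherited SPARK_EXECUTOR_MEMORY-style environment variable.

    # Explicit submit-time setting; intended to take precedence over legacy
    # environment variables (com.example.MyApp and my-app.jar are placeholders).
    ./bin/spark-submit \
      --class com.example.MyApp \
      --executor-memory 3G \
      my-app.jar

    # Equivalent property form:
    ./bin/spark-submit --conf spark.executor.memory=3g --class com.example.MyApp my-app.jar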

jkleckner commented 9 years ago

Sorry, I found that someone else had explicitly programmed environment vars to override config values....

JoshRosen commented 9 years ago

> Sorry, I found that someone else had explicitly programmed environment vars to override config values....

You mean in your own application / user-code, you have code that reads from the environment variable and uses it to set the corresponding SparkConf setting, or something like that?

jkleckner commented 9 years ago

Yes, in our programming someone intentionally made it work that way. Obviously I will be changing that. So to quote Emily Litella, never mind...

deepujain commented 9 years ago

I have been facing this error for 4 days and no one seems able to figure out a fix for it. Could you please suggest something? I reduced my input data size from 1 TB to 1 GB to 10 simple records and still get the same error, which makes me believe the error occurs at request time and not at execution time.

jkleckner commented 9 years ago

@deepujain, if you are using YARN, bring up the applications page (listing NEW, NEW_SAVING, SUBMITTED, ACCEPTED, and RUNNING applications) by browsing to port 9026 on the master node (the port AWS EMR uses, though it can vary), e.g. http://127.0.0.1:9026/cluster. Examine the nodes and the queues to see if an old zombie application is hanging around. If so, kill it with:

yarn application -list
yarn application -kill <jobid>

Some situations can lead to old jobs hanging around and using up resources.
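If the application list is long, filtering by state can help spot applications that are still holding (or waiting for) resources; a small sketch using the standard YARN CLI flag:

    # Show only applications that could still be occupying the cluster.
    yarn application -list -appStates RUNNING,ACCEPTED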

rodriguezsergio commented 9 years ago

Thanks so much for opening this issue!

I had been struggling to set up a Spark-on-Mesos dev environment for the last few days and had made zero headway until I set spark.mesos.coarse to true and then lowered spark.executor.memory below the default 512 MB (I'm running on m1.smalls on EC2 here). I couldn't even finish running /bin/run-example SparkPi 10 and was ready to give up until I saw this.
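For reference, a minimal sketch of that setup passed at submit time (the Mesos master URL, the example jar path, and the 400 MB value are assumptions, not taken from this thread):

    # Coarse-grained Mesos mode with a smaller executor heap for m1.small-class nodes.
    ./bin/spark-submit \
      --master mesos://<mesos-master>:5050 \
      --conf spark.mesos.coarse=true \
      --conf spark.executor.memory=400m \
      --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples.jar 10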

hokiegeek2 commented 9 years ago

This is excellent. I actually had a zombie Mesos Spark app, killed that, and now I am back in business--well done, guys!

jkleckner commented 9 years ago

@hokiegeek2 glad you found it.

Recently I found that Spark jobs could hang because exceptions didn't propagate up to an exit, so I added this snippet. Now the testing process doesn't leave bodies strewn around the cluster...

    try {
      Foo.runAnalysis(sc, debug = true)
    } catch {
      case e: Exception =>
        // Report the failure, stop the SparkContext, and exit with a
        // non-zero status so the job does not linger in the cluster.
        println(e)
        sc.stop()
        sys.exit(1)
    }

gdubicki commented 9 years ago

+1

zheolong commented 8 years ago

+1

xuedihualu commented 8 years ago

@rodriguezsergio I now have the same issue with Spark on Mesos. When I create a task, I get:

    16/01/22 15:31:25 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

My driver runs only on the master node, and so does the executor.

vinesinha commented 8 years ago

I have a similar problem.

When I run code in the spark-shell, it works just fine.

However, similar code written in Eclipse and then deployed to the Spark master fails (no resources are assigned).

I've posted a Stack Overflow question about this.

Thanks

ghost commented 8 years ago

Does it not allow multiple applications to run in parallel? After I exit one, the problem disappears.

ggalmazor commented 8 years ago

> @deepujain, if you are using YARN, bring up the applications page (listing NEW, NEW_SAVING, SUBMITTED, ACCEPTED, and RUNNING applications) by browsing to port 9026 on the master node (the port AWS EMR uses, though it can vary), e.g. http://127.0.0.1:9026/cluster. Examine the nodes and the queues to see if an old zombie application is hanging around. If so, kill it with `yarn application -list` and `yarn application -kill <jobid>`. Some situations can lead to old jobs hanging around and using up resources.

I'm using YARN and in my case the cluster is idle with all resources free to be assigned...

We use Azkaban to enqueue a long list of processes as EMR steps (each with one spark-submit job). I launch my queue one day and, when I return to the office the next day, I find that some of the jobs have completed and one of them has been stopped, waiting for 15 hours to receive resources from YARN. There is no other YARN process going on at the moment, and all jobs request the same amount of resources.

Then I kill the queue, relaunch it, and the same job that had been waiting runs without problems...

Any ideas?

yomige commented 8 years ago

Check your cluster's worker node cores; your application can't exceed that. For example, say you have two worker nodes, each with 4 cores, and 2 applications to run. Then you can cap each application at 4 cores. You can set it like this in the code: SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan").set("spark.cores.max", "4");
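The same cap can also be applied at submit time instead of in code (a sketch; the class and jar names are placeholders):

    ./bin/spark-submit --conf spark.cores.max=4 --class com.example.MyApp my-app.jar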

ggalmazor commented 8 years ago

Thanks @iwwenbo. When this happens, there is enough memory and enough cores for the task. We have determined that the problem is triggered by an exception in the worker container that Spark is unable to recover from. This is the stack trace:

16/06/16 13:58:53 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@10.0.4.161:36230/user/CoarseGrainedScheduler
java.lang.NullPointerException
    at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
    at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
    at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
    at java.lang.String.valueOf(String.java:2994)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
    at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
    at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
    at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

This error prevents the worker from registering with the driver, and it stalls. We're running everything on a single node.

ToniYang commented 8 years ago

@xuedihualu I have the same problem. Have you solved it?

fchgithub commented 8 years ago

@ggalmazor I have the same problem. It seems just one slave machine out of 4 does the entire job, and I don't know where this happens. When I took that particular machine out (shut it down), I couldn't even run the shell (pyspark --master yarn-client). How did you fix this?

ggalmazor commented 8 years ago

@fchgithub we haven't solved it yet. We are currently running a crontab'ed script that detects these failures and forces the termination of the YARN applications.
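For anyone curious, a rough sketch of that kind of watchdog (not the exact script; the one-hour threshold and the parsing of the YARN CLI output are assumptions):

    #!/usr/bin/env bash
    # Kill YARN applications stuck in ACCEPTED (i.e. never granted resources).
    now_ms=$(($(date +%s) * 1000))
    max_wait_ms=$((60 * 60 * 1000))   # 1 hour; adjust to taste

    yarn application -list -appStates ACCEPTED 2>/dev/null |
      awk '$1 ~ /^application_/ {print $1}' |
      while read -r app_id; do
        # The application report includes a Start-Time field in epoch milliseconds.
        start_ms=$(yarn application -status "$app_id" 2>/dev/null |
                   awk -F' : ' '/Start-Time/ {print $2}')
        if [ -n "$start_ms" ] && [ $((now_ms - start_ms)) -gt "$max_wait_ms" ]; then
          echo "Killing stuck application $app_id"
          yarn application -kill "$app_id"
        fi
      done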

clrke commented 8 years ago

+1

nvdhaider commented 8 years ago

Ran into the same problem; any solutions yet?

xuedihualu commented 8 years ago

@ToniYang Hi, it was just a shortage of available memory!

alexwwang commented 8 years ago

My problem was caused by confusion about starting up Spark. When I start it in master mode, I should also start at least one slave (by running sbin/start-slave.sh) so that a worker exists to provide CPU cores and memory; otherwise this error occurs.

For each worker I assigned 4 CPU cores (by exporting SPARK_WORKER_CORES in conf/spark-env.sh) and 10 GB of memory (SPARK_WORKER_MEMORY), and everything's OK. See the sketch below.

Just for reference.
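For reference, a minimal sketch of that setup (the master URL is an assumption; in newer Spark releases start-slave.sh has been renamed start-worker.sh):

    # conf/spark-env.sh on each worker host
    export SPARK_WORKER_CORES=4
    export SPARK_WORKER_MEMORY=10g

    # then start a worker pointed at the master
    ./sbin/start-slave.sh spark://<master-host>:7077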

harishmaiya commented 7 years ago

Similar issue. I have sufficient resources (cores and memory) but the resource manager (YARN) is not able to execute my job. I suspect it is because a worker is not being registered.

umesh1989 commented 6 years ago

Hi, I am facing kind of the same issue. I am deploying PredictionIO on a multi-node cluster where training should happen on the worker node. The worker node has been successfully registered with the master.

Following are the logs after starting slaves.sh:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235
    18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM
    18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP
    18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT
    18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu
    18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu
    18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:
    18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:
    18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
    18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.
    18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM
    18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1
    18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6
    18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
    18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081
    18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...
    18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)
    18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077

Now the issues:

  1. If I launch one slave on the master node and one slave on my other node:
     1.1. If the slave on the master node is given fewer resources, it gives an "unable to re-shuffle" error.
     1.2. If I give more resources to the worker on the master node, all execution happens on the master node; nothing is sent to the slave node.
  2. If I do not start a slave on the master node:
     2.1. I get the following error: [WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have assigned 24gb ram to the worker and 8 cores.

However, when I start the process, the following are the logs I get on the slave machine:

    18/05/22 06:16:00 INFO Worker: Asked to launch executor app-20180522061600-0001/0 for PredictionIO Training: com.actionml.RecommendationEngine
    18/05/22 06:16:00 INFO SecurityManager: Changing view acls to: ubuntu
    18/05/22 06:16:00 INFO SecurityManager: Changing modify acls to: ubuntu
    18/05/22 06:16:00 INFO SecurityManager: Changing view acls groups to:
    18/05/22 06:16:00 INFO SecurityManager: Changing modify acls groups to:
    18/05/22 06:16:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
    18/05/22 06:16:00 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=45049" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.31.5.119:45049" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522061600-0001" "--worker-url" "spark://Worker@172.31.6.235:45057"
    18/05/22 06:16:50 INFO Worker: Asked to kill executor app-20180522061600-0001/0
    18/05/22 06:16:50 INFO ExecutorRunner: Runner thread for executor app-20180522061600-0001/0 interrupted
    18/05/22 06:16:50 INFO ExecutorRunner: Killing process!
    18/05/22 06:16:51 INFO Worker: Executor app-20180522061600-0001/0 finished with state KILLED exitStatus 143
    18/05/22 06:16:51 INFO Worker: Cleaning up local directories for application app-20180522061600-0001
    18/05/22 06:16:51 INFO ExternalShuffleBlockResolver: Application app-20180522061600-0001 removed, cleanupLocalDirs = true

Can somebody help me debug the issue? Thanks!

namangt68 commented 5 years ago

+1

nianglao commented 5 years ago

Any update?

I submit 2 Spark jobs on a cluster with two workers, each with 4 CPUs and 14 GB of memory.

My config: driver.memory=1GB, executor.memory=8GB, executor.cores=2, executor.instances=1.

It's weird that sometimes the two jobs can run concurrently, but sometimes one job fails with "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory".

sushantway commented 5 years ago

+1

mohsinbanedar commented 5 years ago

+1

yanwerneck commented 5 years ago

+1

hunshenshi commented 5 years ago

Similar issue with YARN federation. Any ideas?

hereTac commented 4 years ago

If you are using CDH, check a job's AM log. Maybe you forgot to configure the Spark shuffle service in yarn-site.xml:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>spark_shuffle,mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>

vaibhavk1992 commented 4 years ago

I am running my job in standalone mode with 1 master and 2 slaves, facing the same problem, and the job never completes: https://stackoverflow.com/questions/61738212/spark-job-running-for-long-for-too-small-data/61738941#61738941. Is there any resolution to this? I am using Spark 2.3.0 with Hadoop 2.7.

manoadamro commented 4 years ago

> I am running my job in standalone mode with 1 master and 2 slaves, facing the same problem, and the job never completes: https://stackoverflow.com/questions/61738212/spark-job-running-for-long-for-too-small-data/61738941#61738941. Is there any resolution to this? I am using Spark 2.3.0 with Hadoop 2.7.

Same issue on v3.