
Spark Knowledge Base

Spark Job submitted - Waiting (TaskSchedulerImpl : Initial job not accepted) #12

Open ChaiBapchya opened 8 years ago

ChaiBapchya commented 8 years ago

An API call was made to submit the job; the response states that it is Running.

On the cluster UI:

Worker (slave) - worker-20160712083825-172.31.17.189-59433 is Alive

Cores: 1 out of 2 used

Memory: 1 GB out of 6 GB used

Running Application

app-20160713130056-0020 - WAITING for 5 hours

Cores - unlimited

Job Description of the Application

Active Stage

reduceByKey at /root/wordcount.py:23

Pending Stage

takeOrdered at /root/wordcount.py:26

Running Driver -

stderr log page for driver-20160713130051-0025

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

According to "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", the usual cause is that the slaves haven't been started, and hence the cluster has no resources to offer.

However, in my case Slave 1 is alive and working.

According to "Unable to Execute More than a spark Job - Initial job has not accepted any resources", I am using deploy-mode = cluster (not client), since I have 1 master and 1 slave and the submit API is being called via Postman / from anywhere.

Also, the cluster has cores and memory (RAM) available - still the job throws the error, as shown on the UI.

According to "TaskSchedulerImpl: Initial job has not accepted any resources;", I set the following in

~/spark-1.5.0/conf/spark-env.sh

Spark Environment Variables

SPARK_WORKER_INSTANCES=1
SPARK_WORKER_MEMORY=1000m
SPARK_WORKER_CORES=2

Replicated those across the Slaves

sudo /root/spark-ec2/copy-dir /root/spark/conf/spark-env.sh

All the cases in the answer to the question above were applicable, but still no solution was found. Since I am working with the submit API and Apache Spark, maybe some other assistance is required.
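
One thing worth pinning down before digging further: which memory figure the worker actually advertises. The cluster UI above shows 6 GB, but spark-env.sh sets SPARK_WORKER_MEMORY=1000m, and in standalone mode an executor is only launched on a worker whose free memory is at least spark.executor.memory (default 1g = 1024 MB). In cluster deploy mode the driver also takes its share of the same worker (1 core and ~1 GB here), so if the worker really registers with only 1000 MB, the default executor request can never be satisfied and the application sits in WAITING with exactly this warning. A minimal sketch (standard Spark properties; the values are illustrative, not taken from the original report) that shrinks the request so it fits:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MyApp")
        .setMaster("spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077")
        # Request less than the worker advertises (SPARK_WORKER_MEMORY=1000m above);
        # the default of 1g would never fit inside 1000 MB.
        .set("spark.executor.memory", "512m")
        # Cap the total cores this application may claim, so one waiting app
        # cannot starve the next submission.
        .set("spark.cores.max", "2"))
sc = SparkContext(conf=conf)

The same two properties can also be set in spark-defaults.conf on the machine the driver runs on.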

Wordcount.py - My PySpark application code -
import time, re
from pyspark import SparkContext, SparkConf

def linesToWordsFunc(line):
    wordsList = line.split()
    wordsList = [re.sub(r'\W+', '', word) for word in wordsList]
    filtered = filter(lambda word: re.match(r'\w+', word), wordsList)
    return filtered

def wordsToPairsFunc(word):
    return (word, 1)

def reduceToCount(a, b):
    return (a + b)

def main():
    conf = SparkConf().setAppName("MyApp").setMaster("spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile("/user/root/In/a.txt")

    words = rdd.flatMap(linesToWordsFunc)
    pairs = words.map(wordsToPairsFunc)
    counts = pairs.reduceByKey(reduceToCount)

    # Get the first top 100 words
    output = counts.takeOrdered(100, lambda (k, v): -v)

    for(word, count) in output:
        print word + ': ' + str(count)

    sc.stop()

if __name__ == "__main__":
    main()
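
A side note on the listing above, for anyone copying it today: lambda (k, v): -v and the bare print statement are Python 2 only syntax (fine for the Python that shipped with the spark-ec2 AMIs of that era, but a SyntaxError under Python 3). A Python 3 compatible version of the last two steps would look roughly like this:

# Python 3: no tuple unpacking in lambda parameters, and print is a function
output = counts.takeOrdered(100, key=lambda kv: -kv[1])
for word, count in output:
    print(word + ': ' + str(count))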
ChaiBapchya commented 8 years ago

Tried running different code - a simple Python application - and the error still persists.

from pyspark import SparkContext, SparkConf

logFile = "/user/root/In/a.txt" 

conf = (SparkConf().set("num-executors", "1")) 

sc = SparkContext(master = "spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077", appName = "MyApp", conf = conf) 

textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("/user/root/In/output.txt")
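
One thing to double-check in the snippet above: "num-executors" is not a SparkConf property, so that .set() call is silently ignored (the related property, spark.executor.instances, mainly applies on YARN). On a standalone master the executor footprint is controlled by memory and core settings instead; a rough sketch, with illustrative values:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MyApp")
        # Standalone mode: size the executor rather than counting executors.
        .set("spark.executor.memory", "512m")   # must fit in the worker's free memory
        .set("spark.cores.max", "1"))           # total cores this application may claim
sc = SparkContext(master="spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077", conf=conf)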
hoolatech commented 7 years ago

I had the same issue.

// Imports this fragment needs (MyRegistrator is my own Kryo registrator,
// and `logger` comes from the project's logging setup):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import scala.math.random

val sparkConf = new SparkConf()
  .setAppName("Spark Pi")
  .setMaster("spark://10.100.103.25:7077")
  //.setMaster("local[*]")
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
  .set("spark.executor.memory", "3g")

lazy val sc = new SparkContext(sparkConf)

def main(args: Array[String]) {
  val slices = 2
  val n = math.min(10000L * slices, Int.MaxValue).toInt // avoid overflow
  val count = sc.parallelize(1 until n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x * x + y * y <= 1) 1 else 0
  }.reduce(_ + _)
  val pi = 4.0 * count / (n - 1)
  logger.warn(s"Pi is roughly $pi")
}
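
A note on the snippet above: spark.executor.memory is set to 3g, so the same WAITING behaviour will appear if no worker has 3 GB (plus at least one core) free. One quick way to see what the workers are actually offering, besides the web UI, is the standalone master's JSON status page (served at :8080/json by the Spark versions discussed in this thread; field names below are as that endpoint returns them, so worth verifying against your version). A small sketch, with the master host taken from the setMaster call above:

import json
from urllib.request import urlopen

# JSON status page of the standalone master web UI (default port 8080).
resp = urlopen("http://10.100.103.25:8080/json")
status = json.loads(resp.read().decode("utf-8"))

for w in status["workers"]:
    free_cores = w["cores"] - w["coresused"]
    free_mem_mb = w["memory"] - w["memoryused"]
    print(w["host"], w["state"], free_cores, "cores free,", free_mem_mb, "MB free")

# The application stays WAITING if no single ALIVE worker has at least
# spark.executor.memory (3g here) and one core free.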
jingkang99 commented 7 years ago

In my case it was caused by firewall policies: stop the firewall and restart all the services. If the master and slave are already started and you only stop the firewall, it still doesn't work - that tripped me up.

./sbin/stop-all.sh
iptables -F
./sbin/start-all.sh
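
If flushing every firewall rule with iptables -F is too heavy-handed for your environment, an alternative (a sketch only - the port numbers are arbitrary free ports, not values mandated by Spark) is to pin the ports that Spark normally picks at random and open just those between the driver and worker hosts:

from pyspark import SparkConf

conf = (SparkConf()
        # Pin the normally random ports so firewall rules can target them.
        .set("spark.driver.port", "10027")
        .set("spark.blockManager.port", "10025")
        .set("spark.driver.blockManager.port", "10026"))
# Then allow inbound TCP on these ports (plus 7077 for the master and the
# web UI ports) between the driver and the workers instead of flushing all rules.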
umesh1989 commented 6 years ago

Hi, I am facing more or less the same issue. I am deploying PredictionIO on a multi-node cluster where training should happen on the worker node. The worker node has been successfully registered with the master.

Following are the logs after starting slaves.sh:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT
18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.
18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM
18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1
18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6
18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081
18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...
18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)
18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077

Now the issues:

1. If I launch one slave on the master node and one slave on my other node:
   1.1 If the slave on the master node is given fewer resources, it gives an "unable to re-shuffle" error.
   1.2 If I give more resources to the worker on the master node, all the execution happens on the master node; it does not send any execution to the slave node.
2. If I do not start a slave on the master node:
   2.1 I get the following error: [WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have assigned 24 GB RAM and 8 cores to the worker.

However, when I start the process, these are the logs I get on the slave machine:

18/05/22 06:16:00 INFO Worker: Asked to launch executor app-20180522061600-0001/0 for PredictionIO Training: com.actionml.RecommendationEngine
18/05/22 06:16:00 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:16:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:16:00 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=45049" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.31.5.119:45049" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522061600-0001" "--worker-url" "spark://Worker@172.31.6.235:45057"
18/05/22 06:16:50 INFO Worker: Asked to kill executor app-20180522061600-0001/0
18/05/22 06:16:50 INFO ExecutorRunner: Runner thread for executor app-20180522061600-0001/0 interrupted
18/05/22 06:16:50 INFO ExecutorRunner: Killing process!
18/05/22 06:16:51 INFO Worker: Executor app-20180522061600-0001/0 finished with state KILLED exitStatus 143
18/05/22 06:16:51 INFO Worker: Cleaning up local directories for application app-20180522061600-0001
18/05/22 06:16:51 INFO ExternalShuffleBlockResolver: Application app-20180522061600-0001 removed, cleanupLocalDirs = true

Can somebody help me debug the issue? Thanks!
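
Based on the launch command in the log above, the executor is told to call back to the driver at spark://CoarseGrainedScheduler@172.31.5.119:45049, and it is killed about 50 seconds later - a pattern that often shows up when the worker host cannot reach the driver's port (which is chosen at random on every run unless pinned). A quick, hedged way to test that from the worker host, reusing the host and port printed in that log line:

import socket

def can_connect(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Driver endpoint taken from the executor launch command in the log above.
# NB: spark.driver.port is random per run unless set explicitly, so run this
# while the training job is still up (or pin the port as in the comment below).
print(can_connect("172.31.5.119", 45049))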

MironAtHome commented 4 years ago

Looks like an "open source software" thing...

Real professionals do not ask this sort of question, apparently.

MironAtHome commented 4 years ago

Since the last post, this "real professional" has figured out how to run a Spark cluster and how to open the following ports and configure them on the master node (they are intended to be configured on the master node, for anyone interested):

spark.blockManager.port          10025
spark.driver.blockManager.port   10026
spark.driver.port                10027

using the \conf\spark-defaults.conf configuration file. Ran into a blocked Stack Overflow account here https://stackoverflow.com/questions/59209920/spark-standalone-cluster-configuration with private feedback, and am now rapidly moving to get a handle on task/executor-to-worker resolution.

I would imagine one needs to write a block of code extending either the logging or the task infrastructure, so that it reports, at the appropriate time in the life of a task/executor instance, an event indicating the state of the instance and the machine (or possibly worker host process plus machine) - a trivial question from an engineering point of view. But as far as social interactions and the willingness of the application maintainers to extend support to overcome these details go, it rather betrays what is so wrong with the Open Source community at its core. I hate you guys.

Barring personal feelings, the next step, depending on findings, might end up being a custom dispatcher for the standalone cluster (if I really do not want to go to Linux - and I don't want to go to Linux for the same reason I don't like open source software), or researching how to better set up the cluster using YARN or some other cluster resource management implementation on Windows, an operating system where stupid idiots such as myself still have a hope of getting a fair shake from someone who wrote an operating system simply because he might like doing so, unlike the real professionals who write Linux. And that step might end up more involved than buying 1 terabyte of memory for my company's SQL Server and sending our workload to its "In-Memory" implementation, Hekaton, which, unlike Spark, has indexes and other nifty features to make processing acceptably fast. No hard feelings...