gettyimages / docker-spark

Docker build for Apache Spark
MIT License

Have you been able to launch jobs with Java? #39

Open yeikel opened 6 years ago

yeikel commented 6 years ago

Hi,

I am running Spark with the following configuration:

version: '2'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7006
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 7016
      - 8881
    ports:
      - 8081:8081

And I have the following simple Java program:

SparkConf conf = new SparkConf().setMaster("spark://localhost:7077").setAppName("Word Count Sample App");
conf.set("spark.dynamicAllocation.enabled","false");
String file = "test.txt";
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> textFile = sc.textFile("src/main/resources/" + file);
JavaPairRDD<String, Integer> counts = textFile
        .flatMap(s -> Arrays.asList(s.split("[ ,]")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);
counts.foreach(p -> System.out.println(p));
System.out.println("Total words: " + counts.count());
counts.saveAsTextFile(file + "out.txt");

The problem I am having is that it generates the following command:

Spark Executor Command: "/usr/jdk1.8.0_131/bin/java" "-cp" "/conf:/usr/spark-2.3.0/jars/*:/usr/hadoop-2.8.3/etc/hadoop/:/usr/hadoop-2.8.3/etc/hadoop/*:/usr/hadoop-2.8.3/share/hadoop/common/lib/*:/usr/hadoop-2.8.3/share/hadoop/common/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/*:/usr/hadoop-2.8.3/share/hadoop/tools/lib/*" "-Xmx1024M" "-Dspark.driver.port=59906" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906" "--executor-id" "6" "--hostname" "172.19.0.3" "--cores" "2" "--app-id" "app-20180401005243-0000" "--worker-url" "spark://Worker@172.19.0.3:8881"

Which results in:


Caused by: java.io.IOException: Failed to connect to yeikel-pc:59906
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: yeikel-pc
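
For context on the failure above: the worker container launches an executor that tries to connect back to the driver at the host machine's name (yeikel-pc), which is not resolvable from inside the Compose network. One common workaround, sketched below in pyspark with made-up values (the same configuration keys can also be set on the Java SparkConf above), is to point spark.driver.host at an address the containers can reach and pin the driver ports so they can be published:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Rough sketch only: 192.168.1.10 and the port numbers are assumptions,
# not values taken from this thread.
conf = (
    SparkConf()
    .setMaster("spark://localhost:7077")
    .setAppName("Word Count Sample App")
    # Address the executors use to reach the driver; it must resolve from
    # inside the containers (a reachable host IP, not the machine name).
    .set("spark.driver.host", "192.168.1.10")
    # Pin the driver and block-manager ports so they can be opened up.
    .set("spark.driver.port", "5001")
    .set("spark.blockManager.port", "5002")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

The pinned ports then have to be reachable from the worker container, for example by publishing them in the compose file.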

JonathanLoscalzo commented 5 years ago

I have the same problem. If you look at stdout instead of stderr:

2019-02-28 03:02:00 INFO  CoarseGrainedExecutorBackend:2566 - Started daemon with process name: 436@worker
2019-02-28 03:02:00 INFO  SignalUtils:54 - Registered signal handler for TERM
2019-02-28 03:02:00 INFO  SignalUtils:54 - Registered signal handler for HUP
2019-02-28 03:02:00 INFO  SignalUtils:54 - Registered signal handler for INT
2019-02-28 03:02:00 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing view acls to: root,jloscalzo
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing modify acls to: root,jloscalzo
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-02-28 03:02:00 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, jloscalzo); groups with view permissions: Set(); users  with modify permissions: Set(root, jloscalzo); groups with modify permissions: Set()

I suppose it's the same problem. I have been trying to execute some simple code from a Jupyter notebook:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# conf = SparkConf().setMaster("http://localhost:7077").setAppName("prueba")
# sc = SparkContext(conf=conf)
spark = SparkSession.builder.master("spark://localhost:7077").config('spark.submit.deployMode', 'client').appName("example").getOrCreate()
sc = spark.sparkContext

# this doesn't execute:
sc.parallelize([1,2,3,4]).sumApprox(1)
jaskiratr commented 5 years ago

@JonathanLoscalzo I'm running into the same issue. Were you able to solve it?

JonathanLoscalzo commented 5 years ago

@jaskiratr not yet. Maybe the problem is that we need to install Spark locally as the master, but I haven't tested that.

Instead, I installed an instance of Spark in a Google Colab notebook with this code:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"
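
For completeness, here is a small sketch of how that install is typically used afterwards (assuming the same paths as above): findspark simply puts the extracted Spark on sys.path so pyspark can be imported and a local-mode session started.

```python
import findspark
findspark.init()  # uses the SPARK_HOME set above

from pyspark.sql import SparkSession

# Local-mode session inside the Colab VM -- no remote master involved.
spark = SparkSession.builder.master("local[*]").appName("colab-check").getOrCreate()
print(spark.range(10).count())  # should print 10
```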

I still haven't figured out whether we must install Spark locally in order to connect to a remote instance.

It should be easier than this, but it isn't.

OneCricketeer commented 4 years ago

It's not clear how you got this:

"--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906"

The driver URL should be spark://master:7077 if you mount your JAR into the worker container and run spark-submit from there rather than from your host machine.

JonathanLoscalzo commented 4 years ago

Sorry @cricket007, when you said it is not clear how this was produced, were you referring to the initial error or to how I use pyspark in Colab?

Where do you write "--driver-url" when you run the containers? (This is a docker-compose file.)

Could you explain in more detail? I have found this link

OneCricketeer commented 4 years ago

@JonathanLoscalzo I was referring to OP.

Your Colab setup is likely very different from running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manually in Colab.

OneCricketeer commented 4 years ago

I still haven't figured out whether we must install Spark locally in order to connect to a remote instance.

You need Spark client libraries, yes. Or you can docker exec or ssh into somewhere else that has them.

Or you can install Apache Livy as a REST interface to submit Spark jobs
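
For illustration, Livy's batch endpoint (by default on port 8998) accepts a POST with the job's jar and main class; the host name and paths below are made up, not part of this setup:

```python
import requests

# Hypothetical Livy server and jar path -- adjust to your own environment.
resp = requests.post(
    "http://livy-host:8998/batches",
    json={
        "file": "local:/opt/jobs/word-count.jar",  # jar visible to the cluster
        "className": "com.example.WordCount",
        "args": ["/data/test.txt"],
    },
)
resp.raise_for_status()
print(resp.json())  # returns the batch id and state, which can be polled later
```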

OneCricketeer commented 4 years ago

I have found this link

Did you find my answer there? See if that network diagram answers any of your networking issues. (Make sure you can telnet/netcat between all relevant ports.)
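
If telnet or netcat aren't available, a rough Python equivalent of that reachability check (using the ports published by the compose file above) would be:

```python
import socket

# Master, master web UI, and worker web UI ports from the compose file;
# add whichever driver/executor ports apply to your setup.
checks = [("localhost", 7077), ("localhost", 8080), ("localhost", 8081)]

for host, port in checks:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK   {host}:{port}")
    except OSError as err:
        print(f"FAIL {host}:{port} -> {err}")
```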

JonathanLoscalzo commented 4 years ago

I still haven't figured out whether we must install Spark locally in order to connect to a remote instance.

You need Spark client libraries, yes. Or you can docker exec or ssh into somewhere else that has them.

Thanks @cricket007, I realized that I need Spark installed locally, or something along those lines, to run the scripts (in this case, on the machine that was running Jupyter). You have now confirmed my issue :+1:. I suppose it is the same issue @yeikel has (?)

Or you can install Apache Livy as a REST interface to submit Spark jobs

I haven't used Apache Livy; do you recommend it?

@JonathanLoscalzo I was referring to OP.

Your colab setup is likely very different than running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manual in CoLab

I don't know what "DataProc" in GCP is. Is it like Databricks for Azure? (I will check it tomorrow.) For "testing purposes", Colab is good enough, I suppose (testing scripts, or teaching pyspark syntax). I don't know if this is related to the issue, but could you recommend some approaches to using Spark in a "development stage"?

Thanks for your answer!

OneCricketeer commented 4 years ago

DataProc is the managed Hadoop/Spark service by Google. Amazon and Azure have similar offerings, if that's what you want.

Databricks is purely Spark. If you want more than that, Qubole is another option.

If all you really want is to learn Spark locally, either extract it locally or use a VM, simply because the networking is easier and an actual cluster would not be installed in containers anyway. (And there are plenty of ways to automate the installation, such as Apache Ambari or Ansible.)

Otherwise, the Cloudera/Hortonworks Sandboxes work fine.

OneCricketeer commented 4 years ago

I haven't used Apache Livy; do you recommend it?

I've used it indirectly via the HUE interface, but it was fairly straightforward to set up.

And I personally use Zeppelin over Jupyter because Spark (Scala) is more tightly integrated, though it handles Python fine too.