Open yeikel opened 6 years ago
I have the same problem. If you look at stdout instead of stderr:
2019-02-28 03:02:00 INFO CoarseGrainedExecutorBackend:2566 - Started daemon with process name: 436@worker
2019-02-28 03:02:00 INFO SignalUtils:54 - Registered signal handler for TERM
2019-02-28 03:02:00 INFO SignalUtils:54 - Registered signal handler for HUP
2019-02-28 03:02:00 INFO SignalUtils:54 - Registered signal handler for INT
2019-02-28 03:02:00 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-02-28 03:02:00 INFO SecurityManager:54 - Changing view acls to: root,jloscalzo
2019-02-28 03:02:00 INFO SecurityManager:54 - Changing modify acls to: root,jloscalzo
2019-02-28 03:02:00 INFO SecurityManager:54 - Changing view acls groups to:
2019-02-28 03:02:00 INFO SecurityManager:54 - Changing modify acls groups to:
2019-02-28 03:02:00 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, jloscalzo); groups with view permissions: Set(); users with modify permissions: Set(root, jloscalzo); groups with modify permissions: Set()
I suppose it's the same problem. I have been trying to execute some simple code from a Jupyter notebook:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# conf = SparkConf().setMaster("http://localhost:7077").setAppName("prueba")
# sc = SparkContext(conf=conf)
spark = SparkSession.builder.master("spark://localhost:7077").config('spark.submit.deployMode', 'client').appName("example").getOrCreate()
sc = spark.sparkContext
# this doesn't execute:
sc.parallelize([1,2,3,4]).sumApprox(1)
@JonathanLoscalzo I'm running into the same issue. Were you able to solve it?
@jaskiratr not for now. Maybe the problem is that we need to install Spark locally as a master, but I haven't tested that.
Instead, I installed an instance of Spark on a Google Colab notebook with this code:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"
import findspark
findspark.init()  # make the SPARK_HOME install importable from Python
I still haven't figured out whether we must install Spark locally to connect to a remote instance.
It should be easier than this, but it isn't.
It's not clear to me how you got this:
"--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906"
The driver URL should be spark://master:7077 if you mount your JAR into the worker container and run spark-submit from there rather than from your host machine.
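For context on where the yeikel-pc hostname comes from: Spark builds the executor's callback URL from spark.driver.host, which defaults to the driver machine's hostname, a name the Docker containers usually cannot resolve. A stdlib-only sketch of how that URL is assembled (illustrative, not Spark's actual source):

```python
import socket

# spark.driver.host defaults to the driver machine's hostname
# (e.g. "yeikel-pc" in the command quoted above); spark.driver.port
# is picked at random unless configured explicitly.
driver_host = socket.gethostname()
driver_port = 59906  # example value taken from the executor command

driver_url = f"spark://CoarseGrainedScheduler@{driver_host}:{driver_port}"
print(driver_url)
```

Setting spark.driver.host to an address the containers can reach (and pinning spark.driver.port) is the usual way to make the callback resolvable.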
Sorry @cricket007, when you said you didn't get this, were you referring to the initial error or to how I use pyspark in Colab?
Where do you write "--driver-url" when you run the containers? (This is a docker-compose setup.)
Could you explain in more detail? I have found this link
@JonathanLoscalzo I was referring to OP.
Your Colab setup is likely very different from running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manual in Colab.
I didn't catch up if we must install spark locally to connect to a remote instance
You need Spark client libraries, yes. Or you can docker exec or ssh to somewhere else that does.
Or you can install Apache Livy as a REST interface to submit Spark jobs
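A hedged sketch of what a Livy submission looks like: Livy listens on port 8998 by default and accepts batch jobs via POST /batches. The host and jar path below are placeholders, not values from this thread.

```python
import json
import urllib.request

payload = {
    "file": "/path/to/app.jar",        # placeholder: your application jar
    "className": "com.example.Main",   # placeholder: your main class
}
req = urllib.request.Request(
    "http://livy-host:8998/batches",   # 8998 is Livy's default port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment against a real Livy endpoint
print(req.get_method(), req.full_url)
```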
I have found this link
Did you find my answer there? See if that network diagram answers any of your networking questions. (Make sure you can telnet / netcat between all relevant ports.)
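The telnet / netcat check can also be scripted; a minimal sketch using only the Python standard library (the host/port pairs are examples, adjust them to your compose file):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Typical standalone-Spark ports: 7077 master RPC, 8080 master UI,
# 8881 worker RPC (as seen in the executor command in this thread).
for host, port in [("localhost", 7077), ("localhost", 8080)]:
    status = "reachable" if port_open(host, port) else "unreachable"
    print(f"{host}:{port} {status}")
```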
I didn't catch up if we must install spark locally to connect to a remote instance
You need Spark client libraries, yes. Or you can docker exec or ssh to somewhere else that does.
Thanks @cricket007, I realized that I need Spark installed locally (or something along those lines) on the machine running Jupyter in order to run the scripts. Now you have confirmed my issue :+1: . I suppose it is the same issue @yeikel has (?)
Or you can install Apache Livy as a REST interface to submit Spark jobs
I didn't use Apache Livy, do you recommend it?
@JonathanLoscalzo I was referring to OP.
Your colab setup is likely very different than running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manual in CoLab
I don't know what "DataProc" in GCP is. Is it like Databricks for Azure? (I will check it tomorrow.) For "testing purposes", Colab is good enough, I suppose (testing scripts, or teaching pyspark syntax). I don't know if it is related to this issue, but could you recommend some approaches for using Spark in a "development stage"?
Thanks for your answer!
DataProc is the managed Hadoop/Spark service by Google. Amazon and Azure have similar offerings, if that's what you want.
Databricks is purely Spark. If you want more than that, Qubole is another option.
If all you really want is to learn Spark locally, either extract it locally or use a VM, simply because the networking is easier, and the way you would install an actual cluster would not be in containers (and there are plenty of ways to automate the installation, such as Apache Ambari or Ansible).
Otherwise, the Cloudera/Hortonworks Sandboxes work fine.
didn't use Apache Livy, do you recommend it?
I've used it indirectly via HUE interface, but it was fairly straightforward to setup.
And I personally use Zeppelin over Jupyter because Spark (Scala) is more tightly integrated, though it can handle Python fine
Hi,
I am running Spark with the following configuration:
And I have the following simple Java program:
The problem I am having is that it generates the following command:
Spark Executor Command: "/usr/jdk1.8.0_131/bin/java" "-cp" "/conf:/usr/spark-2.3.0/jars/*:/usr/hadoop-2.8.3/etc/hadoop/:/usr/hadoop-2.8.3/etc/hadoop/*:/usr/hadoop-2.8.3/share/hadoop/common/lib/*:/usr/hadoop-2.8.3/share/hadoop/common/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/*:/usr/hadoop-2.8.3/share/hadoop/tools/lib/*" "-Xmx1024M" "-Dspark.driver.port=59906" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906" "--executor-id" "6" "--hostname" "172.19.0.3" "--cores" "2" "--app-id" "app-20180401005243-0000" "--worker-url" "spark://Worker@172.19.0.3:8881"
Which results in