TIBCOSoftware / snappy-on-k8s

An integrated and collaborative cloud environment for building and running Spark applications on PKS/Kubernetes

Data transport problem #24

Open jtlz2 opened 5 years ago

jtlz2 commented 5 years ago

Having deployed using your charts, and after a hello-world pi calculation, I am trying to execute some simple commands within Jupyter, based on https://github.com/jadianes/spark-py-notebooks/tree/master/nb1-rdd-creation

Note that the kernel has to be set manually to python2, since it defaults to python3.

from pyspark.sql import SparkSession
import urllib

spark = SparkSession\
      .builder\
      .appName("PythonPi")\
      .config("spark.app.name", "spark-pi")\
      .config("spark.executor.instances", "2")\
      .getOrCreate()

f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

sc = spark.sparkContext
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

# Then the next line yields an Error:
raw_data.count()
Py4JJavaErrorTraceback (most recent call last)
[...]
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 4 times, most recent failure: Lost task 0.3 in stage 24.0 (TID 52, 10.2.0.25, executor 1): java.io.FileNotFoundException: File file:/home/jovyan/kddcup.data_10_percent.gz does not exist

How do I make the data available to all Spark workers in the k8s cluster?

dshirish commented 5 years ago

The urllib.urlretrieve call downloads the file only onto the Jupyter (driver) pod's local filesystem; the executors run in separate pods with their own filesystems, which is why they fail with FileNotFoundException. To make the data file available to the executors as well, keep it on an HDFS-compatible file system (for example S3, GCS, or HDFS) and pass the corresponding URI to sc.textFile().
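A minimal sketch of that suggestion, adapted from the snippet above. It assumes the file has already been uploaded to an S3 bucket (the bucket name my-bucket is hypothetical) and that the hadoop-aws/s3a connector is on the classpath of the driver and executor pods; for GCS or HDFS, only the URI scheme would change (gs:// or hdfs://).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PythonPi") \
    .config("spark.app.name", "spark-pi") \
    .config("spark.executor.instances", "2") \
    .getOrCreate()

sc = spark.sparkContext

# Instead of a driver-local path like "./kddcup.data_10_percent.gz",
# point textFile() at a URI every executor pod can resolve.
# "my-bucket" is a placeholder for your own bucket.
data_file = "s3a://my-bucket/kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

# Now the count runs on the executors, which fetch the data themselves.
print(raw_data.count())
```

The key point is that textFile() is evaluated on the executors, so the URI must be resolvable from every pod, not just from the Jupyter pod where the notebook runs.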