apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Error: Only local python files are supported: gs://... #527

Open paulreimer opened 7 years ago

paulreimer commented 7 years ago

I extended the docker image using the recent spark-2.2.0-k8s-0.4.0-bin-2.7.3 release to add the GCS (Google Cloud Storage) connector.

Observed: It works great for Scala jobs/jars with a gs://<bucket>/ prefix - I can see it creates the init container and populates the spark files from what was already in GCS. However, when I try to submit a Python job (or use --py-files), the spark-submit client does not allow the gs:// prefix and refuses the job.

Error: Only local python files are supported: gs://<my_bucket_name>/pi.py
Run with --help for usage help or --verbose for debug output

Expected: The job to be accepted by spark-submit, the relevant files populated by an init container, and made available for spark-driver-py and spark-executor-py to use successfully.

(FYI, to add the GCS connector I added these lines to the spark-base Dockerfile:)

ENV hadoop_ver 2.7.4
# Add Hadoop 2.x native libs
ADD http://www.us.apache.org/dist/hadoop/common/hadoop-${hadoop_ver}/hadoop-${hadoop_ver}.tar.gz /opt/
RUN cd /opt/ && \
    tar xf hadoop-${hadoop_ver}.tar.gz && \
    ln -s hadoop-${hadoop_ver} hadoop

# Add the GCS connector.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar ${SPARK_HOME}/jars/
paulreimer commented 7 years ago

I should note that for the GCS connector, I also had to add some runtime config files (notably core-site.xml, and start-common.sh, which I merged into this repo's entrypoint.sh), mostly based on https://github.com/kubernetes-incubator/application-images/tree/master/spark

I also had to add :${SPARK_HOME}/conf to SPARK_CLASSPATH in spark-driver-py and spark-executor-py, for it to pick up the core-site.xml.

foxish commented 7 years ago

cc @liyinan926 This looks similar to your work with GCS.

liyinan926 commented 7 years ago

This should be fixed by adding !isKubernetesCluster to https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L328.
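
For reference, the check in question looks roughly like the following (a sketch, not the exact source; it reuses names that appear around that line, such as Utils.nonLocalPaths and printErrorAndExit), with the suggested guard added:

// Sketch of the python-file validation in SparkSubmit.scala, with the extra
// !isKubernetesCluster guard. In kubernetes cluster mode, non-local (e.g. gs://)
// python files then pass validation and can be fetched by the init container instead.
if (args.isPython && !isYarnCluster && !isMesosCluster && !isKubernetesCluster) {
  if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
    printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}")
  }
  // similar blocks nearby validate --py-files and R files
}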

liyinan926 commented 7 years ago

BTW @paulreimer: I found that instead of baking in core-site.xml just to configure the GCS connector (e.g., service account configuration), you can pass the configuration properties using --conf spark.hadoop.[ConfigurationName].

paulreimer commented 7 years ago

Interesting. It took me a long time to figure out that I needed to add ${SPARK_HOME}/conf to SPARK_CLASSPATH to get it to pick up core-site.xml, and I had also tried to set fs.gs.project.id from the command line without success. Does your suggestion mean I could use --conf spark.hadoop.fs.gs.project.id (i.e. prefix it with spark.hadoop)? That would have saved me a lot of time.

One nice thing about baking it in, though, is that the start-common.sh script detects the GCE project name and writes the fs.gs.project.id setting into core-site.xml before starting. That way, at least for GCE clusters, you don't need to pass that info to spark-submit and can reuse the same image (gs:// URIs then work automatically, assuming the GCE nodes have access to the storage bucket).
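
(For illustration only, here is roughly what that detection amounts to, sketched in Scala rather than the actual shell in start-common.sh; it assumes the standard GCE metadata endpoint and therefore only works from inside GCE.)

import java.net.URL
import scala.io.Source

// Read the project id from the GCE metadata server; the result can then be
// supplied to Spark as spark.hadoop.fs.gs.project.id instead of being written
// into a baked-in core-site.xml.
def gceProjectId(): String = {
  val conn = new URL("http://metadata.google.internal/computeMetadata/v1/project/project-id")
    .openConnection()
  conn.setRequestProperty("Metadata-Flavor", "Google") // header required by the metadata server
  Source.fromInputStream(conn.getInputStream).mkString
}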

paulreimer commented 7 years ago

(I was using only GCE resources, and so allowing "application default credentials" to Just Work, instead of manually specifying service accounts.)

liyinan926 commented 7 years ago

@paulreimer Yes, you can use --conf spark.hadoop.fs.gs.project.id. Spark will peel off the prefix spark.hadoop.
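
Roughly, the peeling works like this (a sketch of the logic in SparkHadoopUtil, not the exact source): every spark.hadoop.foo=bar entry in the SparkConf is copied into the Hadoop Configuration as foo=bar, so --conf spark.hadoop.fs.gs.project.id=my-project (my-project being a placeholder) ends up setting fs.gs.project.id for the GCS connector.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Copy every "spark.hadoop.foo=bar" Spark property into the Hadoop
// Configuration as "foo=bar".
def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
    hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
  }
}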

liyinan926 commented 7 years ago

@paulreimer FYI https://github.com/liyinan926/spark-gcp-examples/blob/master/spark-examples/bigquery-wordcount/README.md.

paulreimer commented 7 years ago

Sounds good, I will need something like that for the non-GCE clusters.

I was unable to build a working distribution with the !isKubernetesCluster change applied (it did build, but the init container doesn't work).

My build also fails for Scala jobs that previously worked with my image with the GCS connector added (using the 0.4.0 release jars), so something must be wrong with my build environment (I have never built Spark before). I used build/mvn -T4 -DskipTests package, and I noticed that there are far more jars in the official release tarball than were generated in assembly/target/scala-2.11/jars. I also didn't get a dist tarball at the end of the process; not sure if that is expected.

I would be happy to test updated binaries from a working build, with the !isKubernetesCluster change applied, if anyone else can build them.

liyinan926 commented 7 years ago

Try this build command: ./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3

paulreimer commented 7 years ago

Right on, that command worked for me, and the suggested change also worked! I was able to successfully submit my python job, using gs:// on GCE (without a local copy of the file, and without using the resource-staging-server). I also applied the change to the check for R files in the same place in that file.

Note that I only had to replace the spark-submit client binary from my build; I was able to use my existing images (based on the official 0.4.0 binaries) with the GCS connector added. It seems it was really just that client-side check denying the job; the init container part worked smoothly with a gs:// URI.

Thanks so much, I really appreciate your help, @liyinan926 !

liyinan926 commented 7 years ago

Cool! Can you submit a PR with the change? Thanks!