paulreimer opened this issue 7 years ago
I should note that for the GCS connector, I also had to add some runtime config files (notably `core-site.xml` and `start-common.sh`, which I merged into this repo's `entrypoint.sh`), mostly based on https://github.com/kubernetes-incubator/application-images/tree/master/spark. I also had to append `:${SPARK_HOME}/conf` to `SPARK_CLASSPATH` in `spark-driver-py` and `spark-executor-py` for it to pick up the `core-site.xml`.
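For reference, a hedged sketch of what such a `core-site.xml` might contain for the GCS connector; the property names follow the connector's documented configuration, and the project id value is a placeholder, not anything from this thread:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Register the GCS connector so Hadoop resolves gs:// URIs -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <!-- Placeholder project id; a startup script can fill this in on GCE -->
  <property>
    <name>fs.gs.project.id</name>
    <value>my-gce-project</value>
  </property>
</configuration>
```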
cc @liyinan926 This looks similar to your work with GCS.
This should be fixed by adding `!isKubernetesCluster` to https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L328.
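To illustrate the shape of that fix, here is a hypothetical Python sketch (the real check lives in Scala, in `SparkSubmit.scala`; all names below are illustrative, not Spark's actual identifiers) of a client-side validation that rejects remote Python files, with the suggested kubernetes-cluster exemption added:

```python
def is_local_uri(uri: str) -> bool:
    """True for resources the spark-submit client can read directly."""
    scheme = uri.split("://", 1)[0] if "://" in uri else "file"
    return scheme in ("file", "local")

def validate_python_resource(primary_resource: str,
                             is_kubernetes_cluster: bool) -> None:
    # Adding the `not is_kubernetes_cluster` guard lets kubernetes cluster
    # mode through, since the init-container fetches remote files itself.
    if not is_kubernetes_cluster and not is_local_uri(primary_resource):
        raise ValueError(
            f"Only local python files are supported: {primary_resource}")

# A gs:// primary resource is now accepted in kubernetes cluster mode:
validate_python_resource("gs://my-bucket/job.py", is_kubernetes_cluster=True)
```

Without the guard, the same `gs://` URI would be rejected before the init-container ever got a chance to download it.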
BTW: @paulreimer I found that instead of baking in `core-site.xml` just for configuration (e.g., service account configuration) for the GCS connector, you could pass in the configuration properties using `--conf spark.hadoop.[ConfigurationName]`.
Interesting. It took me a long time to figure out that I needed to add `${SPARK_HOME}/conf` to `SPARK_CLASSPATH` to get it to pick up `core-site.xml`, but I also tried to set `fs.gs.project.id` from the command line and couldn't figure it out. Does your suggestion mean I could use `--conf spark.hadoop.fs.gs.project.id` (i.e., prefix it with `spark.hadoop`)? That would have saved me a lot of time.
One nice thing about baking it in, though, is that the `start-common.sh` script detects the GCE project name and writes the `fs.gs.project.id` setting into `core-site.xml` before starting. That way, at least for GCE clusters, you don't need to pass that info to `spark-submit` and can reuse the same image (it then works automatically for `gs://` URIs, assuming the GCE nodes have access to the storage bucket). (I was using only GCE resources, so "application default credentials" Just Work instead of my manually specifying service accounts.)
@paulreimer Yes, you can use `--conf spark.hadoop.fs.gs.project.id`. Spark will peel off the `spark.hadoop` prefix.
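In other words, `--conf spark.hadoop.fs.gs.project.id=...` ends up in the Hadoop Configuration as plain `fs.gs.project.id`. A minimal Python sketch of that prefix peeling (assumed behavior for illustration, not Spark's actual implementation; the project id value is a placeholder):

```python
PREFIX = "spark.hadoop."

def hadoop_conf_from_spark_conf(spark_conf):
    """Return only the spark.hadoop.* entries, with the prefix stripped."""
    return {key[len(PREFIX):]: value
            for key, value in spark_conf.items()
            if key.startswith(PREFIX)}

conf = {
    "spark.app.name": "gcs-demo",                       # left alone
    "spark.hadoop.fs.gs.project.id": "my-gce-project",  # placeholder id
}
print(hadoop_conf_from_spark_conf(conf))  # {'fs.gs.project.id': 'my-gce-project'}
```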
Sounds good, I will need something like that for the non-GCE clusters.
I was unable to build a working distribution with the `!isKubernetesCluster` change applied (it did build, but the init-container doesn't work). My build also fails for Scala jobs that previously worked with my image with the GCS connector added (using the 0.4.0 release jars), so something must be wrong with my build environment (I have never built Spark before). I used `build/mvn -T4 -DskipTests package`, and I noticed that there are far more jars in the official release tarball than were generated in `assembly/target/scala-2.11/jars`. I also didn't get a dist tarball at the end of the process; not sure if that is expected. I would be happy to test updated binaries from a working build, with the `!isKubernetesCluster` change applied, if anyone else can build them.
Try this build command: `./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3`.
Right on, that command worked for me, and the suggested change also worked! I was able to successfully submit my Python job using `gs://` on GCE (without a local copy of the file, and without using the resource-staging-server). I also applied the change to the check for R files in the same place in that file. Note, I only had to replace the `spark-submit` client binary from my build; I was able to use my existing images, based on the official 0.4.0 binaries, with the GCS connector added. It seems it was really just that client check denying the job; the init-container part worked smoothly with a `gs://` URI. Thanks so much, I really appreciate your help, @liyinan926!
Cool! Can you submit a PR with the change? Thanks!
I extended the Docker image using the recent `spark-2.2.0-k8s-0.4.0-bin-2.7.3` release to add the GCS (Google Cloud Storage) connector.

Observed: It works great for Scala jobs / jars with a `gs://<bucket>/` prefix; I see it creates the init container and populates the `spark-files` from what was already in GCS. However, when I try to submit a Python job (or use `--py-files`), the `spark-submit` client does not allow the `gs://` prefix and refuses the job.

Expected: The job to be allowed by `spark-submit`, the relevant files populated in an init container, and available for the `spark-driver-py` and `spark-executor-py` to use successfully.

(FYI: To add the GCS connector, I added these lines to the `spark-base` Dockerfile:)
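(The actual Dockerfile lines did not survive here. As a purely hypothetical sketch, adding the connector usually amounts to placing its jar on Spark's classpath; the download URL and target path below are assumptions, not the author's lines:)

```dockerfile
# Hypothetical sketch: connector URL and jars directory are assumptions.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar \
    /opt/spark/jars/
```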