apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

PySpark Submission fails without --jars #409

Open ifilonenko opened 7 years ago

ifilonenko commented 7 years ago

An interesting problem arises when submitting the example PySpark jobs without --jars. Here is an example submission:

  env -i bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://192.168.99.100:8443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=1 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/driver-py:v2.1.0-kubernetes-0.3.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/executor-py:v2.1.0-kubernetes-0.3.0 \
  --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.1.0-kubernetes-0.3.0 \
  --py-files local:///opt/spark/examples/src/main/python/sort.py \
  local:///opt/spark/examples/src/main/python/pi.py 10

This fails with: Error: Could not find or load main class .opt.spark.jars.activation-1.1.1.jar

The error is resolved by explicitly passing the examples jar via --jars:

  env -i bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://192.168.99.100:8443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=1 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/driver-py:v2.1.0-kubernetes-0.3.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/executor-py:v2.1.0-kubernetes-0.3.0 \
  --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.1.0-kubernetes-0.3.0 \
  --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.0-SNAPSHOT.jar \
  --py-files local:///opt/spark/examples/src/main/python/sort.py \
  local:///opt/spark/examples/src/main/python/pi.py 10

Is this behavior expected? In the integration environment I specify jars for the second PySpark test but not for the first test (as I launch the RSS). However, both seem to pass, which makes me think that it isn't necessary to specify the jars.

erikerlandson commented 7 years ago

It makes sense to me that you need to explicitly specify jar deps from examples. I'm more confused by the cases where it's working without those.

sahilprasad commented 7 years ago

@erikerlandson can you elaborate on why this makes sense? I just ran into this and don't understand why the jar needs to be provided when I only intend to execute a Python app.

mccheah commented 7 years ago

I don't think this is correct. @ifilonenko can you take another look at this?

erikerlandson commented 7 years ago

@sahilprasad in general, Python jobs may execute JVM code, and in particular they may need additional jar deps that have to be supplied using --jars.
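
Purely as an illustration (the connector jar and application script below are invented, and the usual Kubernetes image settings are omitted for brevity), such a job would be submitted with its JVM dependency listed explicitly:

  # Hypothetical submission: a Python app that reads through a JVM-side
  # data source needs that connector jar supplied via --jars.
  bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://192.168.99.100:8443 \
  --jars local:///opt/spark/extra/my-connector.jar \
  local:///opt/app/read_with_connector.py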

@mccheah do you mean the spark-examples jar shouldn't be needed to run pi.py?

mccheah commented 7 years ago

I think in this particular example the error message is either incorrect or we're not passing arguments along properly: Error: Could not find or load main class .opt.spark.jars.activation-1.1.1.jar

If it was just a classpath failure then I would expect a ClassNotFoundException or something similar. Was there a stack trace @ifilonenko?
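
For reference, the two failure modes look different even outside Spark; the jar paths below are made up and only a plain JDK java on the PATH is assumed:

  # A true classpath miss at runtime surfaces as a ClassNotFoundException from
  # inside the JVM. The launcher-level message above instead appears when a
  # stray argument (here, a jar path) lands in the main-class position:
  java -cp /opt/spark/jars/first.jar /opt/spark/jars/second.jar
  # expected to fail with the same "Could not find or load main class" shape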

sahilprasad commented 7 years ago

@mccheah when I replicated this problem, the error that @ifilonenko provided is the only thing that I see.

ifilonenko commented 7 years ago

What is quite interesting is that I didn't hit the missing-jar exception when I ran this with #364. But as @erikerlandson mentioned, this seems to be attributable to the spark-examples jar being needed when running the PySpark examples. I would assume a better test would be to run PySpark jobs outside of spark-examples and see whether the error persists.

mccheah commented 7 years ago

My hypothesis is that the Python Dockerfile is running an ill-formatted Java command. @sahilprasad @ifilonenko - if either of you can track down how the command isn't being formed properly then that would be helpful. We had to fix a similar problem with https://github.com/apache-spark-on-k8s/spark/pull/444 for example.

sahilprasad commented 7 years ago

@mccheah I can take a look. I also think that it's an ill-formatted Java command that's at the root of the issue, but I'll update this issue with what I find.

erikerlandson commented 7 years ago

W.r.t. making it easier to observe command formatting, we might tweak entrypoint.sh to print out the command it's going to execute.
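
A minimal sketch of that tweak (the CMD array and its use here are assumptions, not the contents of the real entrypoint.sh):

  # Surface the fully assembled launch command in the container log before
  # handing control to it.
  set -x                          # trace every command the shell runs from here on
  echo "Launching: ${CMD[*]}"     # assumes the script gathers its command in CMD
  exec "${CMD[@]}"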

#462

sahilprasad commented 7 years ago

I was able to get @ifilonenko's first example working without --jars. When the Java classpath is just /opt/spark/jars/* and is passed via the -cp flag within the Dockerfile, the jars are somehow not recognized. Adding a colon to the beginning of that classpath got it working. Should I submit a PR for this, or is there a better solution?

See changes here: https://github.com/apache-spark-on-k8s/spark/compare/branch-2.2-kubernetes...sahilprasad:python-jars
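
One guess at why the leading colon matters, assuming the Dockerfile leaves the wildcard unquoted for the shell to expand (a hypothesis about the image, not a confirmed trace). The shell behaviour itself can be checked standalone:

  # Without the colon the shell expands the glob, so the jars become separate
  # arguments and only the first one actually follows -cp; the next jar path
  # then sits where java expects the main class.
  mkdir -p /tmp/demo-jars && touch /tmp/demo-jars/a.jar /tmp/demo-jars/b.jar
  echo java -cp /tmp/demo-jars/*    # prints: java -cp /tmp/demo-jars/a.jar /tmp/demo-jars/b.jar
  # With the colon the pattern matches nothing on disk, so it stays one literal
  # argument and java would expand the wildcard itself.
  echo java -cp :/tmp/demo-jars/*   # prints: java -cp :/tmp/demo-jars/*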

erikerlandson commented 7 years ago

@sahilprasad you should submit that as a PR and we can discuss. Are we overwriting existing entries on SPARK_CLASSPATH? If it's empty, I wouldn't expect prepending a : to change the result.