TIBCOSoftware / snappy-on-k8s

An integrated and collaborative cloud environment for building and running Spark applications on PKS/Kubernetes

update to Spark v2.3/2.4 - pyspark + client mode #23

Open jtlz2 opened 5 years ago

jtlz2 commented 5 years ago

Hi - so very pleased to have found this project which is by far the simplest Spark+Zeppelin+Jupyter+k8s installation I have come across. You've really nailed it!

Is there any chance you could update for Spark v2.3/v2.4?

This is to allow for pyspark and client-mode.

Thanks again

dshirish commented 5 years ago

Changes for the Spark 2.4 update of these charts are available in a branch:

https://github.com/SnappyDataInc/spark-on-k8s/tree/chart_upgrade_2.4

You can try out this branch. Note that the docs/README files have not been updated yet.
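
Roughly, trying it out would look like this (untested sketch; Helm 2 syntax, and the chart directory name is assumed from the paths mentioned elsewhere in this thread):

```bash
# Check out the 2.4 branch and install one of the individual charts
git clone -b chart_upgrade_2.4 https://github.com/SnappyDataInc/spark-on-k8s.git
cd spark-on-k8s
# chart directory name assumed; adjust to the actual repo layout
helm install --name jupyter ./jupyter-with-spark
```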

jtlz2 commented 5 years ago

@dshirish Perfect - thanks! Are there updated container images, analogous to snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1?

dshirish commented 5 years ago

Images are based on Spark 2.4. The values.yaml files of the individual charts point to images based on the 2.4 version.
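
To double-check, you can list the image references in the individual charts' values.yaml files (chart paths assumed from the directories mentioned in this thread):

```bash
# Show which image tags the individual charts reference
grep -n "image" jupyter-with-spark/values.yaml zeppelin-with-spark/values.yaml
```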

jtlz2 commented 5 years ago

Strange - somewhere there is a mismatch in the image name:

    Events:
      Type     Reason     Age                     From                Message
      ----     ------     ----                    ----                -------
      Normal   Scheduled  7m27s                   default-scheduler   Successfully assigned jtlz2/spark-1552472693849-exec-1 to x
      Normal   Pulling    5m10s (x4 over 7m7s)    kubelet, x          pulling image "spark-executor:2.2.0-k8s-0.5.0"
      Warning  Failed     5m7s (x4 over 7m2s)     kubelet, x          Failed to pull image "spark-executor:2.2.0-k8s-0.5.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for spark-executor, repository does not exist or may require 'docker login'
      Warning  Failed     5m7s (x4 over 7m2s)     kubelet, x          Error: ErrImagePull
      Warning  Failed     4m19s (x7 over 6m56s)   kubelet, x          Error: ImagePullBackOff
      Normal   BackOff    114s (x14 over 6m56s)   kubelet, x          Back-off pulling image "spark-executor:2.2.0-k8s-0.5.0"

It's trying to fetch spark-executor:2.2.0-k8s-0.5.0 rather than spark-executor:2.2.0-k8s-0.5.1 - but I can't see where to set this.
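
One way I can think of to track down where the stale tag comes from is to grep for it (untested; as far as I know, the 2.2 k8s fork sets the executor image via the spark.kubernetes.executor.docker.image property):

```bash
# Find every reference to the stale executor tag in the checked-out charts
grep -rn "2.2.0-k8s-0.5.0" .

# The 2.2 k8s fork sets the executor image through this Spark property,
# so check any spark-defaults.conf / SPARK_SUBMIT_OPTIONS for it too
grep -rn "spark.kubernetes.executor.docker.image" .
```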

dshirish commented 5 years ago

Which charts show this error - the Spark 2.4 charts (branch https://github.com/SnappyDataInc/spark-on-k8s/tree/chart_upgrade_2.4) or the Spark 2.2 based charts? Also, which command produces the error - Zeppelin, Jupyter, or spark-submit?

jtlz2 commented 5 years ago

That output is from describing an executor pod:

kubectl describe po spark-1552472693849-exec-1

It's submitted from Jupyter. If I do

kubectl logs spark-all-jupyter-86879b97c8-wxbmh

I get errors like:

    [I 14:08:46.742 NotebookApp] 302 GET /tree? (10.2.1.1) 0.58ms
    2019-03-13 14:08:49 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    [I 14:08:53.830 NotebookApp] 302 GET / (10.2.1.1) 0.53ms
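
I believe that warning just means no executors ever registered, which is consistent with the pull failures above. Listing the executor pods shows them stuck (assuming Spark's usual spark-role=executor pod label):

```bash
# List executor pods and their status; ImagePullBackOff should show up here
kubectl get pods -l spark-role=executor
```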

jtlz2 commented 5 years ago

@dshirish It's for the chart_upgrade_2.4 branch - the current master branch worked fine.

dshirish commented 5 years ago

The 2.4 changes in the branch are not yet complete; for example, the umbrella chart has not been updated.

The individual charts should work fine. How are you starting the charts? Currently only the umbrella chart still refers to a 2.2-based image, for its Zeppelin configuration (apart from the examples in the README).
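
To confirm which image a deployed release actually pulled, a generic kubectl one-liner (not specific to these charts) works:

```bash
# Print each pod's name and container image(s) in the current namespace
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```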

jtlz2 commented 5 years ago

Thanks @dshirish - the individual charts indeed deploy just fine.

I have come across a couple of sub-issues:

  1. In Zeppelin, %spark.pyspark is not available - how can I fix this?
  2. In the Jupyter chart, where can I set environment variable values? I want to grant the notebook access to a database secret via this mechanism. Alternatively, should I put it in jupyter-with-spark/conf/secrets (and update .helmignore to ignore that path)?

Thanks!

dshirish commented 5 years ago

> 1. In Zeppelin, %spark.pyspark is not available - how can I fix this?

We haven't tested pyspark in Zeppelin, but the following changes should be needed:

- The Zeppelin docker image (built using dockerfiles/zeppelin/Dockerfile) does not copy the python directory from the Spark distribution, so we will need to include it and set some Python-related environment variables (similar to dockerfiles/jupyter/Dockerfile).
- Modify SPARK_SUBMIT_OPTIONS in zeppelin-with-spark/values.yaml to use the snappydatainc/spark-py:v2.4 image for the spark.kubernetes.container.image configuration.
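
For the second change, the resulting submit options would look roughly like this (untested sketch, shown as the final environment variable; the exact values.yaml key should be checked against the chart, which normally sets this rather than exporting it by hand):

```bash
# Rough shape of the Zeppelin submit options after the change
export SPARK_SUBMIT_OPTIONS="--conf spark.kubernetes.container.image=snappydatainc/spark-py:v2.4"
```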

> 2. In the Jupyter chart, where can I set environment variable values? I want to grant the notebook access to a database secret via this mechanism. Alternatively, should I put it in jupyter-with-spark/conf/secrets (and update .helmignore to ignore that path)?

Yes, you can keep the secrets file in jupyter-with-spark/conf/secrets and it will be made available at the /etc/secrets/ path in the pod. The other option is to expose environment variables via ConfigMaps, similar to how the zeppelin-with-spark chart does it (refer to its values.yaml and configmap.yaml files).
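
As a sketch of both options (file names and values are placeholders; the second option uses kubectl's built-in ConfigMap creation just for illustration, rather than the chart's own configmap.yaml mechanism):

```bash
# Option 1: ship a secrets file with the chart; per the note above it will
# surface under /etc/secrets/ inside the pod (file name is a placeholder)
mkdir -p jupyter-with-spark/conf/secrets
cat > jupyter-with-spark/conf/secrets/db-credentials <<'EOF'
DB_USER=notebook
DB_PASSWORD=change-me
EOF

# Option 2: a ConfigMap for non-secret env-style values
kubectl create configmap jupyter-env --from-literal=DB_HOST=db.internal
```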