apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Test submissions using --packages and also with ".pyc enhanced" jars #372

Open erikerlandson opened 7 years ago

erikerlandson commented 7 years ago

Spark supports two submission arguments, --packages and --repositories, for downloading Maven-style jar artifacts (and their transitive dependencies) as another way to deliver jar files to drivers and executors. I expect this may already work with spark-on-k8s, but it isn't covered by the integration tests.
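For illustration, here is a minimal Python sketch of how an integration test might assemble such a submission. It only builds the argument list and does not run anything. The helper name `build_submit_command`, the API server URL, the application path, the artifact coordinate, and the repository URL are all hypothetical placeholders; only the `spark-submit` flags themselves (`--master`, `--deploy-mode`, `--packages`, `--repositories`) are real.

```python
import shlex
from typing import List, Sequence


def build_submit_command(
    master: str,
    app_resource: str,
    packages: Sequence[str],
    repositories: Sequence[str] = (),
) -> List[str]:
    """Return a spark-submit argv list that pulls Maven artifacts via --packages.

    Hypothetical helper for illustration; not part of Spark or spark-on-k8s.
    """
    for coord in packages:
        # --packages expects Maven coordinates of the form groupId:artifactId:version.
        if len(coord.split(":")) != 3:
            raise ValueError(f"not a groupId:artifactId:version coordinate: {coord!r}")

    cmd = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    if packages:
        # Both flags take a single comma-separated value.
        cmd += ["--packages", ",".join(packages)]
    if repositories:
        cmd += ["--repositories", ",".join(repositories)]
    cmd.append(app_resource)
    return cmd


if __name__ == "__main__":
    argv = build_submit_command(
        master="k8s://https://example.invalid:6443",  # hypothetical API server
        app_resource="local:///opt/spark/examples/app.py",  # hypothetical path
        packages=["com.example:demo-lib_2.11:0.1.0"],  # hypothetical artifact
        repositories=["https://repo.example.invalid/maven"],  # hypothetical repo
    )
    print(" ".join(shlex.quote(a) for a in argv))
```

A test harness could pass the returned list to `subprocess.run` and then check that the driver and executors can load classes from the downloaded artifact.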

A related PySpark-specific issue is that Spark supports an enhanced jar artifact format that may also include .pyc files. This format is documented here: https://spark-packages.org/artifact-help

An example submission using --packages and the companion --repositories from some of my own experiments can be seen here: https://gist.github.com/erikerlandson/601a21bf50b4847314cf4d76343af699

I believe the primary use case envisioned for this format is to let a PySpark user easily acquire custom PySpark definitions that are backed by Java/Scala implementations, all from a single jar artifact.
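As a rough aid for testing such artifacts, the sketch below inspects a jar and groups its entries into JVM class files and Python files. It relies only on the fact that a jar is a zip archive. The function name `classify_jar_entries` is hypothetical, and matching entries purely by file extension is an assumption made here, not something taken from the format documentation linked above.

```python
import zipfile
from typing import BinaryIO, Dict, List, Union


def classify_jar_entries(jar: Union[str, BinaryIO]) -> Dict[str, List[str]]:
    """Group the entries of a jar (a zip archive) by kind.

    Hypothetical helper: entries are classified by extension only, which is
    an assumption rather than a rule of the artifact format.
    """
    groups: Dict[str, List[str]] = {"jvm": [], "python": [], "other": []}
    with zipfile.ZipFile(jar) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if name.endswith(".class"):
                groups["jvm"].append(name)
            elif name.endswith((".py", ".pyc")):
                groups["python"].append(name)
            else:
                groups["other"].append(name)
    return groups
```

An integration test could use this to confirm that an artifact fetched through --packages really carries both kinds of content before checking that the PySpark side can import it.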

erikerlandson commented 7 years ago

cc @ifilonenko @foxish @mccheah