apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Allow specifying non-local files to spark-submit (python files, and R files) #530

Open paulreimer opened 7 years ago

paulreimer commented 7 years ago

Refers to issue #527. This allows the use of Python files and R files, as well as --py-files, with spark-submit. Previously, the client would reject any non-local URI when submitting a Python job, even though the Kubernetes Spark init-container would be able to fetch it (for example, gs:// URIs when the GCS connector is present in the init-container image).

Changing the validation to allow this when isKubernetesCluster is set lets Python jobs use non-local URIs successfully. Only the client (spark-submit) requires this change; existing init-container images work as-is.

What changes were proposed in this pull request?

Adding && !isKubernetesCluster to the Python-file check in core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L328, and likewise to the R-files check. A sketch of the resulting validation appears below.

As suggested by @liyinan926 in https://github.com/apache-spark-on-k8s/spark/issues/527#issuecomment-337699249
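
For illustration, here is a sketch of the relaxed validation, paraphrased from the Spark 2.2-era SparkSubmit.scala (the exact surrounding code may differ slightly):

// Require all python files to be local, so we can add them to the PYTHONPATH.
// YARN, Mesos, and (with this change) Kubernetes cluster modes can all handle
// non-local python files, so the check is skipped for them.
if (args.isPython && !isYarnCluster && !isMesosCluster && !isKubernetesCluster) {
  if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
    printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}")
  }
  val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",")
  if (nonLocalPyFiles.nonEmpty) {
    printErrorAndExit(s"Only local additional python files are supported: $nonLocalPyFiles")
  }
}

// The analogous exemption for the R-files check:
if (args.isR && !isYarnCluster && !isMesosCluster && !isKubernetesCluster) {
  if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
    printErrorAndExit(s"Only local R files are supported: ${args.primaryResource}")
  }
}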

How was this patch tested?

Command:

./bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://http://127.0.0.1:8001 \
  --kubernetes-namespace spark \
  --conf spark.kubernetes.driver.docker.image=<custom> \
  --conf spark.kubernetes.executor.docker.image=<custom> \
  --conf spark.kubernetes.initcontainer.docker.image=<custom> \
  --conf spark.executor.instances=3 \
  --conf spark.app.name=spark-pi \
  gs://spark-resource-staging/pi.py 10

Before the change, the job did not start and spark-submit reported:

Error: Only local python files are supported: gs://spark-resource-staging/pi.py
Run with --help for usage help or --verbose for debug output

After the change, I ran ./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3 locally on my macOS dev machine, then ran its spark-submit, and I was able to submit my Python job successfully and obtain results via the logs.

ifilonenko commented 7 years ago

Good catch, thank you for this; I seem to have missed it in my PRs. LGTM, seeing as CI is passing.

paulreimer commented 7 years ago

@felixcheung Not sure, to be honest. The intent seems to be that isKubernetesCluster should also get that behaviour (skipping the Python path formatting, since remote file strings are supported), so I've added that in ecfa6f22b5. See the sketch below.
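
For context, the relevant logic in the Spark 2.2-era SparkSubmit.scala skips path normalization for cluster managers that can handle remote Python files, since PythonRunner.formatPaths() rejects non-local URIs. A paraphrased sketch of that logic with the Kubernetes exemption added (the guard in ecfa6f22b5 may differ in detail):

val resolvedPyFiles = Utils.resolveURIs(args.pyFiles)
val formattedPyFiles =
  if (!isYarnCluster && !isMesosCluster && !isKubernetesCluster) {
    PythonRunner.formatPaths(resolvedPyFiles).mkString(",")
  } else {
    // These cluster modes support remote python files and distribute them
    // themselves, so leave the URIs untouched.
    resolvedPyFiles
  }
sysProps("spark.submit.pyFiles") = formattedPyFiles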

felixcheung commented 7 years ago

Thanks, I guess we should test this. Is there a way to call out what should be tested?

ifilonenko commented 7 years ago

rerun integration tests please

paulreimer commented 6 years ago

This looks like a CI/build system error, unrelated to the changes, but I am not able to fully interpret it.

foxish commented 6 years ago

rerun integration test please

liyinan926 commented 6 years ago

Any more comments on this, or any objection to merging it?

foxish commented 6 years ago

ok to merge when tests pass.