apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Conflict with upstream SPARK-21642 #482

Closed: ash211 closed this issue 7 years ago

ash211 commented 7 years ago

This upstream PR breaks the k8s work:

https://issues.apache.org/jira/browse/SPARK-21642

felixcheung commented 7 years ago

You can set a custom hostname for Spark to use; see how Mesos does it here: https://github.com/apache/spark/blob/bb7afb4e10bea406a0d7ab03c2ed7aa753f081b7/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79
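
For illustration, the driver's advertised hostname can be pinned through standard Spark configuration; a minimal sketch, assuming a resolvable name already exists for the driver pod (the hostname below is hypothetical):

    import org.apache.spark.SparkConf

    // Hypothetical hostname; any name that resolves to the driver pod works.
    val conf = new SparkConf()
      .set("spark.driver.host", "my-driver-svc.my-namespace.svc.cluster.local")
      // Bind on all local interfaces while advertising the hostname above.
      .set("spark.driver.bindAddress", "0.0.0.0")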

ash211 commented 7 years ago

After the above change, executors are launched with this difference in the podspec (redacted a bit):

Before:

[user@ip-10-0-11-20 ~]$ kubectl get pod -n my-ns-69ede3b8943b4b108834fbaa0ba24e16 my-pod-69ede3b8-943b-4b10-8834-fbaa0ba24e16-global-sql-0-1504250692857-exec-1 -o yaml | grep -A1 SPARK_DRIVER_URL
    - name: SPARK_DRIVER_URL
      value: spark://CoarseGrainedScheduler@10.255.184.2:45978
[user@ip-10-0-11-20 ~]$

After:

[user@ip-10-0-11-20 ~]$ kubectl get pod -n my-ns-98390f2ec589476f9b5ec0b62d2c5c5e my-pod-98390f2e-c589-476f-9b5e-c0b62d2c5c5e-global-sql-0-1504252263824-exec-1 -o yaml | grep -A1 SPARK_DRIVER_URL
    - name: SPARK_DRIVER_URL
      value: spark://CoarseGrainedScheduler@my-pod-98390f2e-c589-476f-9b5e-c0b62d2c5c5e-global-sql-0-150425226:36142
[user@ip-10-0-11-20 ~]$
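
For reference, Spark builds that URL from spark.driver.host and spark.driver.port. A simplified sketch of what the scheduler back-end does internally (these classes are private[spark], and the host/port literals below are just the values from the "before" output above):

    import org.apache.spark.rpc.RpcEndpointAddress
    import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

    // In the real code these come from spark.driver.host and spark.driver.port
    // on the driver's SparkConf.
    val driverHost = "10.255.184.2"   // after SPARK-21642, the local canonical host name instead
    val driverPort = 45978
    val driverUrl = RpcEndpointAddress(
      driverHost, driverPort, CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    // => "spark://CoarseGrainedScheduler@10.255.184.2:45978"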

This is the hostname that the Spark executor uses to connect to the driver (it's the driver's hostname), but there are problems:

foxish commented 7 years ago

Pods don't get DNS names in k8s by default. Headless Services would allow us to create one without incurring much overhead.

kimoonkim commented 7 years ago

+1 to @foxish's suggestion. In general, there is a concern about putting too many entries in kube-dns, which is why k8s doesn't support DNS names for pods by default. But I think having one DNS name per job isn't too bad.

mccheah commented 7 years ago

Excellent - let's go with the headless service approach. I can propose a change.
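
A rough sketch of what creating such a headless service for the driver could look like with a Kubernetes client like fabric8; the service name, selector labels, namespace, and port below are illustrative, not the actual proposal:

    import scala.collection.JavaConverters._
    import io.fabric8.kubernetes.api.model.{IntOrString, ServiceBuilder}
    import io.fabric8.kubernetes.client.DefaultKubernetesClient

    // Illustrative values only.
    val driverPort = 7078
    val headlessService = new ServiceBuilder()
      .withNewMetadata()
        .withName("spark-driver-svc")
        .endMetadata()
      .withNewSpec()
        .withClusterIP("None")  // "None" is what makes the Service headless
        .withSelector(Map("spark-role" -> "driver").asJava)
        .addNewPort()
          .withName("driver-rpc-port")
          .withPort(driverPort)
          .withTargetPort(new IntOrString(driverPort))
          .endPort()
        .endSpec()
      .build()

    val client = new DefaultKubernetesClient()
    client.services().inNamespace("my-namespace").create(headlessService)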

mccheah commented 7 years ago

Using a service does imply that DNS has to be running in the cluster for Spark to work, though. I think this is fine, since to my understanding most real clusters will have DNS, but it's worth calling out in the documentation.

@foxish to clarify, does the hostname we should set for the driver's URL map to exactly the name of the service? What's the mapping from namespace + service name to appropriate hostname?

Edit: Found the answer in https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
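
Per those docs, a Service resolves as <service-name>.<namespace>.svc.cluster.local, so the driver URL could be assembled along these lines (names and port are illustrative; the real values would come from the submission client):

    // Illustrative values only.
    val serviceName = "spark-driver-svc"
    val namespace   = "my-namespace"
    val driverPort  = 7078

    val driverHostname = s"$serviceName.$namespace.svc.cluster.local"
    val driverUrl = s"spark://CoarseGrainedScheduler@$driverHostname:$driverPort"
    // => spark://CoarseGrainedScheduler@spark-driver-svc.my-namespace.svc.cluster.local:7078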