apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Add ability to use custom kube-dns zone. #593

Closed f355 closed 6 years ago

f355 commented 6 years ago

What changes were proposed in this pull request?

The default `cluster.local` zone is not set in stone, so if kube-dns is configured to use a different zone, Spark needs to know about it in order to allow the executors to reach the driver.
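The idea can be sketched as follows. This is a minimal illustration, not the actual patch: the function name and parameters are hypothetical, but it shows the shape of the change, i.e. making the trailing cluster domain a configurable option instead of a hard-coded `cluster.local`.

```scala
// Hypothetical sketch: build the driver service FQDN from a configurable
// cluster domain rather than assuming the default "cluster.local" zone.
object DriverDns {
  def driverServiceHostname(
      serviceName: String,
      namespace: String,
      clusterDomain: String = "cluster.local"): String =
    s"$serviceName.$namespace.svc.$clusterDomain"
}
```

With a non-default zone, an executor would then resolve something like `DriverDns.driverServiceHostname("spark-driver-svc", "spark", "my.zone")` instead of the hard-coded default.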

How was this patch tested?

The patch includes changes to two unit tests, both of which pass. It is hard to integration-test this patch because that would require a custom Kubernetes setup; however, the current integration tests pass while using the default value for the added option.

f355 commented 6 years ago

I have no idea why the integration tests failed. I don't think it's related to my change, but if it is, I'd be grateful for some pointers on what's happening.

ifilonenko commented 6 years ago

Those integration test failures aren’t related to your additions. That failure is sporadically seen across pull requests.

foxish commented 6 years ago

@f355 - have you considered leaving the DNS zone out of the equation entirely? If the search path is set up correctly, we shouldn't need it at all.

f355 commented 6 years ago

@foxish you mean `<driver-svc>.<namespace>.svc`, omitting the domain? Good idea; I hadn't thought of that. It looks like it would work in the majority of cases. I'm not a Kubernetes expert, though, so I'm not sure whether there are cases where this would break.
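The suggestion above can be sketched like this (names are illustrative, not from the actual code): drop the zone entirely and rely on the pod's DNS search path, which in a standard Kubernetes setup includes `svc.<cluster-domain>`, to resolve the short service name.

```scala
// Illustrative sketch of the zone-less form: no cluster domain at all.
// Resolution of this short name depends on the pod's resolv.conf search
// path being configured as Kubernetes normally does.
object DriverShortName {
  def driverServiceShortName(serviceName: String, namespace: String): String =
    s"$serviceName.$namespace.svc"
}
```

This avoids Spark having to know the cluster domain at all, at the cost of depending on the resolver's search-path behavior inside the pod.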

I think if we go with your suggestion by default and allow specifying a domain for the rare cases where it's needed, we should be good.

But before I jump to implementing it, I'd like to hear from the maintainers.

foxish commented 6 years ago

This was just fixed upstream in https://github.com/apache/spark/pull/20187, and SIG Network confirmed it was an acceptable way to proceed. Thanks for sending this PR, by the way; we wouldn't have discovered this latent issue if it hadn't been brought up here. Cheers!

f355 commented 6 years ago

Awesome, thank you!