apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Secure HDFS Support requires most recent PRs #390

Open · ifilonenko opened this issue 7 years ago

ifilonenko commented 7 years ago

While investigating Secure HDFS support, I have found that the recent PRs that move the delegation token renewal logic into Spark Core are instrumental in providing a clean implementation. The PR of focus is this one: SPARK-20434: Move Hadoop delegation token code from yarn to core by mgummelt · Pull Request #17723 · apache/spark
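
For context, the heart of what SPARK-20434 moves into core is a pluggable interface for obtaining Hadoop delegation tokens. Below is a minimal sketch of roughly that shape; the trait name and signatures are approximations of what landed in org.apache.spark.deploy.security, not a verbatim copy.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials

// Rough sketch of the provider interface SPARK-20434 moves from the YARN
// module into core; exact names/signatures in apache/spark may differ.
trait HadoopDelegationTokenProvider {

  // Unique name of the service this provider fetches tokens for, e.g. "hdfs".
  def serviceName: String

  // Whether tokens are needed at all (false when security is disabled).
  def delegationTokensRequired(hadoopConf: Configuration): Boolean

  // Obtain tokens, add them to `creds`, and optionally return the next
  // renewal time in milliseconds so a renewal thread can be scheduled.
  def obtainDelegationTokens(hadoopConf: Configuration, creds: Credentials): Option[Long]
}
```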

That PR is an initial step toward this one: SPARK-16742: Mesos Kerberos Support by mgummelt · Pull Request #18519 · apache/spark. Since we will be reusing a lot of this logic, what is our strategy for building on these most recent commits, rather than resorting to ugly reflection to access private methods in private packages (if that is even possible)?
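
To make concrete why the reflection route is so unattractive: reaching private code from outside its package means string-based lookups like the hypothetical sketch below (the class and method names are placeholders, not real Spark API). Nothing here is checked by the compiler, so any upstream rename or signature change only breaks at runtime.

```scala
// Hypothetical illustration of the "ugly reflection" approach; the method
// name below is a placeholder, not an actual Spark API.
object ReflectionHack {
  def callPrivateObtainTokens(manager: AnyRef, hadoopConf: AnyRef): AnyRef = {
    val method = manager.getClass
      .getDeclaredMethods
      .find(_.getName == "obtainDelegationTokens") // fragile: breaks on rename
      .getOrElse(throw new NoSuchMethodException("obtainDelegationTokens"))
    method.setAccessible(true)         // bypass JVM access checks
    method.invoke(manager, hadoopConf) // untyped, unchecked invocation
  }
}
```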

This issue is in reference to this PR: #373

dimberman commented 7 years ago

It sounds like you're creating a circular dependency.

The yarn package depends on the spark-core package; any attempt to pull that code back into spark-core would expose you to bugs down the road and become a nightmare of reflection and code injection.

There are a few solutions I can think of based on this issue:

  1. Pull whatever dependencies you need back into spark-core (though judging by our IRL conversation, that might not go over so well)
  2. Create a new package that has dependencies on spark-core and yarn
  3. Move the Hadoop delegation code into its own module and modify the YARN package to depend on that hadoop-delegation-token module (see the build sketch after this list).
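
To illustrate option 3, here is a minimal build sketch (sbt syntax, with hypothetical module names) of how extracting the token code into its own module avoids the cycle: both core and yarn depend on it, and it depends on neither.

```scala
// build.sbt sketch -- hypothetical layout for option 3.

// The delegation token code lives in a standalone module that depends only
// on Hadoop, never back on Spark modules.
lazy val hadoopDelegationToken = (project in file("hadoop-delegation-token"))
  .settings(
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.3" % Provided
  )

// spark-core can use the token logic without knowing about YARN...
lazy val core = (project in file("core"))
  .dependsOn(hadoopDelegationToken)

// ...and yarn keeps its dependencies pointing "down" the stack: no cycle.
lazy val yarn = (project in file("resource-managers/yarn"))
  .dependsOn(core, hadoopDelegationToken)
```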

Personally, options 2 and 3 seem like the best stop-gap solutions; either can easily transition into a more ideal setup at a future date.