apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0
612 stars 118 forks source link

Support HDFS rack locality #350

Closed kimoonkim closed 7 years ago

kimoonkim commented 7 years ago

Closes #349 and #206.

@ash211 @foxish

Supports HDFS rack locality by implementing getRackForHost in KubernetesTaskSchedulerImpl

Added unit tests.

Also did manual testing using a dummy topology script that always returns a dummy rack name, "/rack0".

The driver log shows a small number ofRACK_LOCAL tasks, which used to be ANY tasks. (The majority of tasks are still NODE_LOCAL tasks)

2017-06-16 16:56:38 INFO KubernetesTaskSetManager:54 - Starting task 34.0 in stage 0.0 (TID 57, 10.44.0.5, executor 9, partition 34, RACK_LOCAL, 6718 bytes) 2017-06-16 16:56:39 INFO KubernetesTaskSetManager:54 - Starting task 40.0 in stage 0.0 (TID 59, 10.44.0.4, executor 7, partition 40, RACK_LOCAL, 6718 bytes) 2017-06-16 16:56:39 INFO KubernetesTaskSetManager:54 - Starting task 50.0 in stage 0.0 (TID 63, 10.46.0.4, executor 5, partition 50, RACK_LOCAL, 6719 bytes)

The job was HdfsTest.

/usr/local/spark-on-k8s/bin/spark-submit --class org.apache.spark.examples.HdfsTest --conf spark.app.name=spark-hdfstest --conf spark.dynamicAllocation.enabled=false --conf spark.shuffle.service.enabled=false --conf spark.executor.instances=10 --conf spark.kubernetes.shuffle.labels="app=spark-shuffle-service,spark-version=2.1.0" local:///opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.2.0-SNAPSHOT.jar 10GB-txt

spark-defaults.conf specified the dummy topology script:

spark.hadoop.fs.defaultFS hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
spark.hadoop.net.topology.script.file.name /tmp/print_rack.sh

The dummy script print_rack.sh

#!/bin/bash

echo /rack-0

I added the script to the driver docker image manually.

kimoonkim commented 7 years ago

FYI, the unit tests failure seem genuine. I'm looking at it.

foxish commented 7 years ago

Thanks @kimoonkim! This looks awesome. Looking into this in detail shortly.

foxish commented 7 years ago

LGTM! Thanks @kimoonkim, that looks good, and appears to handle both hdfs nodes and executor pods well. Question - for the executor pods, how do you see the script to resolve rack-info working? Could it have access to more than just the Pod IP to find what rack it belongs to?

kimoonkim commented 7 years ago

@foxish Good question. My understanding is that executors do not call the topology plugin. Only the driver will consult with the topology plugin to decide which executor, hopefully a rack-local one, should receive a new task.

When an executor reads a HDFS block for the new task, it will then simply use Hadoop library code that sends a RPC request to the namenode. The namenode will consider the list of datanodes that have copies of the block. And the namenode asks the topology plugin which datanodes are better. When the namenode returns the list of datanodes, it sort them in the locality order. (node local to rack local to remote)

FYI, the HDFS on kubernetes we have does not support configuration of the topology plugin yet in the namenode helm chart. But I intend to do it soon.

The pod IP question on the namenode side is not as important as we originally thought. Because most k8s network plugins do NAT and the namenode sees the k8s cluster node IPs. (The pod IP issue on the namenode side only manifests in the kubenet on GKE) For details, please see this kubernetes-HDFS/topology/README.md

kimoonkim commented 7 years ago

@ash211 @foxish @mccheah Thanks for reviews so far. I wonder if we have any more questions or comments on this. Maybe ready to be merged soon?

kimoonkim commented 7 years ago

@mccheah Thanks for the review. Addressed comments. PTAL.

kimoonkim commented 7 years ago

I was trying to retarget this to branch-2.2-kubernetes using github UI. Then I realized that there will be too many diffs in this PR.

Maybe I should retarget to branch-2.2-kubernetes and merge changes from branch-2.2-kubernetes? @foxish Any better suggestion?

kimoonkim commented 7 years ago

Ok. Retargeted to branch-2.2-kubernetes after rebasing my branch.

kimoonkim commented 7 years ago

Merged with branch-2.2-kubernetes. While at it, also incorporated the flag from #412 to avoid expensive DNS lookup.

kimoonkim commented 7 years ago

@mccheah Can you please take a look at this PR? Perhaps, this is ready to merge after another look.

kimoonkim commented 7 years ago

@foxish @ash211 @mccheah Thanks for the reviews. Maybe this can be merged now?