I will be running some benchmarks with this on an internal application. @ash211 and I have a suspicion that this is a bottleneck for some of our workflows.
@mccheah build failed with
error file=/home/jenkins/workspace/PR-spark-k8s-full-build/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/kubernetes/ExecutorPodFactory.scala message=import.ordering.missingEmptyLine.message line=24 column=0
+1. Thanks for doing this, @mccheah
Would this also apply to block manager dirs, which also involve local disk I/O? (I didn't look at the change in detail. I'd love to review the code, actually.)
Missing a unit test for the new configuration step. The flow has changed: we now set `spark.local.dir` on the driver from the submission client and allow that setting to propagate through to the executors. The driver always gets `emptyDir` volume mounts, but the executors get either `hostPath` or `emptyDir` volume mounts depending on whether the external shuffle service is used. I think the code is a little fragmented and could be better organized, but I will need a bit more time to think about a more cohesive architecture.
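For illustration, here is a minimal sketch of how that branching could look with the fabric8 client used to build the pod specs; the method name and its inputs (`resolvedLocalDirs`, `useExternalShuffleService`) are assumptions for the sketch, not the PR's actual code:

```scala
import io.fabric8.kubernetes.api.model.{Volume, VolumeBuilder, VolumeMount, VolumeMountBuilder}

// Hypothetical sketch: one volume + mount per configured local directory.
// `resolvedLocalDirs` and `useExternalShuffleService` are assumed inputs.
def localDirVolumes(
    resolvedLocalDirs: Seq[String],
    useExternalShuffleService: Boolean): Seq[(Volume, VolumeMount)] = {
  resolvedLocalDirs.zipWithIndex.map { case (dir, index) =>
    val volumeName = s"spark-local-dir-$index"
    val volume =
      if (useExternalShuffleService) {
        // The node-level shuffle service has to read shuffle files from the
        // node's disk, so the directory must be a hostPath shared with the node.
        new VolumeBuilder()
          .withName(volumeName)
          .withNewHostPath().withPath(dir).endHostPath()
          .build()
      } else {
        // Without the shuffle service, a pod-scoped emptyDir suffices and
        // avoids writing through the container's filesystem layer.
        new VolumeBuilder()
          .withName(volumeName)
          .withNewEmptyDir().endEmptyDir()
          .build()
      }
    val mount = new VolumeMountBuilder()
      .withName(volumeName)
      .withMountPath(dir)
      .build()
    (volume, mount)
  }
}
```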
@kimoonkim good call - I added some documentation to clarify how `spark.local.dir` has to be used in Kubernetes mode. Please take a look.
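As a quick illustration of the user-facing side (not taken from the added docs), `spark.local.dir` is a standard comma-separated Spark setting, and each listed path is a directory the pods need a writable mount for:

```scala
import org.apache.spark.SparkConf

// Illustrative only: spark.local.dir takes a comma-separated list of scratch
// directories. In Kubernetes mode these are the paths the driver/executor pods
// back with emptyDir or hostPath volume mounts, per the discussion above.
val conf = new SparkConf()
  .set("spark.local.dir", "/tmp/spark-local-1,/tmp/spark-local-2")
```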
We did some benchmarks in EC2 and found that disk performance inside a k8s `emptyDir` or `hostPath` volume was significantly better than just writing to a path in the container (through the Docker layer):
defaults / not specified / in container
{
write_prewarm_throughput_sequential_mbps: 429.36698245476106241000,
write_warm_throughput_sequential_mbps: 419.53243028704580954000,
read_throughput_sequential_mbps: 176.50037165910780268000,
write_latency_ms: 3.34200000000000000000,
hostname: ip-10-0-15-214.ec2.internal,
instance_type: r3.4xlarge,
}
hostPath
{
write_prewarm_throughput_sequential_mbps: 960.79753701803254654000,
write_warm_throughput_sequential_mbps: 967.83530732197182266000,
read_throughput_sequential_mbps: 1074.71302066251229911000,
write_latency_ms: 1.6100000000000000000,
hostname: ip-10-0-15-214.ec2.internal,
instance_type: r3.4xlarge,
}
emptyDir (default medium)
{
write_prewarm_throughput_sequential_mbps: 960.71302920136038466000,
write_warm_throughput_sequential_mbps: 952.94596638166695748000,
read_throughput_sequential_mbps: 1075.77150361129349967000,
write_latency_ms: 1.64200000000000000000,
hostname: ip-10-0-15-214.ec2.internal,
instance_type: r3.4xlarge,
}
In summary: 430 vs 960 mbps write throughput, 176 vs 1075 mbps read throughput, and 3.3 vs 1.6 ms write latency.
Given this, I'm fully convinced that placing shuffle and spill data on a k8s `emptyDir` mount is critical for Spark performance.
Need to rebase this onto branch-2.2-kubernetes now that https://github.com/apache-spark-on-k8s/spark/pull/459 has merged.
Replaced by #522
Thanks for sharing the finding @ash211 and @mccheah!
Closes #439.
This might prove to be important for performance, especially in shuffle-heavy computations where the executors perform a large amount of disk I/O. We only provision these volumes in static allocation mode without the external shuffle service, because using the shuffle service requires mounting `hostPath` volumes instead.
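A rough sketch of that provisioning decision, assuming the standard Spark settings for dynamic allocation and the shuffle service (the helper itself is illustrative, not the PR's code):

```scala
import org.apache.spark.SparkConf

// Hypothetical helper: emptyDir-backed local dirs only apply in static
// allocation without the external shuffle service; otherwise shuffle files
// must live on a hostPath mount so the node-level shuffle service can read them.
def shouldUseEmptyDirLocalDirs(conf: SparkConf): Boolean = {
  val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  !dynamicAllocationEnabled && !shuffleServiceEnabled
}
```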