I will be running some benchmarks with this on an internal application. @ash211 and I have a suspicion that this is a bottleneck for some of our workflows.
@mccheah build failed with
error file=/home/jenkins/workspace/PR-spark-k8s-full-build/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/kubernetes/ExecutorPodFactory.scala message=import.ordering.missingEmptyLine.message line=24 column=0
+1. Thanks for doing this, @mccheah
Would this also apply to block manager dirs, which also involve local disk I/O? (I didn't look at the change in detail. I'd love to review the code, actually.)
Missing a unit test for the new configuration step. The flow has changed: we now set `spark.local.dir` on the driver from the submission client and allow that setting to propagate through to the executors. The driver always gets `emptyDir` volume mounts, but the executors get either `hostPath` or `emptyDir` volume mounts depending on whether the external shuffle service is used. I think the code is a little fragmented and could be better organized, but I will need a bit more time to think about a more cohesive architecture.
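For illustration, here is a minimal sketch of how that branching could look with the fabric8 client used to build the pod specs; the method name and its inputs (`resolvedLocalDirs`, `useExternalShuffleService`) are assumptions for the sketch, not the PR's actual code:

```scala
import io.fabric8.kubernetes.api.model.{Volume, VolumeBuilder, VolumeMount, VolumeMountBuilder}

// Hypothetical sketch: one volume + mount per configured local directory.
// `resolvedLocalDirs` and `useExternalShuffleService` are assumed inputs.
def localDirVolumes(
    resolvedLocalDirs: Seq[String],
    useExternalShuffleService: Boolean): Seq[(Volume, VolumeMount)] = {
  resolvedLocalDirs.zipWithIndex.map { case (dir, index) =>
    val volumeName = s"spark-local-dir-$index"
    val volume =
      if (useExternalShuffleService) {
        // The node-level shuffle service has to read shuffle files from the
        // node's disk, so the directory must be a hostPath shared with the node.
        new VolumeBuilder()
          .withName(volumeName)
          .withNewHostPath().withPath(dir).endHostPath()
          .build()
      } else {
        // Without the shuffle service, a pod-scoped emptyDir suffices and
        // avoids writing through the container's filesystem layer.
        new VolumeBuilder()
          .withName(volumeName)
          .withNewEmptyDir().endEmptyDir()
          .build()
      }
    val mount = new VolumeMountBuilder()
      .withName(volumeName)
      .withMountPath(dir)
      .build()
    (volume, mount)
  }
}
```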
@kimoonkim good call - I added some documentation to clarify how `spark.local.dir` has to be used in Kubernetes mode. Please take a look.
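As a quick illustration of the user-facing side (not taken from the added docs), `spark.local.dir` is a standard comma-separated Spark setting, and each listed path is a directory the pods need a writable mount for:

```scala
import org.apache.spark.SparkConf

// Illustrative only: spark.local.dir takes a comma-separated list of scratch
// directories. In Kubernetes mode these are the paths the driver/executor pods
// back with emptyDir or hostPath volume mounts, per the discussion above.
val conf = new SparkConf()
  .set("spark.local.dir", "/tmp/spark-local-1,/tmp/spark-local-2")
```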
We did some benchmarks in EC2 and found that disk performance inside a k8s `emptyDir` or `hostPath` volume was significantly better than just writing to a path in the container (through the Docker layer):
defaults / not specified / in container
{
write_prewarm_throughput_sequential_mbps: 429.36698245476106241000,
write_warm_throughput_sequential_mbps: 419.53243028704580954000,
read_throughput_sequential_mbps: 176.50037165910780268000,
write_latency_ms: 3.34200000000000000000,
hostname: ip-10-0-15-214.ec2.internal,
instance_type: r3.4xlarge,
}
hostPath
{
write_prewarm_throughput_sequential_mbps: 960.79753701803254654000,
write_warm_throughput_sequential_mbps: 967.83530732197182266000,
read_throughput_sequential_mbps: 1074.71302066251229911000,
write_latency_ms: 1.6100000000000000000,
hostname: ip-10-0-15-214.ec2.internal,
instance_type: r3.4xlarge,
}
emptyDir (default medium)
{
write_prewarm_throughput_sequential_mbps: 960.71302920136038466000,
write_warm_throughput_sequential_mbps: 952.94596638166695748000,
read_throughput_sequential_mbps: 1075.77150361129349967000,
write_latency_ms: 1.64200000000000000000,
hostname: ip-10-0-15-214.ec2.internal,
instance_type: r3.4xlarge,
}
In summary: 430 vs 960 mbps write throughput, 176 vs 1075 mbps read throughput, and 3.3 vs 1.6 ms write latency.
Given this, I'm fully convinced that placing shuffle and spill data on a k8s `emptyDir` mount is critical for Spark performance.
Need to rebase this onto branch-2.2-kubernetes now that https://github.com/apache-spark-on-k8s/spark/pull/459 has merged.
Replaced by #522
Thanks for sharing the finding @ash211 and @mccheah!
Closes #439.
This might prove to be important for performance, especially in shuffle-heavy computations where the executors perform a large amount of disk I/O. We only provision these volumes in static allocation mode without the external shuffle service, because using the shuffle service requires mounting `hostPath` volumes instead.
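A rough sketch of that provisioning decision, assuming the standard Spark settings for dynamic allocation and the shuffle service (the helper itself is illustrative, not the PR's code):

```scala
import org.apache.spark.SparkConf

// Hypothetical helper: emptyDir-backed local dirs only apply in static
// allocation without the external shuffle service; otherwise shuffle files
// must live on a hostPath mount so the node-level shuffle service can read them.
def shouldUseEmptyDirLocalDirs(conf: SparkConf): Boolean = {
  val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  !dynamicAllocationEnabled && !shuffleServiceEnabled
}
```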