apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Spark executors may make the node run out of root disk space #260

Open kimoonkim opened 7 years ago

kimoonkim commented 7 years ago

@foxish

I was running a large scale Spark TeraSort job against HDFS. The input size was 250 GB and the cluster had 9 nodes, each with 3 x 2 TB disks in addition to the 20 GB root disk.

The total HDFS capacity was much larger than the TeraSort data size. Notice, however, that the root disk is relatively small.

The executors soon finished reading the input data and started the shuffle. In the middle of the shuffle, many executor and datanode pods were suddenly killed. Here's the kubectl get pods output:

NAMESPACE     NAME                                                    READY     STATUS    RESTARTS   AGE
default       hdfs-datanode-15t9j                                     0/1       Evicted   0          1h
default       hdfs-datanode-2zz6s                                     1/1       Running   0          1h
default       hdfs-datanode-32p6p                                     1/1       Running   0          1h
default       hdfs-datanode-4f19w                                     1/1       Running   0          1h
default       hdfs-datanode-6hwc4                                     1/1       Running   0          1h
default       hdfs-datanode-c12dr                                     1/1       Running   0          1h
default       hdfs-datanode-cdh50                                     0/1       Evicted   0          1h
default       hdfs-datanode-g24wz                                     0/1       Evicted   0          1h
default       hdfs-datanode-gw23n                                     1/1       Running   0          1h
default       hdfs-namenode-0                                         1/1       Running   0          1h
default       spark-terasort-1493767302075                            1/1       Running   0          21m
default       spark-terasort-1493767302075-exec-1                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-2                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-3                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-4                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-5                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-6                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-7                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-8                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-9                     0/1       Evicted   0          21m

Here's the output of kubectl describe EXECUTOR-POD:

Events:
  FirstSeen LastSeen    Count   From                    SubObjectPath           Type        Reason      Message
  --------- --------    -----   ----                    -------------           --------    ------      -------
  19m       19m     1   {default-scheduler }                            Normal      Scheduled   Successfully assigned spark-terasort-1493767302075-exec-1 to ip-172-20-39-112.ec2.internal
  19m       19m     1   {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor}   Normal      Pulled      Container image "gcr.io/smooth-copilot-163600/spark-executor:baseline-20170420-0a13206df6" already present on machine
  19m       19m     1   {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor}   Normal      Created     Created container with docker id 005ba71fd89f; Security:[seccomp=unconfined]
  19m       19m     1   {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor}   Normal      Started     Started container with docker id 005ba71fd89f
  12m       12m     1   {kubelet ip-172-20-39-112.ec2.internal}                 Warning     Evicted     The node was low on resource: nodefs.
  12m       12m     1   {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor}   Normal      Killing     Killing container with docker id 005ba71fd89f: Need to kill pod.

I believe this is because executors get their working space from the root volume, and the root volume is too small for the job's data (250 GB / 9 == 27 GB > 20 GB). I wonder if we thought about this before. Maybe an external shuffle service is required for us to process large amounts of data.

This can also lead to slow shuffle performance, as executors don't utilize the bandwidth offered by multiple disks.
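For context, here is a rough sketch of what the executor pod effectively looks like today (the pod name and image are placeholders): with no volume mounted for scratch space, Spark's local directories sit on the container's writable layer, which is backed by the node's small root disk.

apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-default-example    # hypothetical name
spec:
  containers:
  - name: executor
    image: spark-executor:example     # placeholder image
    # No volumes or volumeMounts: Spark's local directories
    # (spark.local.dir, /tmp by default) live inside the container
    # filesystem, i.e. on the node's 20 GB root disk.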

foxish commented 7 years ago

This is a great experiment. You are right: there is no way to back an EmptyDir with the separately provisioned disks. There is work in that direction in the local storage proposal, but it is at least one quarter away from an alpha implementation.

But this problem makes me wonder if we should prioritize allowing users to customize the Driver/Executor Pod YAML directly. If that were possible, I can see two ways to get around this problem without the shuffle service:

  1. Use a hostPath volume with the mount backed by the SSD (a sketch follows below)
  2. Use a PersistentVolume instead of an EmptyDir for the executor pods (but this would probably need some new mechanism to attach a different persistent volume to each executor)
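To make option 1 concrete, here is a rough sketch of what a customized executor pod could look like, assuming the admin mounts one of the SSDs at /mnt/ssd0 on every node (the names and paths are hypothetical). Spark is pointed at the mount through SPARK_LOCAL_DIRS, which overrides spark.local.dir:

apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-hostpath-example    # hypothetical name
spec:
  containers:
  - name: executor
    image: spark-executor:example      # placeholder image
    env:
    - name: SPARK_LOCAL_DIRS           # Spark writes shuffle and spill files here
      value: /spark-local
    volumeMounts:
    - name: spark-local
      mountPath: /spark-local
  volumes:
  - name: spark-local
    hostPath:
      path: /mnt/ssd0/spark-local      # assumes the SSD is mounted at /mnt/ssd0

Option 2 would swap the hostPath volume for a persistentVolumeClaim, but as noted above it needs some mechanism to hand each executor its own claim.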

ash211 commented 7 years ago

Some work around these local directories used for shuffle data is happening at https://github.com/apache-spark-on-k8s/spark/pull/486, which places shuffle data onto an EmptyDir instead of the root container volume.

It doesn't directly address this problem, though, where the local disk on a node is too small for the amount of shuffle data being written. YARN has the same problem.

Even if we did have the linked local storage proposal, I don't think it handles the problem of unbounded shuffle writes to local storage. The closest use case in the proposal is https://github.com/kubernetes/community/pull/306/files#diff-4b06e12d174f501447474cee5093950cR117, but the end result is that the pod is still killed because it has exceeded the limits the cluster administrator imposed on Bob's pods via the namespace.

That said, we obviously can't conjure additional local disk when there isn't any on the kubelet, so I'm not sure what practically we should do if an EmptyDir fills up.
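One partial mitigation, assuming a Kubernetes version where the local storage isolation work is available, would be to put an explicit bound on the EmptyDir so that filling it up evicts only the offending pod instead of pushing the whole node below the nodefs eviction threshold. A rough sketch (the size and names are illustrative, not a recommendation):

apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-emptydir-example    # hypothetical name
spec:
  containers:
  - name: executor
    image: spark-executor:example      # placeholder image
    env:
    - name: SPARK_LOCAL_DIRS
      value: /spark-local
    volumeMounts:
    - name: spark-local
      mountPath: /spark-local
  volumes:
  - name: spark-local
    emptyDir:
      sizeLimit: 15Gi                  # illustrative cap; exceeding it evicts only this pod

The executor still fails when the cap is hit, of course; it just fails in a way that is attributable to the job rather than taking datanode pods down with it.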

kimoonkim commented 7 years ago

I wonder if a pod can use multiple emptyDirs, each ideally coming from a different physical disk, used in a round-robin fashion. That way, if a cluster node has multiple disks, the pod's working directory won't be bounded by a single disk's size.
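As far as I know, plain emptyDirs on the same node all come from a single filesystem, so spreading across physical disks would need volumes backed by per-disk mounts (hostPath or similar). Spark itself already spreads block files across a comma-separated list of local directories, so a sketch along these lines might work (paths and names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-multidisk-example    # hypothetical name
spec:
  containers:
  - name: executor
    image: spark-executor:example       # placeholder image
    env:
    # Spark distributes shuffle and spill files across all directories
    # listed here, so two disk-backed mounts double the usable space.
    - name: SPARK_LOCAL_DIRS
      value: "/spark-local-1,/spark-local-2"
    volumeMounts:
    - name: disk1
      mountPath: /spark-local-1
    - name: disk2
      mountPath: /spark-local-2
  volumes:
  - name: disk1
    hostPath:
      path: /mnt/disk1/spark            # assumes one physical disk mounted at /mnt/disk1
  - name: disk2
    hostPath:
      path: /mnt/disk2/spark            # assumes another disk mounted at /mnt/disk2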

I can also imagine the cluster admin doing this at the logical volume manager layer, so that each k8s cluster node presents one very large logical volume combining multiple physical disks. That could be a workaround.