apache-spark-on-k8s / kubernetes-HDFS

Repository holding configuration files for running an HDFS cluster in Kubernetes
Apache License 2.0

Support multiple hdfs datanodes per k8s node? #30

Open · echarles opened 6 years ago

echarles commented 6 years ago

PodCIDRToNodeMapping requires one datanode per k8s node (this is enforced by the helm charts, which deploy a DaemonSet for the datanodes).

Now, I would like to host multiple HDFS datanodes on one K8S node (say I have spare capacity on those nodes, with a separate volume for each of those datanodes).

What about implementing, alongside the current PodCIDRToNodeMapping, a new Java class PodCIDRToPodMapping?

I haven't tested the feasibility yet, but is there anything that would make this impossible? Maybe this topic has already been discussed?

kimoonkim commented 6 years ago

It is not currently possible to run multiple datanode daemons per node, because datanode daemons open the same port. It might be possible if we configure different ports per daemon. There might be other things that conflict.

If you just have multiple volumes per node, the helm chart option dataNodeHostPath supports multiple disks. From values.yaml:

# A list of the local disk directories on cluster nodes that will contain the datanode
# blocks. These paths will be mounted to the datanode as K8s HostPath volumes.
# In a command line, the list should be enclosed in '{' and '}'.
# e.g. --set "dataNodeHostPath={/hdfs-data,/hdfs-data1}"
dataNodeHostPath:
  - /hdfs-data
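
To illustrate, with two disks the resulting datanode pod spec would contain roughly the following. This is a minimal sketch, not the chart's actual template; the volume names, mount paths, and image are placeholders:

# Sketch of the pod spec fragment generated for two entries of dataNodeHostPath
# (volume names, mount paths, and image are illustrative).
spec:
  containers:
    - name: datanode
      image: uhopper/hadoop-datanode:2.7.2   # placeholder image
      volumeMounts:
        - name: hdfs-data-0
          mountPath: /hadoop/dfs/data/0      # one entry of dfs.datanode.data.dir
        - name: hdfs-data-1
          mountPath: /hadoop/dfs/data/1
  volumes:
    - name: hdfs-data-0
      hostPath:
        path: /hdfs-data                     # first entry of dataNodeHostPath
    - name: hdfs-data-1
      hostPath:
        path: /hdfs-data1                    # second entry, e.g. --set "dataNodeHostPath={/hdfs-data,/hdfs-data1}"
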
echarles commented 6 years ago

Thx for the hint on the dataNodeHostPath.

About the port, I would expect an exposed service to manage any potential conflict. When you expose a service, it is seen as a socket (a combination of IP address and port number), so with multiple datanode pods on a host we would have different IP addresses, all with the same port number. There may be other conflicts (to be tested).
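
As a rough sketch of that idea (the Service name and labels are my assumptions, and 50010/50020 are the Hadoop 2.x datanode defaults), a headless Service would give each datanode pod its own DNS record and IP while every pod keeps the same port numbers:

apiVersion: v1
kind: Service
metadata:
  name: hdfs-datanode          # placeholder name
spec:
  clusterIP: None              # headless: DNS returns the individual pod IPs
  selector:
    app: hdfs-datanode         # placeholder label assumed to be on the datanode pods
  ports:
    - name: data-transfer
      port: 50010              # Hadoop 2.x default dfs.datanode.address port
    - name: ipc
      port: 50020              # Hadoop 2.x default dfs.datanode.ipc.address port
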

Although labels can be used to exclude hosts, I still see a potential advantage in having multiple pods on the same host in the case of unbalanced clusters (in terms of node CPU and memory).

I'd like to give PodCIDRToPodMapping a try and just wonder whether you see any other technical blockers. Another advantage of this approach would be to use a StorageClass (https://kubernetes.io/docs/concepts/storage/storage-classes/) responsible for mounting the volumes on the pod. The current approach, with the chart predefining a fixed number of paths, sounds less dynamic.

kimoonkim commented 6 years ago

Excellent questions.

The datanode DaemonSet currently uses hostNetwork, so the pods do not get virtual pod IP addresses. Instead, they are associated with the physical IP addresses of the cluster nodes; hence the port-collision concern. Also, AFAIK, a DaemonSet does not define any Services on its own. We would have to somehow add Services if we wanted them.

Why do we use hostNetwork? Two reasons:

  1. HDFS namenode/datanode code assumes that hostname-to-IP and reverse IP-to-hostname translation work for those daemons. The pod network, in particular kube-dns, does not support this. This could be addressed by adding Services.

  2. Secure HDFS with Kerberos assumes each daemon instance has a separate Kerberos principal, and the principal includes the hostname as part of the account name. If we somehow used a StatefulSet for datanodes, this might be addressed.

So hostNetwork is the easiest way to meet these requirements. There might be other constraints as well.
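
For reference, the host-network part of the DaemonSet pod spec looks roughly like this (a sketch with placeholder names and image, not the chart's exact manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hdfs-datanode                        # placeholder name
spec:
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      hostNetwork: true                      # datanode binds to the node's own IP, so ports collide if two run on one node
      dnsPolicy: ClusterFirstWithHostNet     # keep resolving cluster DNS names while on the host network
      containers:
        - name: datanode
          image: uhopper/hadoop-datanode:2.7.2   # placeholder image
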

tirumaraiselvan commented 6 years ago

@kimoonkim @echarles Wish I had seen this issue earlier. I was facing the same issues: I did not want to use hostNetwork and wanted to scale datanodes independently of nodes. I made the setup work exactly by using StatefulSets for datanodes, so the DNS lookup/reverse lookup works.

echarles commented 6 years ago

@tirumaraiselvan For now I am using StatefulSets in this repo to scale the HDFS datanodes up/down independently of the nodes. It works pretty well, but I have paused the checks on the locality level. I will get back to this soon to make sure performance holds up.
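
For reference, the shape of that StatefulSet approach is roughly as follows. This is a sketch with placeholder names, image, StorageClass, and sizes, not the exact manifest from the repo:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode                    # placeholder name
spec:
  serviceName: hdfs-datanode             # headless Service that gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: uhopper/hadoop-datanode:2.7.2   # placeholder image
          volumeMounts:
            - name: hdfs-data
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:                  # dynamic provisioning via a StorageClass instead of fixed hostPath dirs
    - metadata:
        name: hdfs-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard       # placeholder StorageClass
        resources:
          requests:
            storage: 100Gi
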

Did you test the locality on your side with the steps described here?

tirumaraiselvan commented 6 years ago

@echarles No, not yet. I am guessing this is just an optimization, i.e. the client first tries to use data from the node it is currently on. How much does this affect performance in a typical case, where most of the data is scattered?