apache-spark-on-k8s / kubernetes-HDFS

Repository holding configuration files for running an HDFS cluster in Kubernetes
Apache License 2.0

Understanding of PodCIDRToNodeMapping.resolve behavior #29

Open echarles opened 6 years ago

echarles commented 6 years ago

I have tested PodCIDRToNodeMapping on a k8s cluster deployed with the provided charts. From my findings, I wonder if it behaves as expected: I have to give the node name as input to the PodCIDRToNodeMapping.resolve method to get back the network paths (see details hereafter).

So is it correct that the mapping is: nodename -> path?

kubectl get nodes
NAME                                       STATUS    ROLES     AGE       VERSION
ip-10-0-0-204.us-west-2.compute.internal   Ready     master    3d        v1.8.4
ip-10-0-2-115.us-west-2.compute.internal   Ready     <none>    3h        v1.8.4
ip-10-0-2-230.us-west-2.compute.internal   Ready     <none>    2h        v1.8.4
ip-10-0-2-246.us-west-2.compute.internal   Ready     <none>    3h        v1.8.4
ip-10-0-3-43.us-west-2.compute.internal    Ready     <none>    2h        v1.8.4
ip-10-0-3-44.us-west-2.compute.internal    Ready     <none>    3h        v1.8.4
ip-10-0-3-74.us-west-2.compute.internal    Ready     <none>    3h        v1.8.4
kubectl get pods
hdfs-datanode-822rr                                    1/1       Running   0          2h
hdfs-datanode-8ljpq                                    1/1       Running   0          2h
hdfs-datanode-fxmrf                                    1/1       Running   0          2h
hdfs-datanode-gfszw                                    1/1       Running   0          2h
hdfs-datanode-vvpj4                                    1/1       Running   0          2h
hdfs-namenode-0                                        1/1       Running   0          2h
k8s-dashboard-kubernetes-dashboard-5c55c757f7-n798h    1/1       Running   0          4h
spark-exec-1                                           1/1       Running   0          50m
spark-exec-2                                           1/1       Running   0          50m
spark-exec-3                                           1/1       Running   0          50m
spark-k8s-resource-staging-server-7d69477f66-v2d4b     1/1       Running   0          4h
spark-k8s-shuffle-service-fbztt                        1/1       Running   0          3h
spark-k8s-shuffle-service-hshm7                        1/1       Running   0          3h
spark-k8s-shuffle-service-hwx2j                        1/1       Running   0          3h
spark-k8s-shuffle-service-jp7nf                        1/1       Running   0          3h
spark-k8s-shuffle-service-nvdq5                        1/1       Running   0          2h
spark-k8s-shuffle-service-qzxsc                        1/1       Running   0          2h
zeppelin-k8s-hdfs-locality-zeppelin-76b6cdd799-mw9q6   1/1       Running   0          1h

From the zeppelin pod, I run:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.PodCIDRToNodeMapping
import collection.JavaConversions._

val conf = new Configuration()
val plugin = new PodCIDRToNodeMapping()
plugin.setConf(conf)

val networkPathDirs = plugin.resolve(List(
    "ip-10-0-0-204.us-west-2.compute.internal",
    "ip-10-0-2-115.us-west-2.compute.internal",
    "ip-10-0-2-230.us-west-2.compute.internal",
    "ip-10-0-2-246.us-west-2.compute.internal",
    "ip-10-0-3-43.us-west-2.compute.internal",
    "ip-10-0-3-44.us-west-2.compute.internal",
    "ip-10-0-3-74.us-west-2.compute.internal",
    "10.0.3.74",
    "hdfs-datanode-gfszw",
    "hdfs-datanode-vvpj4",
    "unknown"
    ))
networkPathDirs.foreach(println)

and get back:

/default-rack/ip-10-0-0-204
/default-rack/ip-10-0-2-115
/default-rack/ip-10-0-2-230
/default-rack/ip-10-0-2-246
/default-rack/ip-10-0-3-43
/default-rack/ip-10-0-3-44
/default-rack/ip-10-0-3-74
/default-rack/default-nodegroup
/default-rack/default-nodegroup
/default-rack/default-nodegroup
/default-rack/default-nodegroup
kimoonkim commented 6 years ago

Hi @echarles. Thanks for trying out the plugin.

Yes, the plugin is supposed to return a network path for a cluster node name or pod IP address. The output you shared has some good results, but it also has bad results:

/default-rack/ip-10-0-0-204
/default-rack/ip-10-0-2-115
/default-rack/ip-10-0-2-230
/default-rack/ip-10-0-2-246
/default-rack/ip-10-0-3-43
/default-rack/ip-10-0-3-44
/default-rack/ip-10-0-3-74

The above paths are returned for cluster nodes. And they are correct.

However, the following entries, returned for pod IP addresses, are bad responses. They are basically the default value indicating lookup failure.

/default-rack/default-nodegroup
/default-rack/default-nodegroup
/default-rack/default-nodegroup
/default-rack/default-nodegroup

First of all, the plugin can only handle pod IP addresses as input, not pod names, because kube-dns does not support pod name to IP translation. So, of those, only the "10.0.3.74" input would be valid. But even for that we get the default path, so something is wrong.

What network provider are you using? The plugin is known to work with the kubenet provider, FYI.

More importantly, is your network provider correctly setting podCIDR on cluster nodes? You can check this with the following command, I think; the podCIDR value is set inside each node spec:

$ kubectl get nodes -o json | grep -i cidr
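
The reason this matters: the lookup is essentially "find the node whose podCIDR contains the pod IP and return that node's path, otherwise fall back to the default". A minimal sketch of that idea in Scala (not the plugin's actual code; the CIDR map below is made up):

// Minimal sketch (not the plugin's actual code): map a pod IP to the node whose
// podCIDR contains it, otherwise fall back to the default network path.
object PodCidrLookupSketch {
  // Made-up sample of node name -> spec.podCIDR, as reported by the K8s API.
  val nodeCidrs: Map[String, String] = Map(
    "ip-10-0-2-115" -> "192.168.2.0/24",
    "ip-10-0-3-74"  -> "192.168.9.0/24")

  private def ipToLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  private def inCidr(ip: String, cidr: String): Boolean = {
    val Array(base, bits) = cidr.split('/')
    val mask = (-1L << (32 - bits.toInt)) & 0xFFFFFFFFL
    (ipToLong(ip) & mask) == (ipToLong(base) & mask)
  }

  def resolvePodIp(podIp: String): String =
    nodeCidrs.collectFirst { case (node, cidr) if inCidr(podIp, cidr) =>
      s"/default-rack/$node"
    }.getOrElse("/default-rack/default-nodegroup")
}

So if podCIDR is missing from the node specs, or does not match the addresses pods actually get, every pod IP lookup ends up at /default-rack/default-nodegroup, which is what you are seeing.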
echarles commented 6 years ago

@kimoonkim Thank you for your support, clear explanations and insights.

So the parameter of the resolve method must be an IP address (not a pod name). Btw, it would be good to add a small javadoc line to the PodCIDRToNodeMapping class. Before your explanation, I had to go to the superclass and read "Resolves a list of DNS-names/IP-addresses and returns back a list of switch information (network paths)", hence my tests.
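
E.g. a one-liner along these lines would have saved the trip to the superclass (just a suggested wording, not existing code):

/**
 * Maps a cluster node name or a pod IP address to a network path such as
 * "/default-rack/<node>". Pod names are not supported as input; pass pod
 * IP addresses instead.
 */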

Btw, you say kube-dns does not support pod name to IP lookup. Is there a way to get this working?

I have redeployed a new cluster with Calico enabled (like the previous one) with kubectl apply -f https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/kubeadm/1.6/calico.yaml.

The nodes:

NAME                                       STATUS    ROLES     AGE       VERSION
ip-10-0-0-108.us-west-2.compute.internal   Ready     <none>    2h        v1.8.4
ip-10-0-0-128.us-west-2.compute.internal   Ready     <none>    3h        v1.8.4
ip-10-0-0-199.us-west-2.compute.internal   Ready     <none>    3h        v1.8.4
ip-10-0-0-210.us-west-2.compute.internal   Ready     master    3h        v1.8.4
ip-10-0-0-36.us-west-2.compute.internal    Ready     <none>    2h        v1.8.4
ip-10-0-0-84.us-west-2.compute.internal    Ready     <none>    2h        v1.8.4
ip-10-0-0-93.us-west-2.compute.internal    Ready     <none>    2h        v1.8.4

Now, with

val networkPathDirs = plugin.resolve(List(
    "ip-10-0-0-108.us-west-2.compute.internal",
    "ip-10-0-0-128.us-west-2.compute.internal",
    "ip-10-0-0-199.us-west-2.compute.internal",
    "ip-10-0-0-210.us-west-2.compute.internal",
    "ip-10-0-0-36.us-west-2.compute.internal",
    "ip-10-0-0-84.us-west-2.compute.internal",
    "ip-10-0-0-93.us-west-2.compute.internal",
    "unkown"
    ))
networkPathDirs.foreach(println)

I receive

/default-rack/ip-10-0-0-108 
/default-rack/ip-10-0-0-128 
/default-rack/ip-10-0-0-199 
/default-rack/ip-10-0-0-210 
/default-rack/ip-10-0-0-36 
/default-rack/ip-10-0-0-84 
/default-rack/ip-10-0-0-93 
/default-rack/default-nodegroup

This validates the test (network paths are correctly resolved), and, as a good sign, hdfs --loglevel DEBUG dfs -cat /hosts shows a connection to the local datanode.

The test is quite manual. Have you thought about a way to automate this (an HdfsLocalityTest class or some such)?
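
For what it's worth, something along these lines could turn the manual check into an assertion (just a rough sketch; the class name and the default-path check are based on the output above, not on anything existing in the repo):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.PodCIDRToNodeMapping
import scala.collection.JavaConverters._

// Rough sketch: resolve every cluster node name and fail if any of them
// falls back to the default path, which would indicate a broken lookup.
object HdfsLocalitySmokeTest {
  def main(args: Array[String]): Unit = {
    val nodeNames = args.toList  // e.g. pass the names from `kubectl get nodes`
    val plugin = new PodCIDRToNodeMapping()
    plugin.setConf(new Configuration())

    val paths = plugin.resolve(nodeNames.asJava).asScala
    val fallbacks = nodeNames.zip(paths).collect {
      case (name, path) if path.endsWith("/default-nodegroup") => name
    }
    require(fallbacks.isEmpty, s"Default path returned for: ${fallbacks.mkString(", ")}")
    println("All node names resolved to concrete network paths")
  }
}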

For info, kubectl get nodes -o json | grep -i cidr returns the following, which sounds good to me:

                "podCIDR": "192.168.7.0/24",
                "podCIDR": "192.168.2.0/24",
                "podCIDR": "192.168.1.0/24",
                "podCIDR": "192.168.0.0/24",
                "podCIDR": "192.168.10.0/24",
                "podCIDR": "192.168.8.0/24",
                "podCIDR": "192.168.9.0/24",
echarles commented 6 years ago

Mmh, actually, my setup resolves hostname -> network path.

It does not resolve IP address -> network path.

Am I missing something with this Calico setup?

kimoonkim commented 6 years ago

Ah, you are using Calico. I believe Calico does not need this PodCIDRToNodeMapping plugin. And I remember seeing Calico setting podCIDRs to wrong values. That's probably why the plugin does not resolve pod IPs in your test.

I suggest checking Calico's nat-outgoing option. When I tried Calico on EC2 using kops, kops was setting nat-outgoing automatically. Your HDFS namenode may already work without this plugin. From the README.md of the plugin dir:

Calico is a popular non-overlay network provider. It turns out Calico can be also configured to do NAT between pod subnet and node subnet thanks to the nat-outgoing option. The option can be easily turned on and is enabled by default.

echarles commented 6 years ago

@kimoonkim I've given Calico another try, explicitly setting nat-outgoing with calicoctl, and I get the same result (only hostnames are resolved).

cat << EOF | calicoctl apply -f -
apiVersion: v1
kind: ipPool
metadata:
  cidr: 192.168.0.0/16
spec:
  ipip:
    enabled: true
  nat-outgoing: true
EOF

Same with Flannel (only hostnames are resolved), and I am even losing locality when I hdfs cat with loglevel DEBUG.

kimoonkim commented 6 years ago

If you use Calico with nat-outgoing or Flannel, then you do not need to use this PodCIDRToNodeMapping plugin, and the plugin would not do the right thing for those network providers. The namenode should be able to see the physical IP addresses of the underlying K8s cluster nodes without the plugin, because these network providers rewrite pod packets, replacing pod IPs with cluster node IPs. Using the plugin only confuses the namenode.

Can you please remove the plugin from the namenode and see if the data locality works?
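
For reference, if the plugin is wired in the usual Hadoop way, it sits behind the net.topology.node.switch.mapping.impl key, so reverting that setting in the namenode's configuration should be enough to disable it. A quick check (a sketch; I don't know how your charts set it exactly):

import org.apache.hadoop.conf.Configuration

// Sketch: check which topology mapping the namenode configuration resolves to.
// With the plugin enabled this should print org.apache.hadoop.net.PodCIDRToNodeMapping;
// reverting to the Hadoop default (org.apache.hadoop.net.ScriptBasedMapping)
// effectively disables it. How the charts set this may differ.
val conf = new Configuration()
println(conf.get("net.topology.node.switch.mapping.impl",
  "org.apache.hadoop.net.ScriptBasedMapping"))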