Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License

Wrong --initial-cluster parameter of etcd: it uses worker node IPs instead of master node IPs #791

Closed zjfroot closed 5 years ago

zjfroot commented 5 years ago

Is this a request for help?:

Yes

Is this an ISSUE or FEATURE REQUEST? (choose one):

ISSUE

What version of aks-engine?:

v0.32.3

Kubernetes version:

1.12

What happened: We were trying to deploy a k8s cluster with aks-engine to our existing vnet.

We have two subnets:

The worker nodes subnet has a CIDR of 10.11.0.0/19.
The master nodes subnet has a CIDR of 10.11.255.0/24.

The vnet has a cidr 10.11.0.0/16

We have the following cluster config for aks-engine (VMSS, Multi Zone, Custom Vnet/Subnets):

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.12",
        "kubernetesConfig": {
          "clusterSubnet": "10.11.0.0/19"
        }
      },
      "masterProfile": {
        "count": 3,
        "OSDiskSizeGB": 100,
        "dnsPrefix": "production-cluster-abcdef",
        "vmSize": "Standard_DS2_v2",
        "vnetSubnetId":"/subscriptions/xxx/k8smasters",
        "agentVnetSubnetId":"/subscriptions/xxx/default",
        "vnetCidr":"10.11.0.0/16",
        "availabilityProfile": "VirtualMachineScaleSets",
        "availabilityZones": [
            "1",
            "2",
            "3"
        ]
      },
      "agentPoolProfiles": [
        {
            "name": "agentpool",
            "count": 6,
            "vmSize": "Standard_DS2_v2",
            "OSDiskSizeGB": 100,
            "vnetSubnetId":"/subscriptions/xxx/default",
            "availabilityProfile": "VirtualMachineScaleSets",
            "availabilityZones": [
                "1",
                "2",
                "3"
            ]
        }
      ],
      "linuxProfile": {
        "adminUsername": "azureuser",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "xxx"
            }
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "",
        "secret": ""
      }
    }
}

We tried a few times; it always fails to deploy the master VMSS. After checking /var/log/azure/cluster-provision.log on the master nodes, it turns out etcd couldn't start.

It looks like etcd has the following command line parameters:

--name k8s-master-32067141-vmss000002 --peer-client-cert-auth --peer-trusted-ca-file=/etc/kubernetes/certs/ca.crt --peer-cert-file=/etc/kubernetes/certs/etcdpeer2.crt --peer-key-file=/etc/kubernetes/certs/etcdpeer2.key --initial-advertise-peer-urls https://10.11.255.67:2380 --listen-peer-urls https://10.11.255.67:2380 --client-cert-auth --trusted-ca-file=/etc/kubernetes/certs/ca.crt --cert-file=/etc/kubernetes/certs/etcdserver.crt --key-file=/etc/kubernetes/certs/etcdserver.key --advertise-client-urls https://10.11.255.67:2379 --listen-client-urls https://10.11.255.67:2379,https://127.0.0.1:2379 --initial-cluster-token k8s-etcd-cluster --initial-cluster k8s-master-32067141-vmss000000=https://10.11.0.4:2380,k8s-master-32067141-vmss000001=https://10.11.0.35:2380,k8s-master-32067141-vmss000002=https://10.11.0.66:2380 --data-dir /var/lib/etcddisk --initial-cluster-state new

It seems like it gets a wrong --initial-cluster parameter. It passes https://10.11.0.4:2380, https://10.11.0.35:2380 and https://10.11.0.66:2380, but those are actually worker node IPs, not master node IPs.

If I understand correctly, the --initial-cluster parameter of etcd should list the master nodes, not the worker nodes.
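For illustration, this is the shape I would expect --initial-cluster to have, with peer URLs from the master subnet. Only 10.11.255.67 is taken from the real log above; the other two addresses are placeholders I made up for the example:

    # expected shape only; 10.11.255.65 and 10.11.255.66 are made-up example addresses
    --initial-cluster k8s-master-32067141-vmss000000=https://10.11.255.65:2380,k8s-master-32067141-vmss000001=https://10.11.255.66:2380,k8s-master-32067141-vmss000002=https://10.11.255.67:2380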

What you expected to happen: We expect the masters to be correctly provisioned and etcd to be up and running.

How to reproduce it (as minimally and precisely as possible): Use a JSON config similar to the example above, with the same subnet ranges, and deploy it with aks-engine, as sketched below.
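For completeness, this is roughly the flow we use (the api model file name and resource group name are placeholders; the generate-then-ARM-deploy path from the aks-engine docs should reproduce it just as well as aks-engine deploy):

    # cluster-config.json is the api model shown above; my-vnet-rg is a placeholder resource group
    aks-engine generate cluster-config.json
    az group deployment create \
        --resource-group my-vnet-rg \
        --template-file _output/production-cluster-abcdef/azuredeploy.json \
        --parameters _output/production-cluster-abcdef/azuredeploy.parameters.json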

Anything else we need to know: Maybe our config is not quite correct? We might need to specify something extra so that aks-engine picks up the correct master IPs when starting etcd?

welcome[bot] commented 5 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

jackfrancis commented 5 years ago

Hi @zjfroot, are you intentionally putting your master nodes into a subnet outside your --cluster-cidr IP address range? @palma21 to your knowledge is that a viable network configuration?

palma21 commented 5 years ago

I think the clusterSubnet range might be the problem (would using 10.11.0.0/16 work? see the snippet below), but this should be a possible config

CC @khenidak @juan-lee for thoughts
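To be concrete, the change I have in mind is just this part of the api model, with everything else left as in your example (assuming a /16 pod CIDR is acceptable in your setup):

    "kubernetesConfig": {
      "clusterSubnet": "10.11.0.0/16"
    }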

zjfroot commented 5 years ago

@jackfrancis When you say --cluster-cidr, do you mean the 10.11.0.0/19 in clusterSubnet of kubernetesConfig?

In our case, the master and worker node subnets are not adjacent, CIDR-wise. How can we specify a clusterSubnet that covers both?

Also, according to the cluster definition doc, clusterSubnet is:

The IP subnet used for allocating IP addresses for pod network interfaces.

I assumed it only applies to the worker nodes; that is why the CIDR range for clusterSubnet in my example is 10.11.0.0/19, which is the worker nodes subnet.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.