Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine

k8s 1.8 template generated by acs-engine no longer working #1621

Closed zimmertr closed 6 years ago

zimmertr commented 6 years ago

Is this a request for help?:

No; bug report

Is this an ISSUE or FEATURE REQUEST? (choose one):

ISSUE

What version of acs-engine?:

5b57309 was used to generate the template.


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

k8s 1.8

What happened:

One week ago I generated the following template with acs-engine. k8s deployed fine from the template 3-4 times in a row. The template is no longer provisioning a working k8s 1.8 cluster.

https://gist.githubusercontent.com/zimmertr/e898f95077181f3a089cd5896f2f95aa/raw/6053d6edb879264bad2b8c59758301ee125a3154/gistfile1.txt

Oct 19 15:18:40 k8s-master-48084675-0 docker[1641]: W1019 15:18:40.700140 1838 status_manager.go:431] Failed to get status for pod "calico-node-r9n9g_kube-system(c796464e-b46b-11e7-8b57-000d3a220cf3)": client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

user@k8s-master-48084675-0:~$ etcdctl member list
33b0bb17e22c066a: name=k8s-master-48084675-0 peerURLs=http://172.17.38.30:2380 clientURLs=http://172.17.38.30:2379
7f03ea887d8488f0: name=k8s-master-48084675-2 peerURLs=http://172.17.38.32:2380 clientURLs=http://172.17.38.32:2379
a4058da44ae0514c: name=k8s-master-48084675-1 peerURLs=http://172.17.38.31:2380 clientURLs=http://172.17.38.31:2379

How to reproduce it (as minimally and precisely as possible):

Deploy the template listed above. The template is meant to be deployed on top of this network schema:

https://gist.githubusercontent.com/zimmertr/d03d04ae5b1325af25095aa6aa5b12db/raw/3b254d65b19a06d769febd4acfcf4dfce1232090/gistfile1.txt
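(For context, deploying the generated output looks roughly like this; the deployment name, resource group, and output directory below are placeholders, and the exact invocation I use appears further down in this thread:)

az group deployment create \
  --name <deployment-name> \
  --resource-group <resource-group> \
  --template-file <output-dir>/azuredeploy.json \
  --parameters <output-dir>/azuredeploy.parameters.json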

Anything else we need to know:

This template was working 100% successfully before; it has now inexplicably stopped working, with no changes on my end.

Calico released a new version (2.6.2) 3 days ago, which may be the culprit, but the release notes claim no Kubernetes-related changes were made: https://github.com/projectcalico/calico/releases/tag/v2.6.2

However, this pull request addressed a very similar Calico issue when k8s 1.8 first came out: https://github.com/Azure/acs-engine/pull/1511

It's also possible this pull request broke it, as it affected etcd recently as well: https://github.com/Azure/acs-engine/pull/1564

jchauncey commented 6 years ago

cc @anhowe @jackfrancis @CecileRobertMichon thoughts?

jackfrancis commented 6 years ago

@zimmertr could you verify that acs-engine:master as of 2017-10-23 still includes this regression? Let me know if you run into any trouble building local binaries.

zimmertr commented 6 years ago

@jackfrancis I can confirm that building the template with acs-engine v0.8.0 did not resolve the issue. I had previously built the template with 5b57309, so I wanted to note that up front.

When building the template with 5b57309, the above error occurs.

When building the template with the v0.8.0 release, the following error occurs:

Deployment failed. Deployment template validation failed: 'The template resource 'k8s-master-25635399-0' at line '1' and column '60028' is not valid: The language expression length limit exceeded. Limit: '24576' and actual: '24914'.. Please see https://aka.ms/arm-template-expressions for usage details.'.

I'm attempting to build acs-engine from master right now. Unfortunately, it looks like the script used to prepare the Docker container for acs-engine is not currently working. When executed, it proceeds most of the way through the Dockerfile's RUN steps until the following command fails:

The command '/bin/sh -c curl "https://storage.googleapis.com/kubernetes-release/release/v${KUBECTL_VERSION}/bin/linux/amd64/kubectl" > /usr/local/bin/kubectl     && chmod +x /usr/local/bin/kubectl' returned a non-zero code: 56
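(curl exit code 56 usually means the connection was dropped while receiving data, so the failure may be transient. A quick way to check the same download outside the Docker build; KUBECTL_VERSION is a placeholder for whatever version the Dockerfile pins:)

KUBECTL_VERSION=1.8.2   # placeholder; substitute the version pinned in the Dockerfile
curl -fSL "https://storage.googleapis.com/kubernetes-release/release/v${KUBECTL_VERSION}/bin/linux/amd64/kubectl" -o /tmp/kubectl
echo $?                 # a non-zero exit code (e.g. 56) reproduces the build failure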

I will build acs-engine from master manually and update. It may take a few moments though as I'll need to get golang and other dependencies installed.
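(For anyone else who needs to do the same, a manual build from master looks roughly like this; the make target and binary path are from memory and may differ between commits:)

git clone https://github.com/Azure/acs-engine.git
cd acs-engine
make build                  # assumes Go and make are installed
./bin/acs-engine version    # confirm which commit the binary was built from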

zimmertr commented 6 years ago

@jackfrancis using 5b57309d the template deploys successfully.

All masters are added to the cluster, and eventually kubectl get nodes reports a Ready state for each master. After a few minutes the agents are also added to the cluster, and eventually kubectl get nodes reports both the masters and the agents as Ready.

However, after all of the nodes become ready, the following logs just loop in journalctl -xef:

Oct 24 01:00:54 k8s-master-25635399-0 docker[1780]: I1024 01:00:54.489076    1973 kuberuntime_manager.go:499] Container {Name:calico-node Image:quay.io/calico/node:v2.4.1 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:DATASTORE_TYPE Value:kubernetes ValueFrom:nil} {Name:FELIX_LOGSEVERITYSCREEN Value:info ValueFrom:nil} {Name:FELIX_IPTABLESREFRESHINTERVAL Value:60 ValueFrom:nil} {Name:FELIX_IPV6SUPPORT Value:false ValueFrom:nil} {Name:CALICO_NETWORKING_BACKEND Value:none ValueFrom:nil} {Name:CLUSTER_TYPE Value:k8s,acse ValueFrom:nil} {Name:CALICO_DISABLE_FILE_LOGGING Value:true ValueFrom:nil} {Name:WAIT_FOR_DATASTORE Value:true ValueFrom:nil} {Name:IP Value: ValueFrom:nil} {Name:CALICO_IPV4POOL_CIDR Value:172.16.38.0/23 ValueFrom:nil} {Name:CALICO_IPV4POOL_IPIP Value:off ValueFrom:nil} {Name:FELIX_IPINIPENABLED Value:false ValueFrom:nil} {Name:FELIX_HEALTHENABLED Value:true ValueFrom:nil} {Name:NODENAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:spec.nodeName,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:FELIX_DEFAULTENDPOINTTOHOSTACTION Value:ACCEPT ValueFrom:nil}] Resources:{Limits:map[] Requests:map[cpu:{i:{value:250 scale:-3} d:{Dec:<nil>} s:250m Format:DecimalSI}]} VolumeMounts:[{Name:lib-modules ReadOnly:true MountPath:/lib/modules SubPath: MountPropagation:<nil>} {Name:var-run-calico ReadOnly:false MountPath:/var/run/calico SubPath: MountPropagation:<nil>} {Name:calico-node-token-vdwns ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/liveness,Port:9099,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:6,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readiness,Port:9099,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil Te

Oct 24 01:00:54 k8s-master-25635399-0 docker[1780]: rminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.

Oct 24 01:00:54 k8s-master-25635399-0 docker[1780]: I1024 01:00:54.489228    1973 kuberuntime_manager.go:738] checking backoff for container "calico-node" in pod "calico-node-tlbkd_kube-system(3ab52d67-b855-11e7-a9f5-000d3a045229)"

Oct 24 01:00:54 k8s-master-25635399-0 docker[1780]: I1024 01:00:54.489387    1973 kuberuntime_manager.go:748] Back-off 5m0s restarting failed container=calico-node pod=calico-node-tlbkd_kube-system(3ab52d67-b855-11e7-a9f5-000d3a045229)

Oct 24 01:00:54 k8s-master-25635399-0 docker[1780]: E1024 01:00:54.489415    1973 pod_workers.go:182] Error syncing pod 3ab52d67-b855-11e7-a9f5-000d3a045229 ("calico-node-tlbkd_kube-system(3ab52d67-b855-11e7-a9f5-000d3a045229)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-tlbkd_kube-system(3ab52d67-b855-11e7-a9f5-000d3a045229)"

Oct 24 01:00:56 k8s-master-25635399-0 docker[1780]: I1024 01:00:56.001446    1973 kubelet_node_status.go:499] Using Node Hostname from cloudprovider: "k8s-master-25635399-0"

And navigating to localhost:8001/ui after running kubectl proxy says the following:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "no endpoints available for service \"kubernetes-dashboard\"",
  "reason": "ServiceUnavailable",
  "code": 503
}
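(For completeness, the dashboard state above was checked roughly like this; the commands are inferred from the description rather than copied from a session:)

kubectl proxy &                  # proxy the API server to localhost:8001
curl http://localhost:8001/ui    # returns the ServiceUnavailable status shown above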
zimmertr commented 6 years ago

@jackfrancis Another interesting development: immediately after updating the route table, etcd crashes as well.

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader\n",
  "code": 500
}

And now this is looping in journalctl -xef as etcd fails to elect a leader:

Oct 24 01:16:21 k8s-master-25635399-0 etcd[1239]: got unexpected response error (etcdserver: request timed out)
Oct 24 01:16:22 k8s-master-25635399-0 etcd[1239]: 3184064431492acf is starting a new election at term 120
Oct 24 01:16:22 k8s-master-25635399-0 etcd[1239]: 3184064431492acf became candidate at term 121
Oct 24 01:16:22 k8s-master-25635399-0 etcd[1239]: 3184064431492acf received vote from 3184064431492acf at term 121
Oct 24 01:16:22 k8s-master-25635399-0 etcd[1239]: 3184064431492acf [logterm: 9, index: 39420] sent vote request to d615af62b0f11ab7 at term 121
Oct 24 01:16:22 k8s-master-25635399-0 etcd[1239]: 3184064431492acf [logterm: 9, index: 39420] sent vote request to a94fb7c917d5c45 at term 121
Oct 24 01:16:24 k8s-master-25635399-0 etcd[1239]: 3184064431492acf is starting a new election at term 121
Oct 24 01:16:24 k8s-master-25635399-0 etcd[1239]: 3184064431492acf became candidate at term 122
Oct 24 01:16:24 k8s-master-25635399-0 etcd[1239]: 3184064431492acf received vote from 3184064431492acf at term 122
Oct 24 01:16:24 k8s-master-25635399-0 etcd[1239]: 3184064431492acf [logterm: 9, index: 39420] sent vote request to a94fb7c917d5c45 at term 122
Oct 24 01:16:24 k8s-master-25635399-0 etcd[1239]: 3184064431492acf [logterm: 9, index: 39420] sent vote request to d615af62b0f11ab7 at term 122
Oct 24 01:16:25 k8s-master-25635399-0 etcd[1239]: 3184064431492acf is starting a new election at term 122
Oct 24 01:16:25 k8s-master-25635399-0 etcd[1239]: 3184064431492acf became candidate at term 123
Oct 24 01:16:25 k8s-master-25635399-0 etcd[1239]: 3184064431492acf received vote from 3184064431492acf at term 123
Oct 24 01:16:25 k8s-master-25635399-0 etcd[1239]: 3184064431492acf [logterm: 9, index: 39420] sent vote request to a94fb7c917d5c45 at term 123
Oct 24 01:16:25 k8s-master-25635399-0 etcd[1239]: 3184064431492acf [logterm: 9, index: 39420] sent vote request to d615af62b0f11ab7 at term 123
Oct 24 01:16:25 k8s-master-25635399-0 etcd[1239]: got unexpected response error (etcdserver: request timed out)
Oct 24 01:16:25 k8s-master-25635399-0 docker[1780]: I1024 01:16:25.506303    1973 prober.go:106] Readiness probe for "calico-node-tlbkd_kube-system(3ab52d67-b855-11e7-a9f5-000d3a045229):calico-node" failed (failure): Get http://172.16.38.30:9099/readiness: dial tcp
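(A quick way to narrow down whether the election loop above is a connectivity problem between the masters is to probe the peer URLs directly from the affected node. This is a suggested diagnostic rather than output from the original report; substitute the peer IPs reported by etcdctl member list:)

etcdctl member list                        # lists each member's peerURLs and clientURLs
etcdctl cluster-health                     # summarizes which members are reachable
curl -m 5 http://<peer-ip>:2380/version    # probe a peer directly over the peer port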
zimmertr commented 6 years ago

Hello @jackfrancis and @jchauncey. Is there any information I can provide or assistance I can offer to help hasten this bug report through?

jackfrancis commented 6 years ago

@zimmertr we're heads down towards a v0.9.0 release. Could you kindly test against current master (which should reflect pretty closely what v0.9.0 will be), or are you willing to wait until the release lands (ETA tomorrow EOD)? Thanks much for your patience!

zimmertr commented 6 years ago

@jackfrancis

I tried to deploy k8s today using my templates generated with 76c14d1f. Unfortunately the same issue persists: etcd is still stuck in a failure loop for the reasons listed above.

CecileRobertMichon commented 6 years ago

I was able to reproduce both issues. The issue on master ("Deployment failed. Deployment template validation failed: 'The template resource 'k8s-master-25635399-0' at line '1' and column '60028' is not valid: The language expression length limit exceeded. Limit: '24576' and actual: '24914'.. Please see https://aka.ms/arm-template-expressions for usage details.'.") is due to the custom data length exceeding the ARM limit. We are currently waiting for an increase to that limit, which should solve this issue in the coming week. After a first investigation, the issue experienced on previous builds does not seem related to the etcd3 support change (#1564), as it also occurs with c2904e0. I suspect this might be related to a Calico issue.

dtzar commented 6 years ago

I don't believe there is a problem with Calico itself. @CecileRobertMichon and others: for reference, the limit problem is documented here: https://github.com/Azure/acs-engine/issues/1159

zimmertr commented 6 years ago

@CecileRobertMichon is there anything I can do to shorten the length of my ARM template as a workaround?

24914 - 24576 = 338. 338 characters is quite a bit; perhaps even the length of all my inputted strings combined. :grimacing: Maybe I could switch to password auth instead of SSH keys or something?

I understand that this is an ARM template length restriction set by Microsoft, and you're waiting for that ceiling to be increased?

dtzar commented 6 years ago

@zimmertr if you remove RBAC you won't see the length error.
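(Concretely, that means dropping or disabling enableRbac under kubernetesConfig in the apimodel and regenerating; a sketch, assuming a config shaped like the examples later in this thread:)

  "kubernetesConfig": {
    "enableRbac": false,
    "networkPolicy": "calico"
  }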

zimmertr commented 6 years ago

@dtzar is that the only way? We'll probably just wait for a fix if so. We need Istio, so RBAC is pretty essential.

CecileRobertMichon commented 6 years ago

@zimmertr the ETA for the length increase is November 10th.

zimmertr commented 6 years ago

@CecileRobertMichon I provisioned a cluster today, 11/15, at 11:07AM using a template I generated yesterday at 9AM. Looks like I'm still having the same issue.

I was told the ETA was moved from 11/10 to 11/14 at 2PM. Is this still valid? Am I experiencing this issue as a result of the ARM Template Restriction? This is East US 2, if it helps.

kubectl get nodes

Error from server: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

etcdctl cluster-health

member 33b0bb17e22c066a is unhealthy: got unhealthy result from http://172.17.38.30:2379

kubectl cluster-info

Kubernetes master is running at https://dev-we-m.redacted.com

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

kubectl cluster-info dump

Error from server: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

journalctl -xe

Nov 15 19:59:27 k8s-master-48084675-0 docker[7056]: W1115 19:59:27.712637    8227 status_manager.go:431] Failed to get status for pod
Nov 15 19:59:27 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:29 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:30 k8s-master-48084675-0 docker[7056]: I1115 19:59:30.045913    8227 kubelet_node_status.go:503] Using Node Hostname fro
Nov 15 19:59:30 k8s-master-48084675-0 docker[7056]: W1115 19:59:30.648711    8227 cni.go:196] Unable to update cni config: No network
Nov 15 19:59:30 k8s-master-48084675-0 docker[7056]: E1115 19:59:30.648803    8227 kubelet.go:2095] Container runtime network not read
Nov 15 19:59:32 k8s-master-48084675-0 etcd[3353]: failed to reach the peerURL(http://172.17.38.32:2380) of member 7f03ea887d8488f0 (G
Nov 15 19:59:32 k8s-master-48084675-0 etcd[3353]: cannot get the version of member 7f03ea887d8488f0 (Get http://172.17.38.32:2380/ver
Nov 15 19:59:33 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:33 k8s-master-48084675-0 etcd[3353]: failed to reach the peerURL(http://172.17.38.31:2380) of member a4058da44ae0514c (G
Nov 15 19:59:33 k8s-master-48084675-0 etcd[3353]: cannot get the version of member a4058da44ae0514c (Get http://172.17.38.31:2380/ver
Nov 15 19:59:34 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:34 k8s-master-48084675-0 docker[7056]: W1115 19:59:34.714484    8227 status_manager.go:431] Failed to get status for pod
Nov 15 19:59:34 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:35 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:35 k8s-master-48084675-0 docker[7056]: W1115 19:59:35.649701    8227 cni.go:196] Unable to update cni config: No network
Nov 15 19:59:35 k8s-master-48084675-0 docker[7056]: E1115 19:59:35.650205    8227 kubelet.go:2095] Container runtime network not read
Nov 15 19:59:37 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:37 k8s-master-48084675-0 docker[7056]: E1115 19:59:37.050694    8227 kubelet_node_status.go:390] Error updating node sta
Nov 15 19:59:39 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:40 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
Nov 15 19:59:40 k8s-master-48084675-0 etcd[3353]: failed to reach the peerURL(http://172.17.38.32:2380) of member 7f03ea887d8488f0 (G
Nov 15 19:59:40 k8s-master-48084675-0 etcd[3353]: cannot get the version of member 7f03ea887d8488f0 (Get http://172.17.38.32:2380/ver
Nov 15 19:59:40 k8s-master-48084675-0 docker[7056]: W1115 19:59:40.651057    8227 cni.go:196] Unable to update cni config: No network
Nov 15 19:59:40 k8s-master-48084675-0 docker[7056]: E1115 19:59:40.651277    8227 kubelet.go:2095] Container runtime network not read
Nov 15 19:59:41 k8s-master-48084675-0 etcd[3353]: etcdserver: request timed out, possibly due to connection lost
jackfrancis commented 6 years ago

Did you get this error at deployment time?

The language expression length limit exceeded. (...)

I think the answer must be no, because you were able to actually provision infra. :)

The fix we've been tracking resolves that specific symptom. If you're not getting a deployment error complaining about that limit, then this is a different error.

zimmertr commented 6 years ago

@jackfrancis My source of truth on that being the issue was @CecileRobertMichon 's comment above. https://github.com/Azure/acs-engine/issues/1621#issuecomment-341775423

I have been deploying this template programmatically. Please allow me ~20 minutes to deploy manually and see if that error appears. Should I be deploying with az group deployment create in --debug mode to see that error?

jackfrancis commented 6 years ago

You don't need --debug to see that error!

CecileRobertMichon commented 6 years ago

@zimmertr sorry if my comment was confusing. What I meant was that there were two different issues: 1) the deployment issue that was due to the ARM template restriction. This is now fixed. 2) the etcd error which was not reproducible on master before because of 1), but now appears to still be an issue. When was your template last working?

zimmertr commented 6 years ago

@CecileRobertMichon Approximately 35 days ago. I can't really remember. This issue has been open for quite a while.

Here is more context:

I am using this template

That is deployed on this network schema

ACS Engine Version

Usr: tj - Wed 15,  1:53PM > acs-engine version
Version: canary
GitCommit: bbad0e13
GitTreeState: clean

Regenerating the template

Usr: tj - Wed 15,  1:53PM > acs-engine generate acs-engine/raw_json/dev-we-k8s-1.8.json --output-directory acs-engine/generated_templates/dev-we-k8s_1.8/ 
INFO[0000] Generating assets into acs-engine/generated_templates/dev-we-k8s_1.8/... 

Deploying the template

Usr: tj - Wed 15,  1:54PM > az group deployment create --name dev-we-kubernetes --resource-group dev-we --template-file acs-engine/generated_templates/dev-we-k8s_1.8/azuredeploy.json --parameters acs-engine/generated_templates/dev-we-k8s_1.8/azuredeploy.parameters.json

And, as requested by @jackfrancis, here is the successful template output:

Grabbing the route table

wert=$(az network route-table list -g dev-we | jq -r '.[].id')

Echoing the route table

Usr: tj - Wed 15,  2:13PM > echo $wert
/subscriptions/REDACTED/resourceGroups/dev-we/providers/Microsoft.Network/routeTables/k8s-master-48084675-routetable

Adding the k8s subnet to the route table

Usr: tj - Wed 15,  2:13PM > az network vnet subnet update -n dev-we-k8s-subnet -g dev-we --vnet-name dev-we-vnet --route-table $wert

Output from showing the subnet

Checking on the Kubernetes nodes

user@k8s-master-48084675-0:~$ kubectl get nodes
Error from server: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

Checking on the etcd health

user@k8s-master-48084675-0:~$ etcdctl cluster-health
member 33b0bb17e22c066a is unhealthy: got unhealthy result from http://172.17.38.30:2379

Here is an example output from journalctl -xe

Please note that BEFORE updating the route table, the Kubernetes nodes come up alive and healthy according to kubectl get nodes. All of this weird behavior only happens afterwards. And, at that time, if you run kubectl proxy you will find this:

user@k8s-master-48084675-0:~$ curl localhost:8001/ui
<a href="/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy">Temporary Redirect</a>.

Instructions/reasons to perform this action against the subnet/route table can be found here. If I can provide any more information at all, please let me know. I've tried to be as clear as possible.

tanner-bruce commented 6 years ago

@zimmertr @CecileRobertMichon I can confirm that this template https://gist.github.com/tanner-bruce/79d54aae9ab47fcb3ac6cb0ae9b581cf exhibits the same behaviour you are seeing. The cluster launches, and at first the nodes can speak to each other (i.e. all nodes are Ready); however, after the nodes have been up about three minutes, they all go into a NotReady state EXCEPT the master nodes.

At this point, I can't SSH onto any of the non-master nodes.

tanner-bruce commented 6 years ago

After spinning up the same template but with 1.7, I am seeing this now. Once again, all nodes came up as Ready and subsequently became NotReady. Interestingly, the instance in the nodes group is still marked as Ready.

k8s-edge-86417899-0        NotReady   agent     2m        v1.7.10
k8s-kafka-86417899-0       NotReady   agent     2m        v1.7.10
k8s-kafka-86417899-1       NotReady   agent     2m        v1.7.10
k8s-kafka-86417899-2       NotReady   agent     2m        v1.7.10
k8s-master-86417899-0      Ready      master    2m        v1.7.10
k8s-master-86417899-1      Ready      master    2m        v1.7.10
k8s-master-86417899-2      Ready      master    2m        v1.7.10
k8s-nifi-86417899-0        NotReady   agent     2m        v1.7.10
k8s-nodes-86417899-0       Ready      agent     2m        v1.7.10
k8s-zookeeper-86417899-0   NotReady   agent     2m        v1.7.10
k8s-zookeeper-86417899-1   NotReady   agent     2m        v1.7.10
k8s-zookeeper-86417899-2   NotReady   agent     2m        v1.7.10
jackfrancis commented 6 years ago

@zimmertr Are you willing to do a quick test and remove the "networkPolicy": "calico" configuration to see if we can home in on the offending vector?
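(In apimodel terms that would mean regenerating from a kubernetesConfig with the networkPolicy entry removed, so the cluster falls back to the default networking; a sketch, assuming a config shaped like the examples later in this thread:)

  "kubernetesConfig": {
    "enableRbac": true
  }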

dtzar commented 6 years ago

Unfortunately, I haven't had time to troubleshoot this. Calico needs to connect to the K8s API and etcd in order to work properly; if it can't, the nodes will not be available. Do we now require SSL to connect to the K8s API from the pods, or to etcd (this requires an update to the Calico config)? Do we require etcd 3? etcd 3 will not work unless we update to Calico v3.0.0-alpha1: https://github.com/projectcalico/calico/releases/tag/v3.0.0-alpha1

zimmertr commented 6 years ago

@jackfrancis I have removed that line from the JSON template passed to acs-engine and regenerated it. Upon deploying the template and adding the k8s subnet to the route table, I am greeted with the following. I am talented with UNIX internals so if you need me to do any deeper investigation into a root cause just let me know.

kubectl get nodes Before updating route table

user@k8s-master-48084675-0:~$ kubectl get nodes
NAME                    STATUS     ROLES     AGE       VERSION
k8s-agent-48084675-0    NotReady   agent     34m       v1.8.2
k8s-agent-48084675-1    NotReady   agent     34m       v1.8.2
k8s-agent-48084675-2    NotReady   agent     34m       v1.8.2
k8s-master-48084675-0   NotReady   master    45m       v1.8.2
k8s-master-48084675-1   Ready      master    45m       v1.8.2
k8s-master-48084675-2   Ready      master    45m       v1.8.2

kubectl get nodes After updating route table

user@k8s-master-48084675-0:~$ kubectl get nodes
Error from server: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

kubectl cluster-info Before updating route table

user@k8s-master-48084675-0:~$ kubectl cluster-info
Kubernetes master is running at https://mov-dev-we-m.westeurope.cloudapp.azure.com
Heapster is running at https://mov-dev-we-m.westeurope.cloudapp.azure.com/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://mov-dev-we-m.westeurope.cloudapp.azure.com/api/v1/namespaces/kube-system/services/kube-dns/proxy
kubernetes-dashboard is running at https://mov-dev-we-m.westeurope.cloudapp.azure.com/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
tiller-deploy is running at https://mov-dev-we-m.westeurope.cloudapp.azure.com/api/v1/namespaces/kube-system/services/tiller-deploy/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

kubectl cluster-info After updating route table

user@k8s-master-48084675-0:~$ kubectl cluster-info
Kubernetes master is running at https://mov-dev-we-m.westeurope.cloudapp.azure.com

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

etcdctl cluster-health After updating route table

user@k8s-master-48084675-0:~$ etcdctl cluster-health
member 33b0bb17e22c066a is unhealthy: got unhealthy result from http://172.17.38.30:2379

Journalctl -xe Before updating route table

Nov 16 23:43:45 k8s-master-48084675-0 docker[7046]: E1116 23:43:45.227029    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:43:49 k8s-master-48084675-0 docker[7046]: I1116 23:43:49.171337    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:43:49 k8s-master-48084675-0 docker[7046]: I1116 23:43:49.441822    8423 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Nov 16 23:43:50 k8s-master-48084675-0 docker[7046]: E1116 23:43:50.228279    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:43:55 k8s-master-48084675-0 docker[7046]: E1116 23:43:55.229131    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:43:59 k8s-master-48084675-0 docker[7046]: I1116 23:43:59.558602    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:44:00 k8s-master-48084675-0 docker[7046]: E1116 23:44:00.230219    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:05 k8s-master-48084675-0 docker[7046]: E1116 23:44:05.231284    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:10 k8s-master-48084675-0 docker[7046]: E1116 23:44:10.232406    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:10 k8s-master-48084675-0 docker[7046]: I1116 23:44:10.283195    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:44:15 k8s-master-48084675-0 docker[7046]: E1116 23:44:15.233462    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:20 k8s-master-48084675-0 docker[7046]: E1116 23:44:20.234372    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:20 k8s-master-48084675-0 docker[7046]: I1116 23:44:20.676023    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:44:25 k8s-master-48084675-0 docker[7046]: E1116 23:44:25.235722    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:30 k8s-master-48084675-0 docker[7046]: E1116 23:44:30.237944    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:31 k8s-master-48084675-0 docker[7046]: I1116 23:44:31.181439    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:44:35 k8s-master-48084675-0 docker[7046]: E1116 23:44:35.239285    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:40 k8s-master-48084675-0 docker[7046]: E1116 23:44:40.240299    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:41 k8s-master-48084675-0 docker[7046]: I1116 23:44:41.620354    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:44:45 k8s-master-48084675-0 docker[7046]: E1116 23:44:45.241237    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:49 k8s-master-48084675-0 docker[7046]: I1116 23:44:49.442225    8423 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Nov 16 23:44:50 k8s-master-48084675-0 docker[7046]: E1116 23:44:50.242256    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:44:52 k8s-master-48084675-0 docker[7046]: I1116 23:44:52.173091    8423 kubelet_node_status.go:503] Using Node Hostname from cloudprovider: "k8s-master-48084675-0"
Nov 16 23:44:55 k8s-master-48084675-0 docker[7046]: E1116 23:44:55.243249    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:45:00 k8s-master-48084675-0 docker[7046]: E1116 23:45:00.244213    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo

journalctl -xe After updating route table

Nov 16 23:46:41 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 37
Nov 16 23:46:41 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:42 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 37
Nov 16 23:46:42 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 38
Nov 16 23:46:42 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 38
Nov 16 23:46:42 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 38
Nov 16 23:46:42 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 38
Nov 16 23:46:44 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 38
Nov 16 23:46:44 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 39
Nov 16 23:46:44 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 39
Nov 16 23:46:44 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 39
Nov 16 23:46:44 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 39
Nov 16 23:46:45 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 39
Nov 16 23:46:45 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 40
Nov 16 23:46:45 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 40
Nov 16 23:46:45 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 40
Nov 16 23:46:45 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 40
Nov 16 23:46:45 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:45 k8s-master-48084675-0 docker[7046]: E1116 23:46:45.267308    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:46:46 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:46 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 40
Nov 16 23:46:46 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 41
Nov 16 23:46:46 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 41
Nov 16 23:46:46 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 41
Nov 16 23:46:46 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 41
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:47 k8s-master-48084675-0 docker[7046]: E1116 23:46:47.016271    8423 kubelet_node_status.go:390] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 41
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 42
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 42
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 42
Nov 16 23:46:47 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 42
Nov 16 23:46:49 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 42
Nov 16 23:46:49 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 43
Nov 16 23:46:49 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 43
Nov 16 23:46:49 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 43
Nov 16 23:46:49 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 43
Nov 16 23:46:49 k8s-master-48084675-0 docker[7046]: I1116 23:46:49.442896    8423 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Nov 16 23:46:50 k8s-master-48084675-0 docker[7046]: E1116 23:46:50.268344    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:46:50 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 43
Nov 16 23:46:50 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 44
Nov 16 23:46:50 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 44
Nov 16 23:46:50 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 44
Nov 16 23:46:50 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 44
Nov 16 23:46:51 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:51 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 44
Nov 16 23:46:51 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 45
Nov 16 23:46:51 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 45
Nov 16 23:46:51 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 45
Nov 16 23:46:51 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 45
Nov 16 23:46:53 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 45
Nov 16 23:46:53 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 46
Nov 16 23:46:53 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 46
Nov 16 23:46:53 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 46
Nov 16 23:46:53 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 46
Nov 16 23:46:53 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:54 k8s-master-48084675-0 docker[7046]: E1116 23:46:54.018640    8423 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "k8s-master-48084675-0": client: etcd cluster is unavailable or misconfigured; error #0: client: etcd m
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 46
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 47
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 47
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 47
Nov 16 23:46:54 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 47
Nov 16 23:46:55 k8s-master-48084675-0 docker[7046]: E1116 23:46:55.269354    8423 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is mo
Nov 16 23:46:55 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 47
Nov 16 23:46:55 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 48
Nov 16 23:46:55 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 48
Nov 16 23:46:55 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 48
Nov 16 23:46:55 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 48
Nov 16 23:46:57 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a is starting a new election at term 48
Nov 16 23:46:57 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a became candidate at term 49
Nov 16 23:46:57 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a received vote from 33b0bb17e22c066a at term 49
Nov 16 23:46:57 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to 7f03ea887d8488f0 at term 49
Nov 16 23:46:57 k8s-master-48084675-0 etcd[3287]: 33b0bb17e22c066a [logterm: 3, index: 35813] sent vote request to a4058da44ae0514c at term 49
Nov 16 23:46:57 k8s-master-48084675-0 etcd[3287]: got unexpected response error (etcdserver: request timed out)
dtzar commented 6 years ago

I'm looking into this today, collaborating with Jack.

zimmertr commented 6 years ago

Thanks @dtzar. I hope it's just user error. I understand that the acs-engine project probably isn't a high priority anymore with AKS right around the corner.

dtzar commented 6 years ago

I just provisioned a new cluster from master with calico using this template and everything is healthy (all pods in kube-system running, all nodes ready).

@zimmertr what route table command(s) are you running, and against which node(s)?

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "kubernetesConfig": {
          "networkPolicy": "calico"
        }
      },
      "masterProfile": {
        "count": 1,
        "dnsPrefix": "calico2617",
        "vmSize": "Standard_DS2_v2"
      },
      "agentPoolProfiles": [
        {
          "name": "agentpool1",
          "count": 2,
          "vmSize": "Standard_DS2_v2",
          "storageProfile" : "ManagedDisks",
          "availabilityProfile": "AvailabilitySet"
        }
      ],
...
}
dtzar commented 6 years ago

I know @zimmertr and @tanner-bruce are doing Calico + custom VNETs. I just wanted to make sure Calico + K8s without custom subnets works against the current master commit, and it does with the above template and the one below (which adds 1.8 + RBAC). I'm now trying to reproduce the problem with the vnet subnet sample woven into the template below, which is more similar to both of your configurations.


{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.8",
        "kubernetesConfig": {
          "enableRbac": true,
          "networkPolicy": "calico"
        }
      },
      "masterProfile": {
        "count": 1,
        "dnsPrefix": "calicoall",
        "vmSize": "Standard_DS2_v2"
      },
      "agentPoolProfiles": [
        {
          "name": "agentpool1",
          "count": 2,
          "vmSize": "Standard_DS2_v2",
          "storageProfile" : "ManagedDisks",
          "availabilityProfile": "AvailabilitySet"
        }
      ],
      "linuxProfile": {
        "adminUsername": "azureuser",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "REDACTED"
            }
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "REDACTED",
        "secret": "REDACTED"
      }
    }
  }
dtzar commented 6 years ago

Ok, I deployed the below template and everything is still working fine for me. I used the pre/post scripts and subnet from the examples/vnet section of the repo. The next step would be to test an updated template which adds another subnet and perhaps puts the master/agent nodes onto the 2nd subnet (like @zimmertr does).

@zimmertr - It would be good to know exactly what routes you're gathering and how you're applying them, since everything breaks for you after this step. Also, if you're open to it, it would be good to work up from the example template I have below to mimic more of your scenario in a minimal form that reproduces the behavior. Not sure if it would help (or break things), but have you also tried removing the vnetCidr and/or clusterSubnet from your base template?

@tanner-bruce you are not using Calico (you don't need to, but then it isn't playing a role here), and you also have not specified vnetSubnetId for any of your agent pools, so this will definitely fail as you have it.

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.8",
        "kubernetesConfig": {
          "enableRbac": true,
          "networkPolicy": "calico"
        }
      },
      "masterProfile": {
        "count": 1,
        "dnsPrefix": "calicovnetk8s",
        "vmSize": "Standard_DS2_v2",
        "vnetSubnetId": "/subscriptions/REDACTED/resourceGroups/acsk8svnet/providers/Microsoft.Network/virtualNetworks/KubernetesCustomVNET/subnets/KubernetesSubnet",
        "firstConsecutiveStaticIP": "10.239.255.239"
      },
      "agentPoolProfiles": [
        {
          "name": "agentpool1",
          "count": 2,
          "vmSize": "Standard_DS2_v2",
          "storageProfile" : "ManagedDisks",
          "vnetSubnetId": "/subscriptions/REDACTED/resourceGroups/acsk8svnet/providers/Microsoft.Network/virtualNetworks/KubernetesCustomVNET/subnets/KubernetesSubnet",
          "availabilityProfile": "AvailabilitySet"
        }
      ],
      "linuxProfile": {
        "adminUsername": "azureuser",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "REDACTED"
            }
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "REDACTED",
        "secret": "REDACTED"
      }
    }
  }
jalberto commented 6 years ago

I think I am having a related issue in #2022