Closed jalberto closed 5 years ago
maybe related:
Is it possible to run cloud-init manually to recreate what is missing?
etcd logs:
Mar 30 16:32:06 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 received vote from 134af6e5d3e35861 at term 4430
Mar 30 16:32:06 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 2a0965170762fd09 at term 4430
Mar 30 16:32:06 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 91d25258d70ad868 at term 4430
Mar 30 16:32:07 k8s-master-11577755-0 etcd[48953]: publish error: etcdserver: request timed out
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 is starting a new election at term 4430
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 became candidate at term 4431
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 received vote from 134af6e5d3e35861 at term 4431
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 2a0965170762fd09 at term 4431
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 91d25258d70ad868 at term 4431
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 is starting a new election at term 4431
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 became candidate at term 4432
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 received vote from 134af6e5d3e35861 at term 4432
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 2a0965170762fd09 at term 4432
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 91d25258d70ad868 at term 4432
kubelet logs:
Mar 30 17:08:49 k8s-master-11577755-0 systemd[1]: Started Kubelet.
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Flag --non-masquerade-cidr has been deprecated, will be removed in a future version
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Flag --keep-terminated-pod-volumes has been deprecated, will be removed in a future version
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Flag --non-masquerade-cidr has been deprecated, will be removed in a future version
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.064943 5595 feature_gate.go:162] feature gates: map[]
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.065489 5595 mount_linux.go:196] Detected OS without systemd
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.065523 5595 client.go:75] Connecting to docker on unix:///var/run/docker.sock
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.065572 5595 client.go:95] Start docker client with request timeout=2m0s
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: W0330 17:08:50.066686 5595 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Error: failed to run Kubelet: could not init cloud provider "azure": No credentials provided for AAD application
So I changed every reference to "2.5.2" to "2.3.7" in my _output dir and tried to upgrade to k8s 1.9, and, surprise, etcd 2.5.2 is still being installed on the master.
Hi @jalberto thanks for your courage. What's the first thing we can look at?
@jackfrancis thanks for your time
IMHO:
Let's figure out why the 2.5.2 etcd bug is still present; that was fixed a while ago. What does "etcd 2.5.2 still trying to be installed in master" mean exactly?
When running the upgrade command against a working cluster with etcd 2.3.7, the new acs-engine creates the file /opt/azure/containers/setup-etcd.sh with this content:
#!/bin/bash
set -x
source /opt/azure/containers/provision_source.sh
ETCD_VER=v2.5.2
DOWNLOAD_URL=https://acs-mirror.azureedge.net/github-coreos
retrycmd_if_failure 5 5 curl --retry 5 --retry-delay 10 --retry-max-time 30 -L ${DOWNLOAD_URL}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /usr/bin/ --strip-components=1
systemctl daemon-reload
systemctl enable etcd.service
sudo sed -i "1iETCDCTL_ENDPOINTS=https://127.0.0.1:2379" /etc/environment
sudo sed -i "1iETCDCTL_CA_FILE=/etc/kubernetes/certs/ca.crt" /etc/environment
sudo sed -i "1iETCDCTL_KEY_FILE=/etc/kubernetes/certs/etcdclient.key" /etc/environment
sudo sed -i "1iETCDCTL_CERT_FILE=/etc/kubernetes/certs/etcdclient.crt" /etc/environment
Clearly the curl command fails to fetch that version, but the script keeps running to the end, so acs-engine reports it as successful.
@jackfrancis is there a way to run a command on the master to recreate the initial provisioning steps (after first removing the /opt/azure/containers/*.complete files)?
Why? Because if I manually go into each master, change that value, and re-run setup_etcd.sh successfully, the global state of the master is still inconsistent (as etcd wasn't ready when some key steps in the process ran).
BTW, point 1 can be achieved just by exiting the script on any error; at least that would stop the process.
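A fail-fast version of that etcd setup script is straightforward to sketch. This is a hypothetical variant, not the actual acs-engine fix; `must_run` is an invented guard helper:

```shell
#!/bin/bash
# Hypothetical fail-fast pattern for setup-etcd.sh (a sketch, not the
# actual acs-engine fix): check each step and abort on failure instead
# of letting later steps run against a half-provisioned node.
set -uo pipefail

# must_run is a hypothetical guard: run a command, and abort the whole
# script with a message if it fails.
must_run() {
  "$@" || { echo "provisioning step failed: $*" >&2; exit 1; }
}

# In the real script the guarded steps would be the curl download, the
# tar extraction, and the systemctl calls, e.g.:
#   must_run curl -fsSL "${DOWNLOAD_URL}/etcd-${ETCD_VER}-linux-amd64.tar.gz" -o "$TARBALL"
#   must_run tar xzvf "$TARBALL" -C /usr/bin/ --strip-components=1
# Demonstrated here with a stand-in command:
must_run true
echo "provisioning continues only if every guarded step succeeded"
```

With a guard like this, a failed download stops provisioning immediately instead of leaving tar and systemctl to run against a missing tarball.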
There is no way to recreate an original provisioning, no. The only semi-notion of cluster "state" (quotations intentional) lives in the api model on the client side, which as you've discovered is only a fractional representation of the actual cluster, especially w/ respect to the original api model representation vs a newer version of acs-engine.
The ETCD_VER=v2.5.2 derives from the value of etcdVersion in the kubernetesConfig at the time of template generation. E.g., in your api model:
<etc>
"kubernetesConfig" {
"etcdVersion": "3.2.16"
}
<etc>
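One way to sanity-check this before regenerating templates is to grep the api model for the value. This is a sketch: in a real cluster directory the file would live under acs-engine's `_output/<dnsPrefix>/apimodel.json` layout, and the stand-in file below exists only to make the example self-contained:

```shell
# Sketch: confirm which etcdVersion template generation will pick up by
# grepping the api model. A stand-in file is used here; point APIMODEL
# at your own _output/<dnsPrefix>/apimodel.json instead.
APIMODEL=/tmp/apimodel.json
printf '{ "kubernetesConfig": { "etcdVersion": "2.3.7" } }\n' > "$APIMODEL"

ETCD_LINE=$(grep -o '"etcdVersion": *"[^"]*"' "$APIMODEL")
echo "$ETCD_LINE"   # "etcdVersion": "2.3.7"
```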
@jackfrancis this can be related: https://github.com/kubernetes/kubernetes/issues/54918
I am using calico and no file is being created in /etc/cni
how can I install it manually?
@jackfrancis I understand it should come from there, but before I ran upgrade I edited my _output/foo/apimodel.json to fix the etcd version.
Any suggestion? My cluster has been down for many hours now.
@jackfrancis more context:
root@k8s-master-11577755-0:~# docker ps --format '{{.Image}} - {{.Names}}'
gcrio.azureedge.net/google_containers/hyperkube-amd64:v1.9.5 - wizardly_bartik
gcrio.azureedge.net/google_containers/hyperkube-amd64@sha256:a31961a719a1d0ade89149a6a8db5181cbef461baa6ef049681c31c0e48d9f1e - k8s_kube-controller-manager_kube-controller-manager-k8s-master-11577755-0_kube-system_beaaf22644028e3842cf0847ccb58d15_1
k8s-gcrio.azureedge.net/kube-addon-manager-amd64@sha256:3519273916ba45cfc9b318448d4629819cb5fbccbb0822cce054dd8c1f68cb60 - k8s_kube-addon-manager_kube-addon-manager-k8s-master-11577755-0_kube-system_61d4fa32deceb6175822bf42fb7410f2_1
gcrio.azureedge.net/google_containers/hyperkube-amd64@sha256:a31961a719a1d0ade89149a6a8db5181cbef461baa6ef049681c31c0e48d9f1e - k8s_kube-scheduler_kube-scheduler-k8s-master-11577755-0_kube-system_27fb8458832c33c3b8754aca44f00158_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-controller-manager-k8s-master-11577755-0_kube-system_beaaf22644028e3842cf0847ccb58d15_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-apiserver-k8s-master-11577755-0_kube-system_a0596cf3432e0574f040528197bc3441_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-addon-manager-k8s-master-11577755-0_kube-system_61d4fa32deceb6175822bf42fb7410f2_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-scheduler-k8s-master-11577755-0_kube-system_27fb8458832c33c3b8754aca44f00158_1
I0330 19:08:21.516738 35291 kubelet.go:316] Watching apiserver
E0330 19:08:21.517265 35291 file.go:149] Can't process manifest file "/etc/kubernetes/manifests/audit-policy.yaml": /etc/kubernetes/manifests/audit-policy.yaml: couldn't parse as pod(no kind "Policy" is registered for version "audit.k8s.io/v1beta1"), please check manifest file.
E0330 19:08:21.545031 35291 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.240.255.5:443/api/v1/pods?fieldSelector=spec.nodeName%3Dk8s-master-11577755-0&limit=500&resourceVersion=0: dial tcp 10.240.255.5:443: getsockopt: connection refused
E0330 19:08:21.545778 35291 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:480: Failed to list *v1.Node: Get https://10.240.255.5:443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s-master-11577755-0&limit=500&resourceVersion=0: dial tcp 10.240.255.5:443: getsockopt: connection refused
E0330 19:08:21.546344 35291 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:471: Failed to list *v1.Service: Get https://10.240.255.5:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.240.255.5:443: getsockopt: connection refused
W0330 19:08:21.569904 35291 kubelet_network.go:139] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth"
I0330 19:08:21.569952 35291 kubelet.go:577] Hairpin mode set to "hairpin-veth"
W0330 19:08:21.570207 35291 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0330 19:08:21.570229 35291 plugins.go:190] Loaded network plugin "cni"
I0330 19:08:21.570291 35291 client.go:80] Connecting to docker on unix:///var/run/docker.sock
I0330 19:08:21.570305 35291 client.go:109] Start docker client with request timeout=2m0s
W0330 19:08:21.572091 35291 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
W0330 19:08:21.575640 35291 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0330 19:08:21.575711 35291 plugins.go:190] Loaded network plugin "cni"
I0330 19:08:21.575838 35291 docker_service.go:232] Docker cri networking managed by cni
All of this suggests an api model that cannot easily be reconciled with current versions of acs-engine. (Of course this is not ideal, just a reflection of the current limitations of what acs-engine does reliably.)
Are you able to build a new cluster and install your workloads on it?
@jackfrancis only if I am able to move the data from the PVs to the new cluster
@jackfrancis this is my apimodel.json
{
"apiVersion": "vlabs",
"location": "westeurope",
"properties": {
"orchestratorProfile": {
"orchestratorType": "Kubernetes",
"orchestratorRelease": "1.8",
"orchestratorVersion": "1.8.10",
"kubernetesConfig": {
"kubernetesImageBase": "gcrio.azureedge.net/google_containers/",
"clusterSubnet": "10.244.0.0/16",
"dnsServiceIP": "10.240.255.254",
"serviceCidr": "10.240.0.0/16",
"networkPolicy": "calico",
"maxPods": 50,
"dockerBridgeSubnet": "172.17.0.1/16",
"useInstanceMetadata": true,
"enableRbac": true,
"enableSecureKubelet": true,
"privateCluster": {
"enabled": false
},
"gchighthreshold": 85,
"gclowthreshold": 80,
"etcdVersion": "2.3.7",
"etcdDiskSizeGB": "128",
"addons": [
{
"name": "tiller",
"enabled": true,
"containers": [
{
"name": "tiller",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"max-history": "0"
}
},
{
"name": "aci-connector",
"enabled": false,
"containers": [
{
"name": "aci-connector",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"nodeName": "aci-connector",
"os": "Linux",
"region": "westus",
"taint": "azure.com/aci"
}
},
{
"name": "kubernetes-dashboard",
"enabled": true,
"containers": [
{
"name": "kubernetes-dashboard",
"cpuRequests": "300m",
"memoryRequests": "150Mi",
"cpuLimits": "300m",
"memoryLimits": "150Mi"
}
]
},
{
"name": "rescheduler",
"enabled": false,
"containers": [
{
"name": "rescheduler",
"cpuRequests": "10m",
"memoryRequests": "100Mi",
"cpuLimits": "10m",
"memoryLimits": "100Mi"
}
]
},
{
"name": "metrics-server",
"enabled": false,
"containers": [
{
"name": "metrics-server"
}
]
}
],
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.240.255.254",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "110",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.0.0.0/8",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
},
"controllerManagerConfig": {
"--allocate-node-cidrs": "true",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.244.0.0/16",
"--cluster-name": "k8svl",
"--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
"--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
"--feature-gates": "",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--node-monitor-grace-period": "40s",
"--pod-eviction-timeout": "5m0s",
"--profiling": "false",
"--root-ca-file": "/etc/kubernetes/certs/ca.crt",
"--route-reconciliation-period": "10s",
"--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--terminated-pod-gc-threshold": "5000",
"--use-service-account-credentials": "true",
"--v": "2"
},
"cloudControllerManagerConfig": {
"--allocate-node-cidrs": "true",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.244.0.0/16",
"--cluster-name": "k8svl",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--route-reconciliation-period": "10s",
"--v": "2"
},
"apiServerConfig": {
"--admission-control": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota,DenyEscalatingExec,AlwaysPullImages",
"--advertise-address": "<kubernetesAPIServerIP>",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--audit-log-maxage": "30",
"--audit-log-maxbackup": "10",
"--audit-log-maxsize": "100",
"--audit-log-path": "/var/log/audit.log",
"--audit-policy-file": "/etc/kubernetes/manifests/audit-policy.yaml",
"--authorization-mode": "Node,RBAC",
"--bind-address": "0.0.0.0",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
"--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
"--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
"--etcd-quorum-read": "true",
"--etcd-servers": "https://127.0.0.1:2379",
"--insecure-port": "8080",
"--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
"--kubelet-client-key": "/etc/kubernetes/certs/client.key",
"--profiling": "false",
"--repair-malformed-updates": "false",
"--secure-port": "443",
"--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
"--service-account-lookup": "true",
"--service-cluster-ip-range": "10.240.0.0/16",
"--storage-backend": "etcd2",
"--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
"--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--v": "4"
}
}
},
"masterProfile": {
"count": 3,
"dnsPrefix": "k8svl",
"vmSize": "Standard_D2_v2",
"firstConsecutiveStaticIP": "10.240.255.5",
"storageProfile": "ManagedDisks",
"oauthEnabled": false,
"preProvisionExtension": null,
"extensions": [],
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.240.255.254",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "110",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.0.0.0/8",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
}
},
"agentPoolProfiles": [
{
"name": "pool01",
"count": 3,
"vmSize": "Standard_DS4_v2",
"osType": "Linux",
"availabilityProfile": "AvailabilitySet",
"storageProfile": "ManagedDisks",
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.240.255.254",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "Accelerators=true",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "110",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.0.0.0/8",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
},
"fqdn": "",
"preProvisionExtension": null,
"extensions": []
}
],
"linuxProfile": {
"adminUsername": "foo",
"ssh": {
"publicKeys": [
{
"keyData": ""
}
]
}
},
"servicePrincipalProfile": {
"clientId": "",
"secret": ""
},
"certificateProfile": {
"caCertificate": "",
"caPrivateKey": "",
"apiServerCertificate": "",
"apiServerPrivateKey": "",
"clientCertificate": "",
"clientPrivateKey": "",
"kubeConfigCertificate": "",
"kubeConfigPrivateKey": "",
"etcdServerCertificate": "",
"etcdServerPrivateKey": "",
"etcdClientCertificate": "",
"etcdClientPrivateKey": "",
"etcdPeerCertificates": [
"",
"",
"",
""
]
}
}
}
@jackfrancis I generated a new apimodel.json with the latest acs-engine, but there is no significant change in structure, so this doesn't seem to be a problem with my cluster configuration.
more logs (cloud init output)
+ bash /etc/kubernetes/generate-proxy-certs.sh
[...]
subject=/CN=aggregator/O=system:masters
Getting CA Private Key
seq: invalid floating point argument: 'etcdctl'
Try 'seq --help' for more information.
Executed "" times
Error: client: etcd cluster is unavailable or misconfigured
error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #2: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
Error: client: etcd cluster is unavailable or misconfigured
error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #2: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
Error: client: etcd cluster is unavailable or misconfigured
error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #2: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
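As an aside, the `seq: invalid floating point argument: 'etcdctl'` and `Executed "" times` lines above smell like a retry helper being invoked with its arguments out of order, or with an empty count variable, so that something non-numeric lands where `seq` expects a number. A hypothetical reproduction (`retry_sketch` is a stand-in, not the actual provisioning helper):

```shell
# retry_sketch is a hypothetical stand-in for the provisioning script's
# retry helper, which loops via `seq $retries`. If the helper is ever
# called without the leading numeric arguments, the command name lands
# where the count belongs and seq rejects it.
retry_sketch() {
  local retries=$1; shift
  local i
  for i in $(seq "$retries"); do
    "$@" && return 0
  done
  echo "Executed \"$retries\" times" >&2
  return 1
}

retry_sketch 3 true   # correct: count first, then the command

# Buggy call: seq receives "etcdctl" and complains, the loop body never
# runs, and the helper falls through to its failure message -- compare
# the Executed "" times in the log, where the count expands to nothing.
retry_sketch etcdctl member list 2>/dev/null || true
```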
@jackfrancis I managed to upgrade to 1.9.6 with the correct etcd 2.3.7, but it is still not working: no cni config is found, plus the errors in the previous comment. Is the script or the cloud-init config stored in the VM? I would like to just run it manually.
I ran an ad hoc test against your api model (acs-engine:v0.14.5) and couldn't get a working cluster:
$ kubectl get nodes -o json
2018/03/30 20:44:08 Error trying to run 'kubectl get nodes':{
"apiVersion": "v1",
"items": [],
"kind": "List",
"metadata": {
"resourceVersion": "",
"selfLink": ""
}
}
About manually setting up calico:
- make sure KUBELET_NETWORK_PLUGIN=cni is set in /etc/default/kubelet
- make sure DOCKER_OPTS=--volume=/etc/cni/:/etc/cni:ro --volume=/opt/cni/:/opt/cni:ro is set in /etc/default/kubelet
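For reference, those two checks amount to lines roughly like these in /etc/default/kubelet. This is illustrative only; the exact file contents vary by acs-engine version and cluster config:

```shell
# Sketch of the relevant /etc/default/kubelet lines; exact contents
# vary by acs-engine version, so treat this as illustrative only.
KUBELET_NETWORK_PLUGIN=cni
DOCKER_OPTS="--volume=/etc/cni/:/etc/cni:ro --volume=/opt/cni/:/opt/cni:ro"
```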
Calico does not currently work w/ CNI, so your kubelet runtime config should be --network-plugin=kubenet
at least you have connectivity :)
must I change calico for azure?
KUBELET_NETWORK_PLUGIN was not in /etc/default/kubelet; I added it, but the flag --network-plugin=cni is there.
The volume flags are present, but the host's /etc/cni is empty.
If you're using Calico for k8s networkPolicy you have to use kubenet for IPAM. So change to --network-plugin=kubenet
@jackfrancis already changed, and rebooted the VM; nothing new.
On the other hand, I cannot create a new cluster because there is a lack of VMs in westeurope; I already tried 3 sizes.
@jackfrancis I don't mind removing calico if it solves the issue
Here is the provision script we run, for reference, if you want to try replaying things manually:
https://github.com/Azure/acs-engine/blob/master/parts/k8s/kubernetesmastercustomscript.sh
(also in /opt/azure/containers/provision.sh)
The configNetworkPolicy function is where the various network options are applied on the host.
@jackfrancis
ETCD_PEER_CERT=$(echo ${ETCD_PEER_CERTIFICATES} | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d',' -f $((${MASTER_INDEX}+1)))
ETCD_PEER_KEY=$(echo ${ETCD_PEER_PRIVATE_KEYS} | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d',' -f $((${MASTER_INDEX}+1)))
returns nothing, and that's just the beginning of the script.
I really wonder why this script doesn't just exit when a required value is missing, instead of blindly running commands.
I suspect there is another script that sets up all these empty vars before this one runs.
Any clue?
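For reference, that cut pipeline just strips the surrounding brackets and then picks the (MASTER_INDEX+1)-th comma-separated field, so an empty or unset ETCD_PEER_CERTIFICATES silently yields an empty string rather than an error:

```shell
# The cut pipeline from the provision script, run against sample input:
# strip up to '[', strip from ']', then take field MASTER_INDEX+1.
ETCD_PEER_CERTIFICATES='[certA,certB,certC]'
MASTER_INDEX=1
ETCD_PEER_CERT=$(echo ${ETCD_PEER_CERTIFICATES} | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d',' -f $((MASTER_INDEX+1)))
echo "$ETCD_PEER_CERT"   # certB

# With the variable empty (as observed on this cluster), the same
# pipeline quietly produces an empty string and the script carries on.
ETCD_PEER_CERTIFICATES=''
EMPTY=$(echo ${ETCD_PEER_CERTIFICATES} | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d',' -f 1)
echo "empty='${EMPTY}'"   # empty=''
```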
Every key in /etc/kubernetes/azure.json is empty, so it looks like something in the provisioning step is not working.
I opened a critical case with Azure; I just cannot keep having a down cluster with no solution. Thanks for your help @jackfrancis
I am really upset with acs-engine. I hope the same code is not used in AKS, or it will be scary to use.
acs-engine comes with no operational guarantees. This is the contract users make w/ this tool. I understand why your experience is upsetting; we take this as valuable feedback that our documentation is not adequately expressing the limits of what will and may not work w/ respect to cluster lifecycle management driven by acs-engine only.
It is precisely these kinds of limitations in acs-engine which inform AKS as a service value proposition above and beyond; and inform the additional orchestration implementations that separate AKS from acs-engine.
AKS uses a subset of acs-engine as an SDK-like library dependency, and not the entire codebase as-is, which is the workflow you’re currently suffering through.
To repeat and emphasize for clarity and transparency (not because it is what you want to hear): acs-engine is a collaborative, open source project to facilitate rapid development of Kubernetes (and other container cluster kit) on Azure. The only support model is the PR process, either through the issue process, advocacy, and project maintainer response in the form of PR + release, or the submission and acceptance of your own PR. By design, this model produces changes in code (and potentially against your existing clusters, or onto novel clusters) over the course of days/weeks/months; but not in an appropriate amount of time to facilitate a production operation response.
Once Azure identifies this as a customer-built and maintained cluster, they will deprioritize the issue you’ve opened.
Again, sorry that there is no good news here w/ respect to your current situation, I hope that this transparency is helpful, especially in the long term as it pertains to whether or not acs-engine is an appropriate tool for your Kubernetes cluster management toolkit.
@jackfrancis I understand what you say, and I understand the risk of using acs-engine. That said, what I don't understand is how an MS/Azure product/project (because it is under the MS umbrella) is in this shape. I also don't understand how the process of releasing "stable" versions is going, where clearly not every previously reported case has been taken into consideration before release. We can talk about the problems reported here and never fixed; actually, the main problem in this ticket was reported by me months ago. I spent a considerable amount of time digging and pointing out possible solutions, but nothing has been fixed.
At the same time, acs-engine declares "this is a non-supported product" while saying "this is community driven and we listen to your feedback and implement it". If you check my open issues here, you will see I spent a considerable amount of time giving feedback and tangible solutions, but nothing gets implemented or fixed.
What I learnt today is: acs-engine is not supported, but it is not community driven either, as it follows a roadmap not decided by the community. So it's a "community project driven by Azure product interests", and that never works.
Please don't mistake my frustration with how MS/Azure/ACS deals with all these problems for a lack of appreciation for the team's work, but the ACS team needs clearer direction and a cleaner separation of community-driven vs product-driven work.
Final feedback: put, in big/bold/shiny text on the first line of the README, "Don't use this for production workloads under any circumstance".
These are valid criticisms and reflect immature aspects of this project:
This project started out as a "let's see what happens when we open source the Azure ARM template conveniences to the OSS community on Azure that is interested in prototyping container orchestrator clusters"; i.e., the intent of the open source aspects of the project was intrinsically experimental, rather than a purposeful project with a specific Microsoft-desired outcome. The intent of doing this in the open was to empower folks who were impatient with the maturation process of SLA-backed Azure service offerings (e.g., AKS, which is not yet GA) but whose business goals aligned with this particular tech stack category (e.g., Kubernetes, docker).
Arguably, we can do more to engage community contributions, and improve the above criticisms. We take that feedback seriously.
Consider, though, that the primary objective of this project is to enable folks to iterate and build upon each others' ideas and work to produce novel cluster deployments in Azure. To that end this project continues to add value, with the risk associated with all the above-mentioned caveats.
I would accept the feedback that more disclaimer material would be valuable to warn folks about the support model, but I would push back on your representation of what's dangerous. It's not "production workloads" that acs-engine is operating against: for that, it's the Kubernetes API and the way it is configured that matter. The hard work is rationalizing the Azure API w/ Kubernetes-supporting IaaS + Kubernetes runtime config. The intention of acs-engine is to provide tooling for the user to achieve the outcome of a working Kubernetes cluster on Azure. Once that outcome has been achieved, whether or not production workloads should be scheduled to a particular cluster really depends on the viability of that cluster's configuration as compared to the requirements of the workloads that may land there, including the configuration of the IaaS underneath. This is not an acs-engine problem: acs-engine merely aims to streamline the process of defining, declaring, and applying these IaaS + k8s configurations onto Azure.
I would agree, however, that upgrade + scale functionality in acs-engine in its current state is not an acceptable cluster lifecycle management dependency for a production cluster. Whether or not its limitations are more or less reliable than a hand-rolled cluster lifecycle toolkit is up to the discretion of each user. That reality can be better documented, and we will do so.
Thanks for your continued feedback!
Thanks for your time and sincerity @jackfrancis. I totally understand the complexity of the project, but I also expect a high-quality outcome from an MS-driven project. I think the main issue is in these words:
"The intention of acs-engine is to provide tooling for the user to achieve the outcome of a working Kubernetes cluster on Azure."
The meaning of "working Kubernetes cluster on Azure" can differ for everyone. For me it includes maintenance; not necessarily "major upgrade" support, but at least proper troubleshooting options and the basic maintenance tasks needed to keep a "working Kubernetes cluster on Azure" over time, not just "once".
Maybe listing which features are stable and which are not would help, so at least expectations are managed.
Thanks for your time
@jackfrancis this is another example: #1961
The change is justified, but it is a breaking change, not properly documented, and with no upgrade path attached.
This change made me waste 3 days trying to figure out why gitlab-runner was not working anymore.
@jackfrancis this table is an accurate visualisation of my frustration with acs-engine:
So the only way to "fix" a 1.8.4 cluster is to upgrade to at least 1.8.5 so azure-file works as expected (that is what I was trying to fix with this upgrade), but if you jump "too far" to 1.9.0, it breaks again!
Notice how I need to go to the official MS Azure docs to find information about the community-driven acs-engine.
Hi @jalberto, for the azure file fileMode and dirMode issue, there have been design changes back and forth; I would suggest using azure file mountOptions to set what you want: https://github.com/andyzhangx/demo/blob/master/linux/azurefile/azurefile-mountoptions.md
@andyzhangx agreed; the problem is that you need at least k8s 1.8.5 to use mountOptions,
so when I upgraded my cluster from 1.8.4 to 1.8.x everything broke.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.
Is this a request for help?: YES & a BUG REPORT
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: 0.14.5
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.8.4 to 1.8.10
What happened: ran the `upgrade` command following #2062 and had lots of trouble:
What you expected to happen: to work
How to reproduce it (as minimally and precisely as possible): just try to upgrade an existing cluster
Anything else we need to know: this is really critical, as my prod cluster is down right now