AlexGrs opened this issue 7 years ago
Right now, you would need to perform the upgrade manually. In the coming months we will be building managed updates in the ACS service.
If you wanted to upgrade manually, you'd SSH to each node, edit `/etc/systemd/system/kubelet.service`, and change the referenced hyperkube image version. You'd possibly also want to upgrade `kubectl`, since it's installed on the nodes (or at least on the master node). Then just reboot the node, draining it first if you want to be diligent, etc.
Ok, great. I will try this. Do you know if there are any plans to add 1.5 to ACS soon, to replace 1.4.6?
We're in progress on it, but are holding off until after New Year, due to deployment "no fly zones" for the holidays.
Great. I made the update on my master node to 1.5.1 and it went flawlessly. I will automate it with a Fabric script for the time being. Thanks for the support!
Hm. After updating `/etc/systemd/system/kubelet.service` to use 1.5.1 and rebooting the node, the server is still on 1.4.6:
```
kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:57:05Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:16:27Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
```
@AlexGrs Oh, my mistake. You basically just upgraded `kubelet` on the master, but not the static pod manifests that `kubelet` runs (which include `apiserver`).
You'll need to make the same 1.4.6 -> 1.5.1 replacement inside the files in `/etc/kubernetes/manifests/`. You'll likely want to check the static pod manifests in `/etc/kubernetes/addons/` as well; `kube-proxy`, for example, ought to be bumped up too.
@AlexGrs .. I just went through a similar upgrade. After the systemd kubelet unit, `grep -R v1.4 /etc/kubernetes` usually helps you see where changes are needed. Then sed is your friend, a la `sed -i -e "s@v1.4.6@v1.5.1@g"`. Nodes should be drained first; masters should just be rebooted. Good luck!
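Putting the grep and the sed above together, a hedged sketch of the manifest sweep (the function name is mine; the directory and version strings are the ones from this thread):

```shell
# Find every file under a directory that still references the old
# version tag, and rewrite it in place.
upgrade_manifests() {
  dir=$1; old=$2; new=$3
  grep -Rl "$old" "$dir" | while read -r f; do
    sed -i -e "s@${old}@${new}@g" "$f"
  done
}

# e.g. upgrade_manifests /etc/kubernetes v1.4.6 v1.5.1
```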
Just made my cluster work by using https://github.com/Azure/acs-engine and hacking version 1.5.1 into the generated template just before deployment. Finally, persistent volumes are working flawlessly.
@colemickens Any updates on Kubernetes 1.5.1 as the default version for deployments via Azure Container Service?
@phimar I did the same.
> Finally, persistent volumes are working flawlessly.
Really? I haven't tested in 1.5.1 yet. I'm going to take a look again.
Cheers.
@otaviosoares Yep, it's working. It is as simple as configuring a storage class for `azureDisk` and a persistent volume claim for your deployment. The VHDs are created and mounted to the correct agent automatically.
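For reference, a minimal sketch of what that setup might look like on a 1.5-era cluster (the names and size are mine; at that time a claim selected its class via the beta annotation rather than a `storageClassName` field):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: azure-standard              # illustrative name
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Standard_LRS             # or Premium_LRS
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data-claim                  # illustrative name
  annotations:
    volume.beta.kubernetes.io/storage-class: azure-standard
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```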
@phimar Yep, but there's still that problem with disks over 30 GB.
@phimar We're working on getting it rolled out. No firm ETA beyond end of January, but it should be sooner.
@theobolo what problem?
@colemickens The `mkfs.ext4` run that takes hours when kubelet tries to format a new PersistentVolume.
https://github.com/kubernetes/kubernetes/pull/38865 https://github.com/kubernetes/kubernetes/issues/30752
If I want to mount a 500 GB PersistentVolume, it takes something like 1 hour.
I'm planning to move our application to Kubernetes on Azure. Regarding the issue you mentioned, @theobolo: does it mean that if I add a 128 GB persistent disk, my deployment will take forever to finish? At least the first time? Do you have a workaround for this?
Preformat the disk and you'll avoid that issue. There are patches in flight to tweak the flags passed to mkfs, to try to avoid the issue entirely.
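As a rough illustration of the preformat idea (device paths are environment-specific; here a loopback image file stands in for the attached VHD so the snippet is safe to run anywhere):

```shell
# Sketch of preformatting a disk before Kubernetes ever sees it.
# In production you would run mkfs.ext4 against the VHD's device on a
# VM it is attached to (verify the device with lsblk first!).
preformat() {
  # -F allows operating on a regular file (the loopback stand-in); -q is quiet
  mkfs.ext4 -q -F "$1"
}

# demo against a sparse file instead of a real device:
#   truncate -s 64M /tmp/disk.img && preformat /tmp/disk.img
```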
Also, if you just use the dynamic disk provisioning feature, you won't hit this issue; that's much easier than manually creating the VHD anyway.
@colemickens I'm using the dynamic disk provisioning feature with a PVC and an "Azure" StorageClass backed by Premium Storage; I tried with a classic pod and with a StatefulSet using PVC templates.
But still, if I claim a 500 GB volume it takes more than 1 hour (I'm actually validating this by deploying a Mongo StatefulSet with a 500 GB volume per instance, using my Azure StorageClass).
It doesn't seem that using the k8s dynamic provisioning solves that issue, since kubelet still tries to format any newly provisioned empty disk; that's why I'm saying that.
kube-controller logs when mounting the first instance with the disk:

```
2017-01-09T10:48:20.820294292Z I0109 10:48:20.819950 1 reconciler.go:202] Started AttachVolume for volume "kubernetes.io/azure-disk/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd" to node "k8s-agentpool1-35197013-3"
2017-01-09T10:48:20.938835753Z I0109 10:48:20.938470 1 operation_executor.go:620] AttachVolume.Attach succeeded for volume "kubernetes.io/azure-disk/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd" (spec.Name: "pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18") from node "k8s-agentpool1-35197013-3".
2017-01-09T10:48:21.030661411Z I0109 10:48:21.030141 1 node_status_updater.go:135] Updating status for node "k8s-agentpool1-35197013-3" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"1\",\"name\":\"kubernetes.io/azure-disk/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd\"}]}}" VolumesAttached: [{kubernetes.io/azure-disk/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd 1}]
2017-01-09T10:48:35.378882802Z I0109 10:48:35.378509 1 pet_set.go:324] Syncing StatefulSet default/mongo with 1 pods
2017-01-09T10:48:35.380575953Z I0109 10:48:35.380299 1 pet_set.go:332] StatefulSet mongo blocked from scaling on pod mongo-0
```
and the related kubelet logs after 1 hour:

```
E0109 11:51:47.493407 4592 mount_linux.go:391] Could not determine if disk "" is formatted (exit status 1)
E0109 11:51:47.493677 4592 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/azure-disk/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd\"" failed. No retries permitted until 2017-01-09 11:53:47.49364695 +0000 UTC (durationBeforeRetry 2m0s). Error: MountVolume.MountDevice failed for volume "kubernetes.io/azure-disk/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd" (spec.Name: "pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18") pod "9d10dd78-d658-11e6-a3e7-000d3ab4db18" (UID: "9d10dd78-d658-11e6-a3e7-000d3ab4db18") with: mount failed: exit status 1
Mounting command: mount
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd ext4 [defaults]
Output: mount: can't find /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/coursier-preprod-dynamic-pvc-9d1069a0-d658-11e6-a3e7-000d3ab4db18.vhd in /etc/fstab
```
And the Kubernetes Dashboard, waiting for the disk: *(screenshot)*
And the PersistentVolumes: *(screenshot)*
After 1 hour I'm still waiting for my first Mongo pod... Am I missing something?
Sorry, @theobolo you're correct, I got wires crossed. I pinged the relevant PR again last night to try to push it along. I'll escalate it in the next two days if it doesn't pick up momentum.
@colemickens Thanks, Cole ;)
When your PR is merged, will it be available in the next Kubernetes release? I'll try to hack around acs-engine to deploy the cluster, to stay as up to date as possible with the latest improvements to Kubernetes on Azure.
It's not my PR, but yes, that's generally how it works. You can always build your own release and deploy it with ACS-Engine. Since ACS-Engine does everything with the hyperkube image, you merely need to build it yourself. My dev cycle is usually:

```shell
export REGISTRY=docker.io/colemickens
export VERSION=some-version
./hack/dev-push-hyperkube.sh
```

And then after it's done, I can use `docker.io/colemickens/hyperkube-amd64:some-version` as the `hyperkubeSpec` with the ACS-Engine output to run my custom build.
Hey @colemickens: you mentioned using dynamic disk provisioning. I tried to find the relevant documentation and found only this link. In that example, the `diskURI` is mandatory. But with a dynamic claim, it should be created automatically, right? Or am I missing something?
There is dynamic disk provisioning, just like in GCE or AWS. I think the documentation is absent: https://github.com/kubernetes/kubernetes/pull/30091. It should be available now via https://github.com/kubernetes/kubernetes.github.io/pull/2039.
I seem to hit a timeout issue when trying to mount a persistent volume.
I checked, and a VHD is indeed created in my storage account. I tried to check the logs in the controller, and the mount on the host seems to be working. Here are some logs from kube-controller:
```
2017-01-12T14:25:54.311786013Z I0112 14:25:54.311577 1 replication_controller.go:322] Observed updated replication controller postgresql. Desired pod count change: 1->1
2017-01-12T14:25:54.341387770Z I0112 14:25:54.341218 1 replication_controller.go:322] Observed updated replication controller postgresql. Desired pod count change: 1->1
2017-01-12T14:25:59.248709444Z I0112 14:25:59.248391 1 operation_executor.go:700] DetachVolume.Detach succeeded for volume "kubernetes.io/azure-disk/kapptivatekuber-dynamic-pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1.vhd" (spec.Name: "pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1") from node "k8s-agentpool-17601863-0".
2017-01-12T14:25:59.257773962Z I0112 14:25:59.257342 1 reconciler.go:202] Started AttachVolume for volume "kubernetes.io/azure-disk/kapptivatekuber-dynamic-pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1.vhd" to node "k8s-agentpool-17601863-0"
2017-01-12T14:27:59.776711601Z I0112 14:27:59.776514 1 operation_executor.go:620] AttachVolume.Attach succeeded for volume "kubernetes.io/azure-disk/kapptivatekuber-dynamic-pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1.vhd" (spec.Name: "pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1") from node "k8s-agentpool-17601863-0".
2017-01-12T14:27:59.875333383Z I0112 14:27:59.875169 1 node_status_updater.go:135] Updating status for node "k8s-agentpool-17601863-0" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"0\",\"name\":\"kubernetes.io/azure-disk/kapptivatekuber-dynamic-pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1.vhd\"}]}}" VolumesAttached: [{kubernetes.io/azure-disk/kapptivatekuber-dynamic-pvc-e005ea48-d8d1-11e6-a869-000d3a34f8f1.vhd 0}]
2017-01-12T14:32:15.637253775Z W0112 14:32:15.637047 1 reflector.go:319] pkg/controller/garbagecollector/garbagecollector.go:760: watch of <nil> ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [143998/143434]) [144997]
2017-01-12T14:33:44.029235618Z I0112 14:33:44.029030 1 replication_controller.go:541] Too few "default"/"postgresql" replicas, need 1, creating 1
2017-01-12T14:33:44.047561542Z I0112 14:33:44.047416 1 event.go:217] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"default", Name:"postgresql", UID:"06a7037c-d8d3-11e6-a869-000d3a34f8f1", APIVersion:"v1", ResourceVersion:"144311", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: postgresql-tj1z6
2017-01-12T14:33:44.081956887Z I0112 14:33:44.081868 1 replication_controller.go:322] Observed updated replication controller postgresql. Desired pod count change: 1->1
2017-01-12T14:33:44.111482326Z I0112 14:33:44.111365 1 replication_controller.go:322] Observed updated replication controller postgresql. Desired pod count change: 1->1
```
@AlexGrs Do you have the kubelet log on that host?
@rootfs: For the host where the pod is located? I will check how to SSH onto it to check the logs.
@AlexGrs Yes, from the host where the pod lands.
@AlexGrs What is the size of the PersistentVolume that you wanted to deploy?
Because, as I said, if you try to provision a disk bigger than 30 GB it can take a long time before the pod mounts it correctly.
Even if the kube-controller says that the volume is mounted, that doesn't mean it's formatted: that's why the error is still there and why your pod is not available.
For example, when I deploy a 30 GB disk on Premium Storage used by a Jenkins pod, the first time it takes something like 15-20 minutes, because kubelet needs to run `mkfs.ext4` on that disk before the pod starts. That's why you have that error.
Just wait a little bit, or try with a smaller disk ;)
I managed to connect to my agent running this pod. During this time (nearly 20 minutes), it seems the pod finally managed to mount the disk. I deleted the `rc` and created it again, and it worked after 1 or 2 minutes.
The size of the disk is 30Gi. My guess is that it was performing some kind of formatting/verification on my volume the first time, and then doesn't need to do it the next time. It may be related to @theobolo's issue: the more capacity the disk has, the more time it takes to mount in a pod.
@AlexGrs That's exactly the point.
Ah! That's why. I saw there is an ongoing PR for lazy verification. Can't wait to have this available, as we have some huge databases. Is there any workaround while waiting for the PR?
@AlexGrs The only workaround today is, as @colemickens said, formatting your disk manually (you can use his guide: https://github.com/colemickens/azure-kubernetes-demo).
I did that to format two 500 GB disks used by Jenkins and Nexus Repo; when the disks are preformatted, kubelet won't try to format them and will mount the disks in 2-3 minutes maximum.
That's the only way to use big persistent disks on Azure (and to reach P30 performance on Premium Storage, since disk performance is indexed on disk size in Azure).
Last thing: to mount your VHD once it is formatted, you can use this:

```yaml
volumes:
  - azureDisk:
      diskURI: https://diskurlwithvhd
      diskName: data-master
    name: some-disk
```
@AlexGrs A workaround/hack is to use this script: https://gist.github.com/codablock/9b8c3a09b6f725436143da575d23ca45. It is a wrapper around `mkfs.ext4` that removes all lazy-init-related flags from the `mkfs.ext4` call.
To use it:

```shell
$ mv /usr/sbin/mkfs.ext4 /usr/sbin/mkfs.ext4.original
$ wget -O /usr/sbin/mkfs.ext4 https://gist.githubusercontent.com/codablock/9b8c3a09b6f725436143da575d23ca45/raw/ed6e604ec71c2230e889b625b85d2986d0e6eb18/mkfs.ext4%2520lazy%2520init%2520hack
$ chmod +x /usr/sbin/mkfs.ext4
```

I deploy this with Ansible (kargo) right now. It assumes that the host's `mkfs.ext4` is used. I'm not sure how acs-engine deploys kubelet, but I'd expect it to be deployed as a regular service and not as a containerized kubelet. If that is not the case, the script would have to be put into the hyperkube image (making things complicated).
@codablock Unfortunately, `kubelet` is deployed using the hyperkube image in ACS-Engine.
@theobolo : I think you did not finish your answer ;)
@codablock : kubelet is running in a container with ACS engine
Is kubelet run with nsenter on acs-engine?
Just saw the screenshot from @theobolo, and it looks like it is not run with nsenter. This means you'd somehow have to modify the hyperkube image to make the wrapper work. If acs-engine supports specifying a custom hyperkube image, that could be done by extending from the original image, installing the script in it, pushing it to Docker Hub, and then using the custom/modified image.
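That extend-the-image approach might look something like this (untested sketch; the base tag and the wrapper script are the ones from earlier in this thread, and the file names are placeholders):

```dockerfile
# Sketch: extend the stock hyperkube image so the containerized kubelet
# uses the mkfs.ext4 wrapper from the gist above.
FROM gcr.io/google_containers/hyperkube-amd64:v1.5.1

# Keep the original binary; the wrapper calls it after stripping the
# lazy-init-related flags. The path may differ inside the image --
# check with `which mkfs.ext4` first.
RUN mv /usr/sbin/mkfs.ext4 /usr/sbin/mkfs.ext4.original
COPY mkfs.ext4-wrapper /usr/sbin/mkfs.ext4
RUN chmod +x /usr/sbin/mkfs.ext4
```

You would then push the resulting image to a registry and point the ACS-Engine `hyperkubeSpec` at it, as @colemickens described earlier.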
@codablock It's possible to use a custom hyperkube image. It seems heavy, but it should work...
It seems it's the approach @colemickens described in his previous post.
@AlexGrs What he describes is a complete rebuild of Kubernetes. That would only be required if you wanted to make the changes directly in the k8s source tree, or if you wanted to build and use current master.
EDIT: If you want to do this, it's better to create a branch based on 1.5.2 and merge in https://github.com/kubernetes/kubernetes/pull/38865 instead of using this hack.
@codablock: I will maybe try this while waiting for your PR to be merged into the master branch.
I tried with a `Premium_LRS` volume now instead of a `Standard_LRS` one, but even after 30 minutes for a 30Gi disk, it fails to mount:
```
I0112 18:02:42.875630 4473 operation_executor.go:832] MountVolume.WaitForAttach succeeded for volume "kubernetes.io/azure-disk/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd" (spec.Name: "pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1") pod "2ed411fa-d8e9-11e6-a869-000d3a34f8f1" (UID: "2ed411fa-d8e9-11e6-a869-000d3a34f8f1").
E0112 18:02:42.880996 4473 mount_linux.go:119] Mount failed: exit status 1
Mounting command: mount
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd ext4 [defaults]
Output: mount: can't find /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd in /etc/fstab
E0112 18:02:42.883785 4473 mount_linux.go:391] Could not determine if disk "" is formatted (exit status 1)
E0112 18:02:42.884975 4473 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/azure-disk/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd\"" failed. No retries permitted until 2017-01-12 18:04:42.884230675 +0000 UTC (durationBeforeRetry 2m0s). Error: MountVolume.MountDevice failed for volume "kubernetes.io/azure-disk/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd" (spec.Name: "pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1") pod "2ed411fa-d8e9-11e6-a869-000d3a34f8f1" (UID: "2ed411fa-d8e9-11e6-a869-000d3a34f8f1") with: mount failed: exit status 1
Mounting command: mount
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd ext4 [defaults]
Output: mount: can't find /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/xxx-dynamic-pvc-4e49581b-d8e5-11e6-a869-000d3a34f8f1.vhd in /etc/fstab
```
@AlexGrs The PR is merged now. I would however wait for https://github.com/kubernetes/kubernetes/pull/40066 to be merged as well.
I created a Kubernetes cluster using the Azure portal, and it installed 1.5.3 by default. There is no way to upgrade the cluster easily. In order to have up-to-date Kubernetes clusters, we need this feature.
Upgrading in Azure from 1.5.3 to 1.5.7 is simple. Has anybody here successfully upgraded from 1.5.x to 1.6.x in Azure? I have a problem with the master not starting kubelet.service correctly when I try this approach. It seems that the `--config=/etc/kubernetes/manifests` parameter in the startup was removed:
```
[Unit]
Description=Kubelet
Requires=docker.service
After=docker.service
[Service]
Restart=always
ExecStartPre=/bin/mkdir -p /var/lib/kubelet
# Azure does not support two LoadBalancers(LB) sharing the same nic and backend port.
# As a workaround, the Internal LB(ILB) listens for apiserver traffic on port 4443 and the External LB(ELB) on port 443
# This IPTable rule then redirects ILB traffic to port 443 in the prerouting chain
ExecStartPre=/bin/bash -c "iptables -t nat -A PREROUTING -p tcp --dport 4443 -j REDIRECT --to-port 443"
ExecStartPre=/bin/sed -i "s|<kubernetesHyperkubeSpec>|gcr.io/google_containers/hyperkube-amd64:v1.6.2|g" "/etc/kubernetes/addons/kube-proxy-daemonset.yaml"
ExecStartPre=/bin/mount --bind /var/lib/kubelet /var/lib/kubelet
ExecStartPre=/bin/mount --make-shared /var/lib/kubelet
ExecStart=/usr/bin/docker run \
  --name=kubelet \
  --net=host \
  --pid=host \
  --privileged \
  --volume=/dev:/dev \
  --volume=/sys:/sys:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/var/lib/docker/:/var/lib/docker:rw \
  --volume=/var/lib/kubelet/:/var/lib/kubelet:shared \
  --volume=/var/log:/var/log:rw \
  --volume=/etc/kubernetes/:/etc/kubernetes:ro \
  --volume=/srv/kubernetes/:/srv/kubernetes:ro \
  gcr.io/google_containers/hyperkube-amd64:v1.5.7 \
  /hyperkube kubelet \
    --api-servers="https://10.240.255.5:443" \
    --kubeconfig=/var/lib/kubelet/kubeconfig \
    --address=0.0.0.0 \
    --allow-privileged=true \
    --enable-server \
    --enable-debugging-handlers \
    --config=/etc/kubernetes/manifests \
    --cluster-dns=10.0.0.10 \
    --cluster-domain=cluster.local \
    --register-schedulable=false \
    --cloud-provider=azure \
    --cloud-config=/etc/kubernetes/azure.json \
    --hairpin-mode=promiscuous-bridge \
    --network-plugin=kubenet \
    --azure-container-registry-config=/etc/kubernetes/azure.json \
    --v=2
ExecStop=/usr/bin/docker stop -t 10 kubelet
ExecStopPost=/usr/bin/docker rm -f kubelet
[Install]
WantedBy=multi-user.target
```
It seems that the kubelet is started as a Docker container. If I remove the parameter, the scheduler and api-server don't start.
Now that the recommended way of deploying a cluster is by using ACS, is there a recommended way to upgrade an existing Kubernetes cluster?
Right now, all the clusters I deployed are on 1.4.6, but I would like to benefit from the great work you made with 1.5 on Azure.