@flavio, @inercia: I'm still testing the patch, but so far it seems to work as expected. It's a WIP, but I'm mainly asking for your opinion on the `migrations` folder structure: the idea is to stop poisoning the main salt states with transient migration logic and instead keep it in dedicated folders that will be easy to drop in the near future and easy to search for when removing all references from the orchestrations...
More context: I'm keeping the original `update-(pre|post)-*` steps in their original folders when they apply to all updates (e.g. `kubelet.update-post-start-services`), regardless of specific migrations; in this case we always want to uncordon the node when it comes back, since we cordoned it before the update.
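As a rough sketch of what such a generic post-update step boils down to (the state id and the assumption that `kubectl` is usable from the targeted minion are illustrative, not necessarily what the patch ships):

```yaml
# Illustrative generic post-update step: uncordon the node once its
# services are back. Assumes kubectl is configured on the target; the
# real states may instead run this from the admin node.
uncordon-node-after-update:
  cmd.run:
    - name: kubectl uncordon {{ grains['nodename'] }}
```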
What goes into the `migrations/v1-v2` folder is migration-specific logic that can be removed in future versions. I found it far easier to reason about the whole update logic with this structure.
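Roughly, the layout and wiring I have in mind look like this (all names below are illustrative, not the final ones):

```yaml
# Illustrative layout:
#
#   kubelet/update-post-start-services.sls   # generic, applies to every update
#   migrations/v1-v2/init.sls                # transient v1 -> v2 logic only
#
# The orchestration pulls the migration states in explicitly, so dropping the
# whole migrations/v1-v2 folder later only means deleting these references:
apply-v1-v2-migrations:
  salt.state:
    - tgt: 'roles:kube-(master|minion)'
    - tgt_type: grain_pcre
    - sls:
      - migrations.v1-v2
```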
The upgrade went very smoothly. At all times I could see `Ready` nodes (the old ones, still able to contact the new masters, and the new ones as they were updated one by one):
```
NAME                                                 STATUS                        ROLES     AGE       VERSION
59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local   NotReady,SchedulingDisabled   <none>    54m       v1.8.9
90669b90cfd04db5826da66e3ef0dad7.infra.caasp.local   Ready                         <none>    53m       v1.8.9
b63f313481e74f92b8ff0aa66ed9f19b.infra.caasp.local   NotReady,SchedulingDisabled   <none>    53m       v1.8.9
d130787c801d4c4f8d9d851ac11a5f41.infra.caasp.local   Ready                         <none>    53m       v1.8.9
e58fd682b8a4471db237e0fc7992e2f4.infra.caasp.local   Ready                         <none>    53m       v1.8.9
master-0                                             Ready,SchedulingDisabled      <none>    8m        v1.9.8
worker-0                                             NotReady,SchedulingDisabled   <none>    8m        v1.9.8
worker-1                                             NotReady                      <none>    8m
worker-2                                             NotReady                      <none>    8m
worker-3                                             NotReady                      <none>    8m
worker-4                                             Ready                         <none>    8m        v1.9.8
```
```
NAME                                                 STATUS                        ROLES     AGE       VERSION
59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local   NotReady,SchedulingDisabled   <none>    57m       v1.8.9
d130787c801d4c4f8d9d851ac11a5f41.infra.caasp.local   Ready                         <none>    56m       v1.8.9
e58fd682b8a4471db237e0fc7992e2f4.infra.caasp.local   Ready,SchedulingDisabled      <none>    56m       v1.8.9
master-0                                             Ready,SchedulingDisabled      <none>    12m       v1.9.8
worker-0                                             Ready                         <none>    12m       v1.9.8
worker-1                                             NotReady                      <none>    12m
worker-2                                             Ready                         <none>    12m       v1.9.8
worker-3                                             NotReady                      <none>    12m
worker-4                                             Ready                         <none>    12m       v1.9.8
```
However, when all machines were done and we were waiting for the deployments, the master never got removed from the list and was never uncordoned. I'm looking into this at the moment; once it is fixed I think the upgrade process should be stable:
```
NAME                                                 STATUS                        ROLES     AGE       VERSION
59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local   NotReady,SchedulingDisabled   <none>    1h        v1.8.9
master-0                                             Ready,SchedulingDisabled      master    20m       v1.9.8
worker-0                                             Ready                         <none>    20m       v1.9.8
worker-1                                             Ready                         <none>    20m       v1.9.8
worker-2                                             Ready                         <none>    20m       v1.9.8
worker-3                                             Ready                         <none>    20m       v1.9.8
worker-4                                             Ready                         <none>    20m       v1.9.8
```
Updated the patch to also address the issue raised in https://github.com/kubic-project/salt/pull/626#issuecomment-403012348.
@inercia can you take a look please?
I'm still testing. I think it's worth handing this over to QA once I have finished my tests, so they can confirm that the problems are gone; after that we can merge if you are fine with the change.
The migration procedure is working really well now. Result:
```
admin:~ # kubectl get nodes -o wide
NAME       STATUS    ROLES     AGE       VERSION   EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION          CONTAINER-RUNTIME
master-0   Ready     master    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-0   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-1   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-2   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-3   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-4   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
```
At all times all nodes were `Ready` except for the one being updated. Now we have to check how the workers behave with active Ceph mounts.
Worked with @ereslibre and draining works (tested on v3). For the v2 to v3 upgrade the nodes were `NotReady`, so draining was not working.
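For context, the drain-before-update step amounts to something along these lines (a sketch only; the state id and flags are assumptions, and eviction generally only completes while the node is still reachable, which matches the `NotReady` failure above):

```yaml
# Illustrative pre-update step: evict workloads before restarting services.
# DaemonSet-managed pods cannot be evicted, hence --ignore-daemonsets.
drain-node-before-update:
  cmd.run:
    - name: kubectl drain {{ grains['nodename'] }} --ignore-daemonsets --force
```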