SUSE / caasp-salt

A collection of Salt states used to provision a Kubernetes cluster
Apache License 2.0

Fix update issues #626

Closed · ereslibre closed this 6 years ago

ereslibre commented 6 years ago

@flavio, @inercia: Still testing the patch, but so far it seems to work as expected. It's a WIP, but I'm mainly asking for your opinion on the migrations folder structure: the idea is to stop poisoning the main salt states with transient migration logic and instead keep that logic in dedicated folders, easy to drop in the near future, and with all references in orchestrations easy to search for and remove...
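
As a rough sketch of the "easy to search for and remove" part (the paths below are hypothetical, not necessarily this PR's final layout), dropping a finished migration would amount to:

# delete the obsolete migration states once v1->v2 upgrades are no longer supported
rm -r salt/migrations/v1-v2
# find any remaining references in states and orchestrations
grep -rn 'migrations' salt/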

ereslibre commented 6 years ago

More context: I'm keeping the original update-(pre|post)-* steps in their original folders; they apply to all updates (e.g. kubelet.update-post-start-services), regardless of any specific migration (in that case we always want to uncordon the node when it comes back, since we cordoned it before).

What goes into the migrations/v1-v2 folder is the migration-specific logic that can be removed in future versions. I found it far easier to reason about the whole update logic with this structure.
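
As an illustration of the split, the generic post-update step for a node boils down to uncordoning it again; a minimal sketch (the kubeconfig path and the use of the FQDN are assumptions for illustration, not the PR's actual state contents):

# re-enable scheduling on this node once its services are back up
# (the kubeconfig path is an assumed example)
kubectl --kubeconfig=/etc/kubernetes/kubelet.config uncordon "$(hostname -f)"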

ereslibre commented 6 years ago

The upgrade went very smoothly. At all times I could see Ready nodes (the old ones, which were able to contact the new masters, and the new ones as they were updated one by one):

NAME                                                 STATUS                        ROLES     AGE       VERSION
59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local   NotReady,SchedulingDisabled   <none>    54m       v1.8.9
90669b90cfd04db5826da66e3ef0dad7.infra.caasp.local   Ready                         <none>    53m       v1.8.9
b63f313481e74f92b8ff0aa66ed9f19b.infra.caasp.local   NotReady,SchedulingDisabled   <none>    53m       v1.8.9
d130787c801d4c4f8d9d851ac11a5f41.infra.caasp.local   Ready                         <none>    53m       v1.8.9
e58fd682b8a4471db237e0fc7992e2f4.infra.caasp.local   Ready                         <none>    53m       v1.8.9
master-0                                             Ready,SchedulingDisabled      <none>    8m        v1.9.8
worker-0                                             NotReady,SchedulingDisabled   <none>    8m        v1.9.8
worker-1                                             NotReady                      <none>    8m
worker-2                                             NotReady                      <none>    8m
worker-3                                             NotReady                      <none>    8m
worker-4                                             Ready                         <none>    8m        v1.9.8

NAME                                                 STATUS                        ROLES     AGE       VERSION
59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local   NotReady,SchedulingDisabled   <none>    57m       v1.8.9
d130787c801d4c4f8d9d851ac11a5f41.infra.caasp.local   Ready                         <none>    56m       v1.8.9
e58fd682b8a4471db237e0fc7992e2f4.infra.caasp.local   Ready,SchedulingDisabled      <none>    56m       v1.8.9
master-0                                             Ready,SchedulingDisabled      <none>    12m       v1.9.8
worker-0                                             Ready                         <none>    12m       v1.9.8
worker-1                                             NotReady                      <none>    12m
worker-2                                             Ready                         <none>    12m       v1.9.8
worker-3                                             NotReady                      <none>    12m
worker-4                                             Ready                         <none>    12m       v1.9.8
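
For reference, snapshots like the two above are just repeated node listings; the progress can also be followed live:

# poll node status every couple of seconds during the update
watch -n 2 kubectl get nodes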

However, when all machines were done and we were waiting for deployments, the master never got removed from the list and was never uncordoned. I'm looking into this at the moment; once this is fixed, I think the upgrade process should be stable:

NAME                                                 STATUS                        ROLES     AGE       VERSION
59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local   NotReady,SchedulingDisabled   <none>    1h        v1.8.9
master-0                                             Ready,SchedulingDisabled      master    20m       v1.9.8
worker-0                                             Ready                         <none>    20m       v1.9.8
worker-1                                             Ready                         <none>    20m       v1.9.8
worker-2                                             Ready                         <none>    20m       v1.9.8
worker-3                                             Ready                         <none>    20m       v1.9.8
worker-4                                             Ready                         <none>    20m       v1.9.8
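
While debugging this, the equivalent manual cleanup is plain kubectl (shown here only to illustrate the end state the orchestration should reach, not as the fix itself):

# drop the stale pre-upgrade node object that never left the list
kubectl delete node 59a524dffa4d4189bdac5a5ce10ed640.infra.caasp.local
# re-enable scheduling on the upgraded master
kubectl uncordon master-0
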
ereslibre commented 6 years ago

Updated the patch to also address the issue discussed in https://github.com/kubic-project/salt/pull/626#issuecomment-403012348.

flavio commented 6 years ago

@inercia can you take a look please?

ereslibre commented 6 years ago

I'm still testing. I think it's worth handing this over to QA once I have finished my tests, so they can confirm the problems are gone; after that we can merge, if you are fine with the change.

ereslibre commented 6 years ago

The migration procedure is working really well now. Result:

admin:~ # kubectl get nodes -o wide
NAME       STATUS    ROLES     AGE       VERSION   EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION          CONTAINER-RUNTIME
master-0   Ready     master    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-0   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-1   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-2   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-3   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1
worker-4   Ready     <none>    27m       v1.9.8    <none>        SUSE CaaS Platform 3.0   4.4.132-94.33-default   docker://17.9.1

At all times, all nodes were Ready except for the one being updated. Now we have to check how the workers behave with active Ceph mounts.
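
A quick way to spot such mounts on a worker before draining it (plain shell, nothing CaaSP-specific):

# list active ceph/rbd mounts on this worker
mount | grep -E 'ceph|rbd'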

ellisab commented 6 years ago

Worked with @ereslibre, and draining works (tested on v3). For the v2 to v3 upgrade the nodes were NotReady, so draining was not working.
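
For reference, "draining" here is the standard kubectl operation, which needs the API server to be able to evict the node's pods, so it cannot complete while a node is NotReady; roughly:

# evict pods from a node before updating it (flags as of kubectl 1.9)
kubectl drain worker-1 --ignore-daemonsets --delete-local-data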