@pibion anything else we need to save?
@zonca nope, that's everything. People working on the XSEDE instance have been warned that they should treat the disk space as volatile. I'll send out my usual "hey does anyone need help putting their work into a git repository" message.
ok, I'll send out an advance warning when I am ready for it
still waiting for the new images
we got the new image:
ae275170-b48c-4104-8af1-4d271f33a43c | Fedora-Atomic-29
and also Magnum was updated to the Openstack Train release.
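For reference, registering a cluster template against the new image would look roughly like this (a sketch; the flavor, network, and template names are assumptions, and the actual template is managed on the Jetstream side):

```
# Sketch (names and sizes are assumptions): register a Magnum cluster
# template that uses the new Fedora-Atomic-29 image on the Train release.
openstack coe cluster template create k8s-fedora-atomic-29 \
  --image Fedora-Atomic-29 \
  --external-network public \
  --master-flavor m1.medium \
  --flavor m1.medium \
  --coe kubernetes \
  --network-driver flannel
```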
@pibion, is it ok if I tear down the deployment in the next few days? Depending on the complexity of the upgrade, it might take 2 or 3 weeks to put it back online. But once that is done, we will have a newer version of Kubernetes, which should work better.
I will try to save the data volume and attach it back to the new deployment.
This updates Kubernetes from 1.11 (released summer 2018) to 1.15 (released summer 2019). This is good because some plugins are dropping support for 1.11.
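Once the new cluster is up, the upgrade can be verified with standard kubectl:

```
# Check client and server versions; the server should report v1.15.x
kubectl version --short
```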
accessing logs is now working fine
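For example (the namespace and deployment names below are placeholders, not necessarily what this deployment uses):

```
# Tail the logs of the hub; "jhub" and "hub" are assumed names
kubectl logs -n jhub deploy/hub --tail=50
```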
Volume mounting does not work. The problem is that nodes and volumes are in different zones. For example, the nodes have:
failure-domain.beta.kubernetes.io/region=RegionOne
failure-domain.beta.kubernetes.io/zone=zone-r2
while the volumes require:
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/zone in [nova]
failure-domain.beta.kubernetes.io/region in [RegionOne]
The initial error is:
1 node(s) had volume node affinity conflict.
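To see the mismatch yourself (standard kubectl; the PV name is a placeholder):

```
# Show the zone/region labels on every node
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone,failure-domain.beta.kubernetes.io/region

# Show the node affinity recorded on a volume; <pv-name> is a placeholder
kubectl describe pv <pv-name>
```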
In the past I solved this by modifying the Kubernetes scheduler policy: https://github.com/zonca/magnum/pull/1
This time that makes the pod schedulable, but volume mounting still fails.
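For context, the scheduler-policy approach boils down to running kube-scheduler with a policy file that omits the volume-zone predicate. A generic sketch (the predicate list here is illustrative, not necessarily identical to the PR):

```
# Sketch of a kube-scheduler --policy-config-file that leaves out the
# NoVolumeZoneConflict predicate, so zone mismatches no longer block scheduling.
# See the PR above for the actual change; the file path is an assumption.
cat > /etc/kubernetes/scheduler-policy.json <<EOF
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "PodFitsHostPorts"},
    {"name": "MatchNodeSelector"},
    {"name": "NoDiskConflict"}
  ]
}
EOF
```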
Fixed this issue by manually editing the node metadata and modifying the zone to be "nova"; contacted Jetstream support about this.
Changing the node metadata to "nova" also fixes the volume mounting issue, even without applying the fix to the Kubernetes scheduler.
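The manual edit amounts to overwriting the zone label on each node, something like (node name is a placeholder):

```
# Overwrite the zone label so nodes match the "nova" zone the volumes expect;
# <node-name> is a placeholder, repeat for every node in the cluster
kubectl label node <node-name> failure-domain.beta.kubernetes.io/zone=nova --overwrite
```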
@pibion @thathayhaykid ok, I've done a lot of tests on a different cluster; everything seems to be working (better) with the new Kubernetes version.
So I will proceed to tear down the JupyterHub instance and redeploy it; I will preserve the data volume (unless something unexpected happens).
I will start the process next Thursday, May 21st; if anyone wants to delay it, let me know.
@zonca no need to delay, Thursday May 21st sounds good!
ok @pibion thanks
Another way to solve the zone label issue would be to disable it; we can do that for PVs (https://docs.openshift.com/container-platform/3.10/install_config/configuring_openstack.html), but how do we do it for nodes?
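The PV-side option from that page is a cloud-provider setting; a sketch of what it looks like, assuming the in-tree OpenStack provider reads its config from cloud.conf (the file path is an assumption):

```
# Sketch: tell the OpenStack cloud provider not to attach AZ labels to
# Cinder volumes; goes in the [BlockStorage] section of cloud.conf
cat >> /etc/kubernetes/cloud.conf <<EOF
[BlockStorage]
ignore-volume-az = yes
EOF
```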
Next I'll try the PV fix above together with the scheduler fix and hope that together they solve the issue. Otherwise I'll go back to looking into the Magnum templates.
I think this worked: https://github.com/zonca/magnum/pull/2/files; I will ask the Jetstream team to implement it.
ok, started working on this transition.
the ID of the 500 GB data volume, for later reference, is 7681ccc9-7bdb-470f-bd6d-43587e2c2328
/dev/sdf 492G 93G 399G 19% /cvmfs/data
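For re-attaching, a sketch of a PersistentVolume that points the new cluster at the existing Cinder volume (the PV name, filesystem, access mode, and reclaim policy are assumptions):

```
# Sketch: recreate a PV backed by the saved Cinder volume so the new
# deployment can mount the old data; metadata and fsType are assumptions
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cvmfs-data
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  cinder:
    volumeID: 7681ccc9-7bdb-470f-bd6d-43587e2c2328
    fsType: ext4
EOF
```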
ok, @pibion @ziqinghong, the transition is complete and persistent storage is active again. I updated the README at https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream so that it only contains documentation for new users.
I moved the information about the deployment to https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/blob/master/DEPLOY.md
I also created new documentation on how to redeploy, so that it is easier next time: https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/blob/master/REDEPLOY.md
If you find any problem, please open a new issue.
The Jetstream team is working on a newer Kubernetes environment; within 1 or 2 weeks they will notify me that it is available, and I will tear down the deployment and rebuild it on top of the new environment.