@pibion anything else we need to save?
@zonca nope, that's everything. People working on the XSEDE instance have been warned that they should treat the disk space as volatile. I'll send out my usual "hey does anyone need help putting their work into a git repository" message.
ok, I'll send out an advance warning when I am ready for it
still waiting for the new images
we got the new image:
ae275170-b48c-4104-8af1-4d271f33a43c | Fedora-Atomic-29
and also Magnum was updated to the Openstack Train release.
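For reference, registering a cluster template against the new image would look roughly like this (a sketch; the flavor, network, and template names are assumptions, and the actual template is managed on the Jetstream side):

```
# Sketch (names and sizes are assumptions): register a Magnum cluster
# template that uses the new Fedora-Atomic-29 image on the Train release.
openstack coe cluster template create k8s-fedora-atomic-29 \
  --image Fedora-Atomic-29 \
  --external-network public \
  --master-flavor m1.medium \
  --flavor m1.medium \
  --coe kubernetes \
  --network-driver flannel
```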
@pibion, is it ok if I tear down the deployment in the next few days? Depending on the complexity of the upgrade, it might take 2 or 3 weeks to put it back online. But once that is done, we will have a newer version of Kubernetes, which should work better.
I will try to save the data volume and attach it back to the new deployment.
This updates Kubernetes from 1.11 (released summer 2018) to 1.15 (released summer 2019). This is good because some plugins are dropping support for 1.11.
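Once the new cluster is up, the upgrade can be verified with standard kubectl:

```
# Check client and server versions; the server should report v1.15.x
kubectl version --short
```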
accessing logs is now working fine
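For example (the namespace and deployment names below are placeholders, not necessarily what this deployment uses):

```
# Tail the logs of the hub; "jhub" and "hub" are assumed names
kubectl logs -n jhub deploy/hub --tail=50
```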
Volume mounting does not work. The problem is that nodes and volumes are in different zones. For example, the nodes have:
failure-domain.beta.kubernetes.io/region=RegionOne
failure-domain.beta.kubernetes.io/zone=zone-r2
while the volumes require:
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/zone in [nova]
failure-domain.beta.kubernetes.io/region in [RegionOne]
The initial error is:
1 node(s) had volume node affinity conflict.
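To see the mismatch yourself (standard kubectl; the PV name is a placeholder):

```
# Show the zone/region labels on every node
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone,failure-domain.beta.kubernetes.io/region

# Show the node affinity recorded on a volume; <pv-name> is a placeholder
kubectl describe pv <pv-name>
```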
In the past I solved this by modifying the Kubernetes scheduler policy: https://github.com/zonca/magnum/pull/1
This time that makes the pod schedulable, but volume mounting still fails.
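For context, the scheduler-policy approach boils down to running kube-scheduler with a policy file that omits the volume-zone predicate. A generic sketch (the predicate list here is illustrative, not necessarily identical to the PR):

```
# Sketch of a kube-scheduler --policy-config-file that leaves out the
# NoVolumeZoneConflict predicate, so zone mismatches no longer block scheduling.
# See the PR above for the actual change; the file path is an assumption.
cat > /etc/kubernetes/scheduler-policy.json <<EOF
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "PodFitsHostPorts"},
    {"name": "MatchNodeSelector"},
    {"name": "NoDiskConflict"}
  ]
}
EOF
```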
Fixed this issue by manually editing the node metadata and modifying the zone to be "nova"; contacted Jetstream support about this.
Changing the node metadata to "nova" also fixes the volume mounting issue, even without applying the fix to the Kubernetes scheduler.
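The manual edit amounts to overwriting the zone label on each node, something like (node name is a placeholder):

```
# Overwrite the zone label so nodes match the "nova" zone the volumes expect;
# <node-name> is a placeholder, repeat for every node in the cluster
kubectl label node <node-name> failure-domain.beta.kubernetes.io/zone=nova --overwrite
```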
@pibion @thathayhaykid ok, I've done a lot of tests on a different cluster; everything seems to be working (better) with the new Kubernetes version.
So I will proceed to tear down the JupyterHub instance and redeploy it; I will preserve the data volume (unless something unexpected happens).
I will start the process next Thursday, May 21st; if anyone wants to delay it, let me know.
@zonca no need to delay, Thursday May 21st sounds good!
ok @pibion thanks
Another way to solve the zone label issue would be to disable it; we can do that for PVs (https://docs.openshift.com/container-platform/3.10/install_config/configuring_openstack.html), but how do we do it for nodes?
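The PV-side option from that page is a cloud-provider setting; a sketch of what it looks like, assuming the in-tree OpenStack provider reads its config from cloud.conf (the file path is an assumption):

```
# Sketch: tell the OpenStack cloud provider not to attach AZ labels to
# Cinder volumes; goes in the [BlockStorage] section of cloud.conf
cat >> /etc/kubernetes/cloud.conf <<EOF
[BlockStorage]
ignore-volume-az = yes
EOF
```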
Next I'll try the PV fix above together with the scheduler fix and hope that together they solve the issue. Otherwise I'll go back to looking into the Magnum templates.
I think this worked: https://github.com/zonca/magnum/pull/2/files; I will ask the Jetstream team to implement it.
ok, started working on this transition.
the ID of the 500 GB data volume, for later reference, is 7681ccc9-7bdb-470f-bd6d-43587e2c2328
/dev/sdf 492G 93G 399G 19% /cvmfs/data
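For re-attaching, a sketch of a PersistentVolume that points the new cluster at the existing Cinder volume (the PV name, filesystem, access mode, and reclaim policy are assumptions):

```
# Sketch: recreate a PV backed by the saved Cinder volume so the new
# deployment can mount the old data; metadata and fsType are assumptions
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cvmfs-data
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  cinder:
    volumeID: 7681ccc9-7bdb-470f-bd6d-43587e2c2328
    fsType: ext4
EOF
```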
ok, @pibion @ziqinghong, the transition is complete and persistent storage is active again. I updated the README at https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream so that it only contains documentation for new users.
I moved the information about the deployment to https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/blob/master/DEPLOY.md
I also created new documentation on how to redeploy, so that it is easier next time: https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/blob/master/REDEPLOY.md
If you find any problem, please open a new issue.
The Jetstream team is working on a newer Kubernetes environment; within 1 or 2 weeks they will notify me that it is available, and I will tear down the deployment and rebuild it on top of the new environment.