StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

One-time Update: Migrate users to long lived tag `V1` and patch ImagePullPolicy #1217

Closed Jose-Matsuda closed 2 years ago

Jose-Matsuda commented 2 years ago

Very similar to https://github.com/StatCan/daaas/issues/976; we would just need to add remote desktop to the list of images to iterate through. We also want to patch the imagePullPolicy at the same time, if possible.

Reasoning

This is to facilitate a weekly cronjob that restarts user workloads if and only if their copy of "v1" is out of date, meaning their image digest no longer matches the most recent one (for example, after a push to the master branch of aaw-kubeflow-containers).
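A minimal sketch of the digest check such a cronjob could make, assuming the running image is reported in the usual `name@sha256:...` form. `digest_of` and `needs_restart` are hypothetical helper names for illustration, not part of any existing script.

```shell
# Hypothetical helper: print the digest part of an image reference.
# "registry/name@sha256:abc" -> "sha256:abc"; a bare digest is printed unchanged.
digest_of() {
  case "$1" in
    *@*) printf '%s\n' "${1##*@}" ;;
    *)   printf '%s\n' "$1" ;;
  esac
}

# Hypothetical helper: succeed (exit 0) when the running digest differs
# from the latest digest, i.e. the workload is on an older "v1".
needs_restart() {
  [ "$(digest_of "$1")" != "$(digest_of "$2")" ]
}
```

For example, `needs_restart "$running_image_id" "$latest_digest" && echo "restart"` would only print for workloads whose digest is stale. How the two values are fetched (pod status, registry query) is left open here.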

Concerns

As in https://github.com/StatCan/daaas/issues/983, the script itself will run just fine, but with so many pods being rescheduled at once we may bring gatekeeper to its knees again.

Jose-Matsuda commented 2 years ago

Ideally I can just add the imagePullPolicy change to this line:

 kubectl patch Notebook $notebookname --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"k8scc01covidacr.azurecr.io/'"$i"':c5b7982c"}]' --namespace $namespace

Yup, running this is fine. Without the --dry-run flag it actually did the update, and the statefulset updated accordingly. Note that I used Never here only so that the change would be visible.

kubectl patch Notebook patchtest --dry-run=client --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"k8scc01covidacr.azurecr.io/jupyterlab-cpu:v1"},
{"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value":"Never"}
]' --namespace jose-matsuda
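
The two-operation patch above could be built by a small helper so the image and the pull policy are passed in as parameters. This is only a sketch: `build_patch` is a hypothetical name, and the policy value to use in production is not stated in this thread (Never was only a test value).

```shell
# Hypothetical helper: print a JSON Patch that replaces both the image and
# the imagePullPolicy of the first container in a Notebook spec.
# $1 = full image reference, $2 = imagePullPolicy value
build_patch() {
  image="$1"
  policy="$2"
  printf '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "%s"}, {"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value": "%s"}]\n' "$image" "$policy"
}
```

It would be used as `kubectl patch Notebook patchtest --dry-run=client --type='json' -p="$(build_patch k8scc01covidacr.azurecr.io/jupyterlab-cpu:v1 Never)" --namespace jose-matsuda`. One thing to watch: a JSON Patch `replace` requires the target path to exist, so if some Notebook has no imagePullPolicy field set, that operation may need to be `add` instead.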
Jose-Matsuda commented 2 years ago

Resolution Script

https://gist.github.com/Jose-Matsuda/61ec40d175fadfff045be5be481dc7b9

Ran successfully on dev. Note that you need to remove the --dry-run flag if you want it to actually apply the patches. After every 20 workload patches there is a 5 second sleep, though it could be longer (honestly, it could be 10 seconds), since a pod takes longer than that to come back up.
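
For readers who do not want to open the gist, the pause-every-N pattern looks roughly like this. It is a sketch, not the gist's actual code: `patch_in_batches`, `RUN` and `PAUSE` are hypothetical names, and the default `RUN` only echoes so nothing touches a cluster.

```shell
# Command run per notebook and command used to pause; both are overridable.
# The default RUN only prints what would be executed.
RUN="${RUN:-echo kubectl patch Notebook}"
PAUSE="${PAUSE:-sleep}"

# Hypothetical helper: process each name, pausing after every full batch.
# $1 = batch size, $2 = pause in seconds, remaining args = notebook names
patch_in_batches() {
  batch_size="$1"
  pause_seconds="$2"
  shift 2
  count=0
  for name in "$@"; do
    $RUN "$name"
    count=$((count + 1))
    # Pause once every batch_size items to give rescheduled pods some room.
    if [ $((count % batch_size)) -eq 0 ]; then
      $PAUSE "$pause_seconds"
    fi
  done
}
```

Calling `patch_in_batches 20 5 nb1 nb2 ...` mirrors the 20 patches / 5 seconds setting described above; raising the second argument is the only change needed to lengthen the pause.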

EDIT

Note that unfortunately I forgot to tell Souheil to also change the default imagePullPolicy in the spawner config, so I had to account for that in this PR: https://github.com/StatCan/aaw-kubeflow-manifests/pull/189/files#diff-364a9e35e63c4516d98101fc647536dd0e712382eabd4c41913e1806f571b731R51

So you will likely need to change the script to select every image regardless of its tag, so that the imagePullPolicy is changed on all of them as well.
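
If the script stops filtering on tag, each image reference it finds has to be rewritten to the `v1` tag whatever it currently carries. A minimal sketch of that rewrite, assuming references of the form `registry/name[:tag]` or `registry/name@digest`; `retag_v1` is a hypothetical name.

```shell
# Hypothetical helper: rewrite any image reference to use the v1 tag.
retag_v1() {
  image="$1"
  image="${image%%@*}"        # drop an @sha256:... digest suffix, if any
  name="${image##*/}"         # last path component, e.g. jupyterlab-cpu:c5b7982c
  prefix="${image%"$name"}"   # registry/path prefix, including the trailing slash
  printf '%s%s:v1\n' "$prefix" "${name%%:*}"
}
```

Only the last path component is searched for a tag, so a registry written with a port (for example `localhost:5000/rstudio`) is not mistaken for a tagged image.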

Jose-Matsuda commented 2 years ago

CC @chuckbelisle to action when he finds time. You can copy and paste the gist here, but you will need to take out the --dry-run=client on line 29 once you are sure you want to execute it.

There is also the sleep 7 on line 25. It runs after every 20 kubectl restarts, and we have around 400 notebooks that need to be restarted, which comes to around 140 seconds of built-in delay overall (which honestly may not be enough).
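
The delay estimate is just batches times pause length, so it is easy to recompute for other settings. A small sketch (`total_delay` is a hypothetical name; it assumes one pause per full batch of notebooks):

```shell
# Hypothetical helper: total built-in delay in seconds.
# $1 = number of notebooks, $2 = batch size, $3 = pause per batch in seconds
total_delay() {
  notebooks="$1"
  batch_size="$2"
  pause_seconds="$3"
  # Integer division: only full batches trigger a pause.
  echo $(( notebooks / batch_size * pause_seconds ))
}

total_delay 400 20 7    # prints 140
total_delay 400 20 15   # prints 300
```

This only counts the sleeps themselves; the time each kubectl call takes comes on top of it.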

chuckbelisle commented 2 years ago

Changed sleep time to 15 seconds.
Executed patch script July 1st 2022 @ 10:15pm.