StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Update Manifest for Prod #1793

Closed wg102 closed 1 year ago

wg102 commented 1 year ago

Based on the ticket for dev (https://github.com/StatCan/aaw/issues/1729) and the follow-up issues ticket (https://github.com/StatCan/aaw/issues/1752).

Common

| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| kubeflow-namespace | [common/kubeflow-namespace][local-kubeflow-namespace] | v1.7.0 | [#343][kubeflow-namespace] | | | |
| kubeflow-roles | [common/kubeflow-roles][local-kubeflow-roles] | v1.7.0 | [#345][kubeflow-roles] | | | |
| oidc-authservice | [common/oidc-authservice][local-oidc-authservice] | v1.7.0 | [#347][oidc-authservice] | | | |
| kubeflow-knative | [common/knative][local-knative] | v1.7.0 | [#349][knative] | | | |

I think anything that has no direct folder equivalent is in the knative folder.

Apps

| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| admission-webhook | [apps/admission-webhook][local-admission-webhook] | v1.7.0 | [#339][admission-webhook] | | | |
| central-dashboard | [apps/centraldashboard][local-centraldashboard] | v1.7.0 | [#340][central-dashboard] | | | |
| jupyter-web-apps | [apps/jupyter-web-app][local-jupyter-web-app] | v1.7.0 | [#351][jwa] | | | |
| katib | [apps/katib][local-katib] | v1.7.0 | [#353][katib] | | | |
| notebook-controller | [apps/notebook-controller][local-notebook-controller] | v1.7.0 | [#355][notebook-controller] | | | |
| profiles | [apps/profiles][local-profiles] | v1.7.0 | [#357][profiles] | | | |
| training-operator | [apps/training-operator][local-training-operator] | v1.7.0 | [#359][training-operator] | | | |

Contrib

| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off |
| --- | --- | --- | --- | --- | --- |
| kfserve | [contrib/kfserve][local-kfserve] | v1.7.0 | [#361][contrib-upgrade] | | |
| spark-operator | [contrib/spark-operator][local-spark-operator] | v1.7.0 | [#361][contrib-upgrade] | | |
| seldon | [contrib/seldon][local-seldon] | v1.7.0 | [#361][contrib-upgrade] | | |

Other issues that had to be fixed

NAMESPACES=$(kubectl get namespaces --no-headers | awk '{print $1}')

for ns in $NAMESPACES; do
  kubectl delete poddefaults protected-b -n "$ns"
done
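A slightly more defensive variant of the cleanup above might look like the sketch below. It assumes the same `protected-b` PodDefault is the target; the function name and its arguments are illustrative and not part of the original fix. `--ignore-not-found` keeps namespaces that never had the PodDefault from producing errors, and passing `client` as the second argument previews the deletions without applying them.

```shell
# Sketch: delete one PodDefault from every namespace.
#   $1 = PodDefault name (for example protected-b)
#   $2 = dry-run mode: none (default), client, or server
delete_poddefault_everywhere() {
  for ns in $(kubectl get namespaces --no-headers | awk '{print $1}'); do
    # --ignore-not-found: skip namespaces without this PodDefault
    kubectl delete poddefaults "$1" -n "$ns" \
      --ignore-not-found --dry-run="${2:-none}"
  done
}
```

For example, `delete_poddefault_everywhere protected-b client` lists what would be removed, and `delete_poddefault_everywhere protected-b` performs the deletion.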



[Part of epic 1.7]: https://github.com/StatCan/aaw/issues/1632
[Previous version]: https://github.com/StatCan/aaw/issues/1336

[aaw-kubeflow-manifests]: https://github.com/statcan/aaw-kubeflow-manifests
[kubeflow-manifests]: https://github.com/kubeflow/manifests/compare/v1.3.1...v1.4.1
[local-centraldashboard]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/centraldashboard
[local-kubeflow-namespace]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/common/kubeflow-namespace
[local-kubeflow-roles]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/common/kubeflow-roles
[local-oidc-authservice]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/common/oidc-authservice
[local-jupyter-web-app]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/jupyter-web-app
[local-katib]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/katib
[local-kfserving]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/kfserving
[local-metacontroller]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/contrib/metacontroller
[local-pipeline]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/pipeline
[local-knative]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/common/knative
[local-profiles]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/profiles
[local-seldon]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/contrib/seldon
[local-notebook-controller]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/notebook-controller
[local-pytorch-job]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/application/pytorch-job
[local-mpi-job]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/application/mpi-job
[local-spark-operator]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/contrib/spark
[local-tf-training]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/application/tf-training
[local-mxnet-job]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/application/mxnet-job
[local-admission-webhook]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/admission-webhook/base
[local-training-operator]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/apps/training-operator
[local-kfserve]: https://github.com/StatCan/aaw-kubeflow-manifests/tree/aaw-dev-cc-00/kustomize/contrib/kserve/

[admission-webhook]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/339
[central-dashboard]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/340
[kubeflow-namespace]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/343
[kubeflow-roles]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/345
[oidc-authservice]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/347
[knative]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/349
[jwa]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/351
[katib]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/353
[notebook-controller]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/355
[profiles]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/357
[training-operator]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/359
[contrib-upgrade]: https://github.com/StatCan/aaw-kubeflow-manifests/issues/361

wg102 commented 1 year ago

While trying to update, we hit the argocd-prod 401 error again. To work around it, Souheil restarted the application controller StatefulSet, which seems to have helped for now.
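For reference, a restart like this could be done from the command line roughly as follows. This is only a sketch: the namespace `argocd` and the StatefulSet name `argocd-application-controller` are the upstream Argo CD defaults and are assumptions here, since the thread does not say what they are called on argocd-prod.

```shell
# Sketch: restart the Argo CD application controller and wait for it.
#   $1 = namespace (assumed default: argocd)
#   $2 = StatefulSet name (assumed default: argocd-application-controller)
restart_app_controller() {
  ns="${1:-argocd}"
  sts="${2:-argocd-application-controller}"
  # Trigger a rolling restart, then block until the pods are ready again
  kubectl rollout restart statefulset "$sts" -n "$ns" &&
    kubectl rollout status statefulset "$sts" -n "$ns" --timeout=5m
}
```

Calling `restart_app_controller` with no arguments uses the assumed defaults; pass the real namespace and name if they differ.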

After merging the changes in https://github.com/StatCan/aaw-kubeflow-manifests, we checked whether the Kubeflow application had synced. That did not work directly, because the Kubeflow application refers to argocd, where the jsonnet in turn refers to the Kubeflow manifests. So instead we had to update each application individually. By forcing a sync on every out-of-sync app, we managed to get most of them up and running.
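Forcing a sync on every out-of-sync app can also be scripted. The sketch below is one possible approach, not necessarily how it was done here (the syncs may well have been triggered from the Argo CD UI). It assumes the Argo CD `Application` resources live in an `argocd` namespace and that the `argocd` CLI is already logged in; both function names are illustrative.

```shell
# Sketch: print the names of Argo CD Applications whose sync status is OutOfSync.
#   $1 = namespace holding the Application resources (assumed default: argocd)
list_out_of_sync_apps() {
  kubectl get applications.argoproj.io -n "${1:-argocd}" --no-headers \
    -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status |
    awk '$2 == "OutOfSync" {print $1}'
}

# Sketch: force-sync each out-of-sync Application, reporting failures
# on stderr without stopping the loop.
sync_out_of_sync_apps() {
  for app in $(list_out_of_sync_apps "$1"); do
    argocd app sync "$app" --force || echo "sync failed: $app" >&2
  done
}
```

Running `list_out_of_sync_apps` first gives a chance to review the list before calling `sync_out_of_sync_apps`.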

Knative sync failed: we deleted the resource, then forced a sync to recreate it.
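As a later comment in this thread clarifies, the resources in question were the Knative validating and mutating webhooks. A rough sketch of that delete-then-resync step follows. The webhook configuration names are the upstream Knative Serving defaults and the Argo CD application name `knative` is hypothetical; the thread does not state which names were actually used.

```shell
# Sketch: delete the Knative Serving webhook configurations, then let
# Argo CD recreate them with a forced sync.
#   $1 = Argo CD application name (hypothetical default: knative)
recreate_knative_webhooks() {
  app="${1:-knative}"
  # Names below are upstream Knative Serving defaults (an assumption here)
  kubectl delete validatingwebhookconfiguration \
    validation.webhook.serving.knative.dev \
    config.webhook.serving.knative.dev --ignore-not-found &&
    kubectl delete mutatingwebhookconfiguration \
      webhook.serving.knative.dev --ignore-not-found &&
    argocd app sync "$app" --force
}
```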

Kserve sync failed: we manually synced the CRD first, then the manifest, so it stopped complaining about the CRD not existing.
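The CRD-first ordering can be expressed with the `--resource GROUP:KIND:NAME` filter of `argocd app sync`. The sketch below assumes an Argo CD application named `kserve` and uses `inferenceservices.serving.kserve.io` as an example CRD; which CRDs were actually synced first is not stated in this thread, and the function name is illustrative.

```shell
# Sketch: sync the listed CRDs of an Argo CD application first,
# then sync the rest of the application.
#   $1 = Argo CD application name; remaining arguments = CRD names
sync_crds_then_app() {
  app="$1"
  shift
  for crd in "$@"; do
    # Sync only this CRD; stop if it fails so the full sync is not attempted
    argocd app sync "$app" \
      --resource "apiextensions.k8s.io:CustomResourceDefinition:$crd" || return 1
  done
  # With the CRDs in place, sync everything else
  argocd app sync "$app"
}
```

For example: `sync_crds_then_app kserve inferenceservices.serving.kserve.io`.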

Souheil-Yazji commented 1 year ago

Knative sync was resolved by recreating the validating/mutating webhook resources.

Kserve was resolved by updating the CRDs first, then syncing the other resources.

Edit: The central dashboard image was not updated correctly in aaw-kubeflow-manifests, in either dev or prod. For prod, a good strategy would be to update the image to the last PR image successfully built in the last sprint. This prevents work that has not been tested or used from possibly trickling down to prod.