canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Provide bundle definitions for airgapped installations #680

Closed kimwnasptd closed 12 months ago

kimwnasptd commented 1 year ago

For users to deploy CKF in an airgapped environment (https://github.com/canonical/bundle-kubeflow/issues/678), they'll need to be able to:

  1. Define all the --resources of a Charm (https://discourse.charmhub.io/t/possible-to-set-default-for-oci-image-resource/2525/2)
  2. Run juju config commands to change which images the charms use (e.g. the default notebook images, seldon/kserve servers, etc.)

The way proposed for juju to track this configuration metadata is a bundle (see the discourse link above).

So we should create airgapped-specific bundles that users can take as a template, update with their own images, and apply to get CKF working in an airgapped environment. Let's start with an example of this for the latest/edge bundle.
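The "update their images" step mostly amounts to pointing each upstream image reference at the private registry. A minimal Python sketch of that rewrite, assuming a registry at 172.17.0.2:5000 (the address used in the examples in this thread). `retarget_image` is a hypothetical helper for illustration, not part of juju or CKF:

```python
def retarget_image(image: str, registry: str) -> str:
    """Prefix an image reference with a private registry host.

    Assumption: a first path component that contains "." or ":" (or is
    "localhost") is treated as an upstream registry host and dropped.
    Everything else, including the tag, is kept verbatim.
    """
    first, sep, rest = image.partition("/")
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_host else image
    return f"{registry}/{path}"


# Illustrative inputs only; the registry address is an assumption.
print(retarget_image("docker.io/dexidp/dex:v2.31.2", "172.17.0.2:5000"))
print(retarget_image("argoproj/argoexec:v3.3.8", "172.17.0.2:5000"))
```

The rewritten references are what would go into the resources and options of the bundle template.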

kimwnasptd commented 1 year ago

Most of the Charm configurations should be straightforward to run in an airgapped environment. We see these 3 cases:

  1. Sidecar pattern charms. Here we'll need to configure the resources and, if necessary, the config of the charm
  2. PodSpec charms. These are the same as sidecars in this case, since we only need to configure resources and config
  3. Charms that use other Operators for installing software

The most challenging of the above, and the one we most need to test, is number 3. In our case we have the Istio Charm (https://github.com/canonical/istio-operators/pull/316) and the Knative Charm (https://github.com/canonical/knative-operators/issues/140), both of which use operators.

These 2 should be the most involved to test, and the ones we expect the most friction with.

kimwnasptd commented 1 year ago

I managed to deploy the https://github.com/canonical/dex-auth-operator/ charm with the following command

juju deploy ./dex-auth_6902b52.charm \
    --resource oci-image=172.17.0.2:5000/dexidp/dex:v2.31.2

orfeas-k commented 1 year ago

Comment to gather commands for deploying all applications of bundle latest/edge

Referencing its bundle.yaml, here are the bundle's applications. Edit this comment as you move forward with deploying individual charms in an airgapped environment.

juju deploy ./charm --resource oci-image=... #example command as placeholder
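
One way to gather these commands is to generate them from per-application data shaped like a bundle's applications section. The sketch below is an assumption about how that could be scripted. `deploy_command` is a hypothetical helper, and it relies only on the --resource, --config and --trust flags of juju deploy:

```python
import shlex


def deploy_command(name: str, app: dict) -> str:
    """Build a `juju deploy` command line for one application entry.

    `app` mirrors a bundle application: "charm", optional "trust",
    optional "resources" and "options" mappings.
    """
    cmd = ["juju", "deploy", app["charm"], name]
    if app.get("trust"):
        cmd.append("--trust")
    for resource, image in app.get("resources", {}).items():
        cmd += ["--resource", f"{resource}={image}"]
    for key, value in app.get("options", {}).items():
        cmd += ["--config", f"{key}={value}"]
    # shlex.join quotes any argument that needs it (e.g. JSON values).
    return shlex.join(cmd)


dex = {
    "charm": "./dex-auth_6902b52.charm",
    "resources": {"oci-image": "172.17.0.2:5000/dexidp/dex:v2.31.2"},
}
print(deploy_command("dex-auth", dex))
```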

kimwnasptd commented 1 year ago

I ran the latest Dex charm to verify https://github.com/canonical/dex-auth-operator/pull/148. Indeed, the Charm worked as expected! I also ran

juju config dex-auth static-username=admin
juju config dex-auth static-password=admin

to make sure the code used bcrypt.

For the sake of science I also deployed the older charm, without that PR, and it was indeed failing:

Error logs

```
2023-08-28T15:42:24.132Z [container-agent] 2023-08-28 15:42:24 INFO juju.worker.uniter resolver.go:155 awaiting error resolution for "install" hook
2023-08-28T15:42:24.484Z [container-agent] 2023-08-28 15:42:24 WARNING install
2023-08-28T15:42:24.484Z [container-agent] 2023-08-28 15:42:24 WARNING install WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
2023-08-28T15:42:24.484Z [container-agent] 2023-08-28 15:42:24 WARNING install
2023-08-28T15:42:25.451Z [pebble] Check "readiness" failure 96 (threshold 3): received non-20x status code 418
2023-08-28T15:42:32.343Z [container-agent] 2023-08-28 15:42:32 WARNING install E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates Release' does not have a Release file.
2023-08-28T15:42:32.343Z [container-agent] 2023-08-28 15:42:32 WARNING install E: The repository 'http://archive.ubuntu.com/ubuntu focal-backports Release' does not have a Release file.
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install Traceback (most recent call last):
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install File "./src/charm.py", line 23, in <module>
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install import bcrypt
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install ModuleNotFoundError: No module named 'bcrypt'
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install During handling of the above exception, another exception occurred:
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install Traceback (most recent call last):
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install File "./src/charm.py", line 25, in <module>
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install subprocess.check_call(["apt", "update"])
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install raise CalledProcessError(retcode, cmd)
2023-08-28T15:42:32.344Z [container-agent] 2023-08-28 15:42:32 WARNING install subprocess.CalledProcessError: Command '['apt', 'update']' returned non-zero exit status 100.
2023-08-28T15:42:32.586Z [container-agent] 2023-08-28 15:42:32 ERROR juju.worker.uniter.operation runhook.go:153 hook "install" (via hook dispatching script: dispatch) failed: exit status 1
```

orfeas-k commented 1 year ago

We bumped into https://github.com/canonical/knative-operators/issues/147, so we will configure knative-serving to use 1.8.0 (knative-eventing already uses 1.8.0).

NohaIhab commented 1 year ago

To be able to have the bundle definition ready, we need the following:

  1. All charms are published with all changes required to make images configurable
  2. Be able to get the list of images used by charms in latest/edge
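
For the second point, once a bundle file exists, a rough way to list the images it references is to scan its text for references under the private registry. This is only a sketch under that assumption. `list_images` is a hypothetical helper, and the character class in the pattern is a simplification of the full image reference grammar:

```python
import re
from typing import List


def list_images(bundle_text: str, registry: str) -> List[str]:
    """Return the sorted, de-duplicated image references found in the text.

    Matches `registry` followed by "/" and a run of characters that are
    common in image names, tags and digests. Quotes and commas end a match,
    so references embedded in JSON-like option strings are picked up too.
    """
    pattern = re.compile(re.escape(registry) + r"/[\w./:@-]+")
    return sorted(set(pattern.findall(bundle_text)))


sample = """
resources:
  oci-image: 172.17.0.2:5000/dexidp/dex:v2.36.0
options:
  rstudio-images: "['172.17.0.2:5000/kubeflownotebookswg/rstudio-tidyverse:v1.7.0']"
"""
for image in list_images(sample, "172.17.0.2:5000"):
    print(image)
```

Note that this also matches non-image values that share the prefix, such as a registry hub setting, so the output may need a manual pass.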

Pending for charms:

NohaIhab commented 1 year ago

Bundle definition for air-gapped installation

bundle: kubernetes
name: kubeflow
applications:
  admission-webhook:
    charm: ./admission-webhook_98aac65.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/kubeflownotebookswg/poddefaults-webhook:v1.7.0
  argo-controller:
    charm: ./argo-controller_b59eaec.charm
    scale: 1
    resources:
      oci-image: 172.17.0.2:5000/argoproj/workflow-controller:v3.3.8
    options:
      executor-image: 172.17.0.2:5000/argoproj/argoexec:v3.3.8
  argo-server:
    charm: ./argo-server_6d22972.charm
    scale: 1
    resources:
      oci-image: 172.17.0.2:5000/argoproj/argocli:v3.3.8
  dex-auth:
    charm: ./dex-auth_f0211e2.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/dexidp/dex:v2.36.0
  istio-ingressgateway:
    charm: ./istio-gateway_ubuntu-20.04-amd64.charm
    scale: 1
    trust: true
    options:
      kind: ingress
      proxy-image: 172.17.0.2:5000/istio/proxyv2:1.17.3
  istio-pilot:
    charm: ./istio-pilot_ubuntu-20.04-amd64.charm
    scale: 1
    trust: true
    options:
      default-gateway: kubeflow-gateway
      image-configuration: '{"pilot-image": pilot, "global-tag": 1.17.3, "global-hub": 172.17.0.2:5000/istio, "global-proxy-image": proxyv2, "global-proxy-init-image": proxyv2, "grpc-bootstrap-init": busybox:1.28}'
  jupyter-controller:
    charm: ./jupyter-controller_4b8d674.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/kubeflownotebookswg/notebook-controller:v1.7.0
  jupyter-ui:
    charm: ./jupyter-ui_0af4218.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/kubeflownotebookswg/jupyter-web-app:v1.7.0
    options:
      jupyter-images: "['172.17.0.2:5000/kubeflownotebookswg/jupyter-scipy:v1.7.0','172.17.0.2:5000/kubeflownotebookswg/jupyter-pytorch-full:v1.7.0','172.17.0.2:5000/kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.7.0','172.17.0.2:5000/kubeflownotebookswg/jupyter-tensorflow-full:v1.7.0','172.17.0.2:5000/kubeflownotebookswg/jupyter-tensorflow-cuda-full:v1.7.0']"
      rstudio-images: "['172.17.0.2:5000/kubeflownotebookswg/rstudio-tidyverse:v1.7.0']"
      vscode-images: "['172.17.0.2:5000/kubeflownotebookswg/codeserver-python:v1.7.0']"
  katib-controller:
    charm: ./katib-controller_afbe7fb.charm
    scale: 1
    resources:
      oci-image: 172.17.0.2:5000/kubeflowkatib/katib-controller:v0.16.0-rc.1
    options:
      custom_images: '{"default_trial_template": "172.17.0.2:5000/kubeflowkatib/mxnet-mnist:v0.16.0-rc.1","early_stopping__medianstop": "172.17.0.2:5000/kubeflowkatib/earlystopping-medianstop:v0.16.0-rc.1","enas_cpu_template": "172.17.0.2:5000/kubeflowkatib/enas-cnn-cifar10-cpu:v0.16.0-rc.1","metrics_collector_sidecar__stdout": "172.17.0.2:5000/kubeflowkatib/file-metrics-collector:v0.16.0-rc.1","metrics_collector_sidecar__file": "172.17.0.2:5000/kubeflowkatib/file-metrics-collector:v0.16.0-rc.1","metrics_collector_sidecar__tensorflow_event": "172.17.0.2:5000/kubeflowkatib/tfevent-metrics-collector:v0.16.0-rc.1","pytorch_job_template__master": "172.17.0.2:5000/kubeflowkatib/pytorch-mnist-cpu:v0.16.0-rc.1","pytorch_job_template__worker": "172.17.0.2:5000/kubeflowkatib/pytorch-mnist-cpu:v0.16.0-rc.1","suggestion__random": "172.17.0.2:5000/kubeflowkatib/suggestion-hyperopt:v0.16.0-rc.1","suggestion__tpe": "172.17.0.2:5000/kubeflowkatib/suggestion-hyperopt:v0.16.0-rc.1","suggestion__grid": "172.17.0.2:5000/kubeflowkatib/suggestion-optuna:v0.16.0-rc.1","suggestion__hyperband": "172.17.0.2:5000/kubeflowkatib/suggestion-hyperband:v0.16.0-rc.1","suggestion__bayesianoptimization": "172.17.0.2:5000/kubeflowkatib/suggestion-skopt:v0.16.0-rc.1","suggestion__cmaes": "172.17.0.2:5000/kubeflowkatib/suggestion-goptuna:v0.16.0-rc.1","suggestion__sobol": "172.17.0.2:5000/kubeflowkatib/suggestion-goptuna:v0.16.0-rc.1","suggestion__multivariate_tpe": "172.17.0.2:5000/kubeflowkatib/suggestion-optuna:v0.16.0-rc.1","suggestion__enas": "172.17.0.2:5000/kubeflowkatib/suggestion-enas:v0.16.0-rc.1","suggestion__darts": "172.17.0.2:5000/kubeflowkatib/suggestion-darts:v0.16.0-rc.1","suggestion__pbt": "172.17.0.2:5000/kubeflowkatib/suggestion-pbt:v0.16.0-rc.1", }'
  katib-db:
    charm: ./mysql-k8s_10afaca.charm
    scale: 1
    trust: true
    constraints: mem=2G
    resources:
      mysql-image: 172.17.0.2:5000/canonical/charmed-mysql:753477ce39712221f008955b746fcf01a215785a215fe3de56f525380d14ad97
  katib-db-manager:
    charm: ./katib-db-manager_cb61fe0.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/kubeflowkatib/katib-db-manager:v0.16.0-rc.1
  katib-ui:
    charm: ./katib-ui_d317886.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/kubeflowkatib/katib-ui:v0.16.0-rc.1
  kfp-api:
    charm: ./kfp-api_5708923.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/charmedkubeflow/api-server:2.0.0-alpha.7_20.04_1
  kfp-db:
    charm: ./mysql-k8s_10afaca.charm
    scale: 1
    trust: true
    constraints: mem=2G
    resources:
      mysql-image: 172.17.0.2:5000/canonical/charmed-mysql:753477ce39712221f008955b746fcf01a215785a215fe3de56f525380d14ad97
  kfp-persistence:
    charm: ./kfp-persistence_a7d1ba7.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/charmedkubeflow/persistenceagent:2.0.0-alpha.7_22.04_1
  kfp-profile-controller:
    charm: ./kfp-profile-controller_527ffbc.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/python:3.7
  kfp-schedwf:
    charm: ./kfp-schedwf_31d7d73.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/charmedkubeflow/scheduledworkflow:2.0.0-alpha.7_22.04_1
  kfp-ui:
    charm: ./kfp-ui_dd3a136.charm
    scale: 1
    trust: true
    resources:
      ml-pipeline-ui: 172.17.0.2:5000/ml-pipeline/frontend:2.0.0-alpha.7
  kfp-viewer:
    charm: ./kfp-viewer_17bb76d.charm
    scale: 1
    trust: true
    resources:
      kfp-viewer-image: 172.17.0.2:5000/charmedkubeflow/viewer-crd-controller:2.0.0-alpha.7_22.04_1
  kfp-viz:
    charm: ./kfp-viz_874d439.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/ml-pipeline/visualization-server:2.0.0-alpha.7
  knative-eventing:
    charm: ./knative-eventing_d160a86.charm
    scale: 1
    trust: true
    options:
      namespace: knative-eventing
      custom_images: '{ "eventing-webhook/eventing-webhook": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/webhook:c9c582f530155d22c01b43957ae0dba549b1cc903f77ec6cc1acb9ae9085be62", "eventing-controller/eventing-controller": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/controller:cbc452f35842cc8a78240642adc1ebb11a4c4d7c143c8277edb49012f6cfc5d3", "mt-broker-filter/filter": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/broker/filter:33ea8a657b974d7bf3d94c0b601a4fc287c1fb33430b3dda028a1a189e3d9526", "mt-broker-ingress/ingress": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/broker/ingress:f4a9dfce9eec5272c90a19dbdf791fffc98bc5a6649ee85cb8a29bd5145635b1", "mt-broker-controller/mt-broker-controller": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/mtchannel_broker:c5d3664780b394f6d3e546eb94c972965fbd9357da5e442c66455db7ca94124c", "imc-controller/controller": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/in_memory/channel_controller:3ced549336c7ccf3bb2adf23a558eb55bd1aec7be17837062d21c749dfce8ce5", "imc-dispatcher/dispatcher": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/in_memory/channel_dispatcher:e17bbdf951868359424cd0a0465da8ef44c66ba7111292444ce555c83e280f1a", "pingsource-mt-adapter/dispatcher": "172.17.0.2:5000/knative-releases/knative.dev/eventing/cmd/mtping:bc200a12cbad35bea51aabe800a365f28a5bd1dd65b3934b3db2e7e22df37efd", "migrate": "172.17.0.2:5000/knative-releases/knative.dev/pkg/apiextensions/storageversion/cmd/migrate:59431cf8337532edcd9a4bcd030591866cc867f13bee875d81757c960a53668d", }'
  knative-operator:
    charm: ./knative-operator_fa7a1d1.charm
    scale: 1
    trust: true
    resources:
      knative-operator-image: 172.17.0.2:5000/knative-releases/knative.dev/operator/cmd/operator:v1.10.3
      knative-operator-webhook-image: 172.17.0.2:5000/knative-releases/knative.dev/operator/cmd/webhook:v1.10.3
  knative-serving:
    charm: ./knative-serving_a506810.charm
    scale: 1
    trust: true
    options:
      namespace: knative-serving
      istio.gateway.namespace: kubeflow
      istio.gateway.name: kubeflow-gateway
      version: 1.8.0
      custom_images: '{ "activator": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/activator:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1", "autoscaler": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/autoscaler:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f", "controller": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/controller:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95", "webhook": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/webhook:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae", "autoscaler-hpa": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:7003443f0faabbaca12249aa16b73fa171bddf350abd826dd93b06f5080a146d", "net-istio-controller/controller": "172.17.0.2:5000/knative-releases/knative.dev/net-istio/cmd/controller:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918", "net-istio-webhook/webhook": "172.17.0.2:5000/knative-releases/knative.dev/net-istio/cmd/webhook:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d", "domain-mapping": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/domain-mapping:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96", "domainmapping-webhook": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470", "migrate": "172.17.0.2:5000/knative-releases/knative.dev/pkg/apiextensions/storageversion/cmd/migrate:d0095787bc1687e2d8180b36a66997733a52f8c49c3e7751f067813e3fb54b66", "queue-proxy": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c", }'
  kserve-controller:
    charm: ./kserve-controller_4bd19bf.charm
    scale: 1
    trust: true
    options:
      deployment-mode: rawdeployment
      custom_images: '{ "configmap__agent": "172.17.0.2:5000/kserve/agent:v0.10.0", "configmap__batcher": "172.17.0.2:5000/kserve/agent:v0.10.0", "configmap__explainers__alibi": "172.17.0.2:5000/kserve/alibi-explainer:latest", "configmap__explainers__aix": "172.17.0.2:5000/kserve/aix-explainer:latest", "configmap__explainers__art": "172.17.0.2:5000/kserve/art-explainer:latest", "configmap__logger": "172.17.0.2:5000/kserve/agent:v0.10.0", "configmap__router": "172.17.0.2:5000/kserve/router:v0.10.0", "configmap__storageInitializer": "172.17.0.2:5000/kserve/storage-initializer:v0.10.0", "serving_runtimes__lgbserver": "172.17.0.2:5000/kserve/lgbserver:v0.10.0", "serving_runtimes__kserve_mlserver": "172.17.0.2:5000/seldonio/mlserver:1.0.0", "serving_runtimes__paddleserver": "172.17.0.2:5000/kserve/paddleserver:v0.10.0", "serving_runtimes__pmmlserver": "172.17.0.2:5000/kserve/pmmlserver:v0.10.0", "serving_runtimes__sklearnserver": "172.17.0.2:5000/kserve/sklearnserver:v0.10.0", "serving_runtimes__tensorflow_serving": "172.17.0.2:5000/tensorflow/serving:2.6.2", "serving_runtimes__torchserve": "172.17.0.2:5000/pytorch/torchserve-kfs:0.7.0", "serving_runtimes__tritonserver": "172.17.0.2:5000/nvidia/tritonserver:21.09-py3", "serving_runtimes__xgbserver": "172.17.0.2:5000/kserve/xgbserver:v0.10.0", }'
    resources:
      kserve-controller-image: 172.17.0.2:5000/kserve/kserve-controller:v0.10.0
      kube-rbac-proxy-image: 172.17.0.2:5000/kubebuilder/kube-rbac-proxy:v0.10.0
  kubeflow-dashboard:
    charm: ./kubeflow-dashboard_f138e5a.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/kubeflownotebookswg/centraldashboard:v1.7.0
  kubeflow-profiles:
    charm: ./kubeflow-profiles_52cc101.charm
    scale: 1
    trust: true
    resources:
      profile-image: 172.17.0.2:5000/kubeflownotebookswg/profile-controller:v1.7.0
      kfam-image: 172.17.0.2:5000/kubeflownotebookswg/kfam:v1.7.0
  kubeflow-roles:
    charm: ./kubeflow-roles_d034aa7.charm
    scale: 1
    trust: true
  kubeflow-volumes:
    charm: ./kubeflow-volumes_c647a89.charm
    scale: 1
    resources:
      oci-image: 172.17.0.2:5000/kubeflownotebookswg/volumes-web-app:v1.7.0
  metacontroller-operator:
    charm: ./metacontroller-operator_0adbc5a.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/metacontrollerio/metacontroller:v3.0.0
  minio:
    charm: ./minio_0d03693.charm
    scale: 1
    resources:
      oci-image: 172.17.0.2:5000/minio/minio:RELEASE.2021-09-03T03-56-13Z
  oidc-gatekeeper:
    charm: ./oidc-gatekeeper_2d6d677.charm
    scale: 1
    resources:
      oci-image: 172.17.0.2:5000/arrikto/kubeflow/oidc-authservice:e236439
  seldon-controller-manager:
    charm: ./seldon-core_9a712f3.charm
    scale: 1
    trust: true
    resources:
      oci-image: 172.17.0.2:5000/charmedkubeflow/seldon-core-operator:v1.15.0_22.04_1
    options:
      custom_images: '{ "configmap__predictor__tensorflow__tensorflow": "172.17.0.2:5000/tensorflow/serving:2.1.0", "configmap__predictor__tensorflow__seldon": "172.17.0.2:5000/seldonio/tfserving-proxy:1.15.0", "configmap__predictor__sklearn__seldon": "172.17.0.2:5000/charmedkubeflow/sklearnserver:v1.16.0_20.04_1", "configmap__predictor__sklearn__v2": "172.17.0.2:5000/charmedkubeflow/mlserver-sklearn:1.2.0_22.04_1", "configmap__predictor__xgboost__seldon": "172.17.0.2:5000/seldonio/xgboostserver:1.15.0", "configmap__predictor__xgboost__v2": "172.17.0.2:5000/charmedkubeflow/mlserver-xgboost:1.2.0_22.04_1", "configmap__predictor__mlflow__seldon": "172.17.0.2:5000/seldonio/mlflowserver:1.15.0", "configmap__predictor__mlflow__v2": "172.17.0.2:5000/charmedkubeflow/mlserver-mlflow:1.2.0_22.04_1", "configmap__predictor__triton__v2": "172.17.0.2:5000/nvidia/tritonserver:21.08-py3", "configmap__predictor__huggingface__v2": "172.17.0.2:5000/charmedkubeflow/mlserver-huggingface:1.2.4_22.04_1", "configmap__predictor__tempo_server__v2": "172.17.0.2:5000/seldonio/mlserver:1.2.0-slim", "configmap_storageInitializer": "172.17.0.2:5000/seldonio/rclone-storage-initializer:1.14.1", "configmap_explainer": "172.17.0.2:5000/seldonio/alibiexplainer:1.15.0", "configmap_explainer_v2": "172.17.0.2:5000/seldonio/mlserver:1.2.0-alibi-explain", }'
  tensorboard-controller:
    charm: ./tensorboard-controller_9cc1392.charm
    scale: 1
    trust: true
    resources:
      tensorboard-controller-image: 172.17.0.2:5000/kubeflownotebookswg/tensorboard-controller:v1.7.0
  tensorboards-web-app:
    charm: ./tensorboards-web-app_97ed301.charm
    scale: 1
    trust: true
    resources:
      tensorboards-web-app-image: 172.17.0.2:5000/kubeflownotebookswg/tensorboards-web-app:v1.7.0
  training-operator:
    charm: ./training-operator_6151cbb.charm
    scale: 1
    trust: true
    resources:
      training-operator-image: 172.17.0.2:5000/kubeflow/training-operator:v1-66aa635
relations:
  - [argo-controller, minio]
  - [dex-auth:oidc-client, oidc-gatekeeper:oidc-client]
  - [istio-pilot:ingress, dex-auth:ingress]
  - [istio-pilot:ingress, jupyter-ui:ingress]
  - [istio-pilot:ingress, katib-ui:ingress]
  - [istio-pilot:ingress, kfp-ui:ingress]
  - [istio-pilot:ingress, kubeflow-dashboard:ingress]
  - [istio-pilot:ingress, kubeflow-volumes:ingress]
  - [istio-pilot:ingress, oidc-gatekeeper:ingress]
  - [istio-pilot:ingress-auth, oidc-gatekeeper:ingress-auth]
  - [istio-pilot:istio-pilot, istio-ingressgateway:istio-pilot]
  - [istio-pilot:ingress, tensorboards-web-app:ingress]
  - [istio-pilot:gateway-info, tensorboard-controller:gateway-info]
  - [katib-db-manager:relational-db, katib-db:database]
  - [kfp-api:relational-db, kfp-db:database]
  - [kfp-api:kfp-api, kfp-persistence:kfp-api]
  - [kfp-api:kfp-api, kfp-ui:kfp-api]
  - [kfp-api:kfp-viz, kfp-viz:kfp-viz]
  - [kfp-api:object-storage, minio:object-storage]
  - [kfp-profile-controller:object-storage, minio:object-storage]
  - [kfp-ui:object-storage, minio:object-storage]
  - [kserve-controller:ingress-gateway, istio-pilot:gateway-info]
  - [kserve-controller:local-gateway, knative-serving:local-gateway]
  - [kubeflow-profiles, kubeflow-dashboard]
  - [kubeflow-dashboard:links, jupyter-ui:dashboard-links]
  - [kubeflow-dashboard:links, katib-ui:dashboard-links]
  - [kubeflow-dashboard:links, kfp-ui:dashboard-links]
  - [kubeflow-dashboard:links, kubeflow-volumes:dashboard-links]
  - [kubeflow-dashboard:links, tensorboards-web-app:dashboard-links]
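
When a bundle like this is edited by hand, a quick consistency check can catch relations that name an application missing from the applications section. The sketch below assumes the bundle has already been loaded into Python structures (for example with a YAML parser). `undefined_relation_apps` is a hypothetical helper, not a juju feature:

```python
from typing import Iterable, List, Mapping, Sequence


def undefined_relation_apps(
    applications: Mapping[str, object],
    relations: Iterable[Sequence[str]],
) -> List[str]:
    """Return application names used in relations but not defined.

    Each relation is a pair of endpoints written as "app" or "app:endpoint".
    """
    missing = set()
    for pair in relations:
        for endpoint in pair:
            app = endpoint.split(":", 1)[0]
            if app not in applications:
                missing.add(app)
    return sorted(missing)


# Illustrative subset of the bundle above, with minio left out on purpose.
apps = {"argo-controller": {}, "istio-pilot": {}, "dex-auth": {}}
rels = [["argo-controller", "minio"], ["istio-pilot:ingress", "dex-auth:ingress"]]
print(undefined_relation_apps(apps, rels))
```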

orfeas-k commented 1 year ago

Regarding knative-serving, we also didn't use the queue-proxy image, which is used to spin up a container of the same name. In case we need it, we should add the following line to the custom_images configuration:

    "queue-proxy": "172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c",

EDIT: We need the queue-proxy image after all, so I'll update the command accordingly.
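
Adding an entry such as queue-proxy by hand inside a long quoted string is easy to get wrong. A small sketch of doing it programmatically and emitting the matching juju config command follows. `add_custom_image` and `config_command` are hypothetical helpers, and serializing the value as JSON is an assumption about what the charm accepts:

```python
import json
import shlex


def add_custom_image(custom_images: dict, name: str, image: str) -> dict:
    """Return a copy of the mapping with one entry added or replaced."""
    updated = dict(custom_images)
    updated[name] = image
    return updated


def config_command(app: str, key: str, value: dict) -> str:
    """Build a `juju config <app> <key>=<json>` command, shell-quoted."""
    return shlex.join(["juju", "config", app, f"{key}={json.dumps(value)}"])


# Shortened, illustrative image references.
images = {"activator": "172.17.0.2:5000/example/activator:tag"}
images = add_custom_image(images, "queue-proxy", "172.17.0.2:5000/example/queue:tag")
print(config_command("knative-serving", "custom_images", images))
```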

orfeas-k commented 1 year ago

Updated knative-serving command and it works (almost).

The only issue is with the activator pod, which never reaches Running with Ready 1/1 and ends up in a CrashLoopBackOff state. We've gathered the following errors:

{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.454200818Z","logger":"activator","caller":"websocket/connection.go:144","message":"Websocket connection could not be established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"dial tcp: lookup autoscaler.knative-serving.svc.cluster.local: i/o timeout","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:144\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:225\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:222\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:228\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:423\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:222\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:162"}
{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.787749703Z","logger":"activator","caller":"websocket/connection.go:191","message":"Failed to send ping message to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"connection has not yet been established","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:191"}
{"severity":"WARNING","timestamp":"2023-09-04T08:41:31.05744278Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f"}

Looking into it with @kimwnasptd, we believe this may be caused by the gateway not working in the testing environment (we have istio deployed in the cluster, but it is not functioning). We expect the issue to go away when knative-serving is deployed as part of the CKF bundle. Here is a relevant issue, https://github.com/knative/serving/issues/4407, and here is one that we looked into but do not think really relates to ours: https://github.com/knative/serving/issues/11544.

Make sure everything is ok

Get all the deployments and pods in the knative-serving namespace and make sure there are no errors and that they're all running (or have run).
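
As a sketch of automating the pod part of that check, the function below takes the parsed output of `kubectl get pods -n knative-serving -o json` and reports pods that are neither completed nor running with all containers ready. `unhealthy_pods` is a hypothetical helper. It only reads the standard `status.phase` and `status.containerStatuses[].ready` fields of the Pod object:

```python
from typing import List, Tuple


def unhealthy_pods(pod_list: dict) -> List[Tuple[str, str]]:
    """Return (pod name, reason) for pods that need attention.

    A pod in phase "Succeeded" has run to completion and is fine. A pod in
    phase "Running" is fine only if every container reports ready; a pod in
    CrashLoopBackOff typically shows up here as "NotReady".
    """
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue
        if phase != "Running":
            problems.append((name, phase))
        elif not all(c.get("ready") for c in status.get("containerStatuses", [])):
            problems.append((name, "NotReady"))
    return problems


# Hand-written sample shaped like the kubectl JSON output.
sample = {
    "items": [
        {"metadata": {"name": "activator-768b674d7c-dzd6f"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": False}]}},
        {"metadata": {"name": "autoscaler-example"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
    ]
}
print(unhealthy_pods(sample))
```

In practice the input could be loaded with `json.load(sys.stdin)` after piping the kubectl output into the script.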

kimwnasptd commented 1 year ago

After the metacontroller PR (https://github.com/canonical/metacontroller-operator/pull/83) was merged, I managed to deploy the metacontroller charm with the following command:

juju deploy ./metacontroller-operator_ubuntu-20.04-amd64.charm --config metacontroller-image=172.17.0.2:5000/metacontrollerio/metacontroller:v3.0.0

i-chvets commented 12 months ago

Task is completed. Closing.