canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Running the kfp_v2 integration test, the experiment shows `Cannot get MLMD objects from Metadata store.` #966

Open nishant-dash opened 6 days ago

nishant-dash commented 6 days ago

Bug Description

When running the kfp_v2 integration test from https://github.com/canonical/charmed-kubeflow-uats/tree/main/tests at commit [0], the experiment shows the error `Cannot get MLMD objects from Metadata store.`

[0] fe86b4e255c4c695376f70061a6a645301350d5a

To Reproduce

Environment

Model     Controller  Cloud/Region          Version  SLA          Timestamp
kubeflow  manual      k8s-cloud/<REDACTED>  3.4.3    unsupported  15:37:17Z

SAAS        Status  Store  URL
grafana     active  local  admin/cos.grafana
prometheus  active  local  admin/cos.prometheus

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301  a.b.c.d  no       
argo-controller                                     active      1  argo-controller          3.3.10/stable    424  a.b.c.d   no       
dex-auth                                            active      1  dex-auth                 2.36/stable      422  a.b.c.d    no       
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       194  a.b.c.d    no       
grafana-agent-k8s          0.40.4                   active      1  grafana-agent-k8s        latest/edge       80  a.b.c.d    no       logging-consumer: off
istio-ingressgateway                                active      1  istio-gateway            1.17/stable     1000  a.b.c.d  no       
istio-pilot                                         active      1  istio-pilot              1.17/stable     1011  a.b.c.d  no       
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849  a.b.c.d   no       
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858  a.b.c.d    no       
katib-controller           res:oci-image@31ccd70    active      1  katib-controller         0.16/stable      576  a.b.c.d  no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d   no       
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      539  a.b.c.d  no       
katib-ui                                            active      1  katib-ui                 0.16/stable      422  a.b.c.d   no       
kfp-api                                             active      1  kfp-api                  2.0/stable      1283  a.b.c.d  no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d   no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       334  a.b.c.d  no       
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1291  a.b.c.d  no       
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable      1315  a.b.c.d   no       
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1302  a.b.c.d    no       
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1285  a.b.c.d   no       
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1317  a.b.c.d   no       
kfp-viz                                             active      1  kfp-viz                  2.0/stable      1235  a.b.c.d    no       
knative-eventing                                    active      1  knative-eventing         1.10/stable      353  a.b.c.d   no       
knative-operator                                    active      1  knative-operator         1.10/stable      328  a.b.c.d   no       
knative-serving                                     active      1  knative-serving          1.10/stable      409  a.b.c.d    no       
kserve-controller                                   active      1  kserve-controller        0.11/stable      523  a.b.c.d  no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       582  a.b.c.d   no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355  a.b.c.d    no       
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187  a.b.c.d   no       
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260  a.b.c.d    no       
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252  a.b.c.d   no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278  a.b.c.d  no       
mlflow-mysql               8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d    no       
mlflow-server                                       active      1  mlflow-server            2.1/stable       466  a.b.c.d    no       
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127  a.b.c.d   no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350  a.b.c.d    no       
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30  a.b.c.d  no       
resource-dispatcher                                 active      1  resource-dispatcher      1.0/stable        93  a.b.c.d   no       
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664  a.b.c.d   no       
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257  a.b.c.d   no       
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245  a.b.c.d   no       
training-operator                                   active      1  training-operator        1.7/stable       347  a.b.c.d   no  

Relevant Log Output

$ kubectl logs -n kubeflow  mlmd-0
Defaulted container "mlmd" out of: mlmd, juju-pod-init (init)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0701 08:27:19.354198     1 metadata_store_server_main.cc:577] Server listening on 0.0.0.0:8080
W0703 15:17:20.894055    10 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.Pipeline`, version: `nullopt`
W0703 15:17:20.922753    10 metadata_store_service_impl.cc:239] GetContextType failed: No type found for query, name: `system.PipelineRun`, version: `nullopt`
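
For triage, the two warnings above can be reduced to the context type names that MLMD could not find. The following is a minimal sketch, assuming the log text has already been captured (for example from the `kubectl logs` command above) and that the warning format matches the lines quoted here. The function name `missing_context_types` is hypothetical and not part of MLMD or Kubeflow.

```python
import re

# Matches the MLMD warning format quoted in the log output above and
# captures the type name between the backticks.
_MISSING_TYPE = re.compile(
    r"GetContextType failed: No type found for query, name: `([^`]+)`"
)


def missing_context_types(log_text):
    """Return the context type names reported as missing, in first-seen order."""
    seen = []
    for match in _MISSING_TYPE.finditer(log_text):
        name = match.group(1)
        if name not in seen:  # report each type once, even if logged repeatedly
            seen.append(name)
    return seen


# The two warning lines from this report, used as sample input.
sample = (
    "W0703 15:17:20.894055    10 metadata_store_service_impl.cc:239] "
    "GetContextType failed: No type found for query, "
    "name: `system.Pipeline`, version: `nullopt`\n"
    "W0703 15:17:20.922753    10 metadata_store_service_impl.cc:239] "
    "GetContextType failed: No type found for query, "
    "name: `system.PipelineRun`, version: `nullopt`\n"
)
print(missing_context_types(sample))
```

Lines that do not contain a `GetContextType failed` warning, such as the `Server listening` line, are ignored.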

Additional Context

No response

syncronize-issues-to-jira[bot] commented 6 days ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5956.

This message was autogenerated

kimwnasptd commented 5 days ago

This should be a transient error shown by MLMD; see https://github.com/kubeflow/pipelines/issues/8733#issuecomment-1749475865

@nishant-dash could you confirm if the run completed successfully in the end?

kimwnasptd commented 5 days ago

Just saw that it finished, so I will go ahead and close this issue.

kimwnasptd commented 5 days ago

Re-opening this issue. We managed to reproduce it while bulk testing the Azure one-click deployment.

The message in our case was not transient: the red error box stayed on screen, and the UI never updated to show the pipeline run's progress. Looking at the browser's dev tools, we saw that an istio-proxy was responding with 503 errors to the requests sent from the browser to MLMD.

A short-term workaround was to delete the envoy-xxxx-yyyy pod, after which the UI's requests succeeded.
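
The workaround can be scripted roughly as follows. This is only a sketch under assumptions: the `kubeflow` namespace is taken from the `kubectl logs` command in the original report, `envoy-xxxx-yyyy` is the placeholder used in this thread and must be replaced with the real pod name, and the helper `delete_pod` with its injectable `run` parameter is hypothetical, added so the command can be inspected without a cluster.

```python
import subprocess


def delete_pod(pod_name, namespace="kubeflow", run=subprocess.run):
    """Delete one pod with kubectl and return the argv that was executed.

    `run` defaults to subprocess.run; pass a stand-in to inspect the
    command without touching a cluster.
    """
    cmd = ["kubectl", "delete", "pod", pod_name, "-n", namespace]
    run(cmd, check=True)  # check=True raises if kubectl exits non-zero
    return cmd


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    delete_pod("envoy-xxxx-yyyy", run=lambda cmd, check: print(" ".join(cmd)))
```

Calling `delete_pod("<actual envoy pod name>")` without the `run` argument executes the deletion. This assumes the pod is recreated automatically afterwards, which is what the behaviour described above suggests.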