canonical / charmed-kubeflow-uats

Automated UATs for Charmed Kubeflow
Apache License 2.0

can't successfully run kfp_v2 uats behind proxy #78

Open nishant-dash opened 5 days ago

nishant-dash commented 5 days ago

Bug Description

When running the kfp_v2 integration test from https://github.com/canonical/charmed-kubeflow-uats/tree/main/tests (commit [0]), the experiment fails. The pod logs show:

$ kubectl logs -n dash pod/condition-v2-bcjmm-2542164255 
I0703 15:00:08.666715      61 launcher_v2.go:90] input ComponentSpec:{
  "inputDefinitions": {
    "parameters": {
      "force_flip_result": {
        "parameterType": "STRING",
        "defaultValue": "",
        "isOptional": true
      }
    }
  },
  "outputDefinitions": {
    "parameters": {
      "Output": {
        "parameterType": "STRING"
      }
    }
  },
  "executorLabel": "exec-flip-coin"
}
I0703 15:00:08.667506      61 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0703 15:00:08.667522      61 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
I0703 15:00:08.710062      61 object_store.go:306] Cannot detect minio-service in the same namespace, default to minio-service.kubeflow:9000 as MinIO endpoint.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fe8fe8f9a50>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fe8fea17110>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/

To Reproduce

In a Kubeflow 1.8 environment behind a proxy, run the kfp_v2 UATs.

Environment

Model     Controller            Cloud/Region  Version  SLA          Timestamp
kubeflow  openstack-<REDACTED>  k8s/default   3.4.3    unsupported  16:12:37Z

SAAS                             Status   Store                 URL
alertmanager-karma-dashboard     active   openstack-<REDACTED>  admin/cos.alertmanager-karma-dashboard
grafana-dashboards               active   openstack-<REDACTED>  admin/cos.grafana-dashboards
loki-logging                     active   openstack-<REDACTED>  admin/cos.loki-logging
prometheus-receive-remote-write  active   openstack-<REDACTED>  admin/cos.prometheus-receive-remote-write
prometheus-scrape                active   openstack-<REDACTED>  admin/cos.prometheus-scrape
scrape-interval-config-metrics   blocked  openstack-<REDACTED>  admin/cos.scrape-interval-config-metrics
scrape-interval-config-monitors  blocked  openstack-<REDACTED>  admin/cos.scrape-interval-config-monitors

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address        Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301  a.b.c.d  no       
argo-controller                                     active      1  argo-controller          3.3.10/stable    424  a.b.c.d  no       
dex-auth                                            active      1  dex-auth                 2.36/stable      422  a.b.c.d   no       
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       194  a.b.c.d   no       
grafana-agent-kubeflow     0.40.4                   active      1  grafana-agent-k8s        latest/edge       80  a.b.c.d   no       
istio-ingressgateway                                active      1  istio-gateway            1.17/stable     1000  a.b.c.d   no       
istio-pilot                                         active      1  istio-pilot              1.17/stable     1011  a.b.c.d  no       
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849  a.b.c.d   no       
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858  a.b.c.d     no       
katib-controller           res:oci-image@31ccd70    active      1  katib-controller         0.16/stable      576  a.b.c.d   no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d  no       
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      539  a.b.c.d   no       
katib-ui                                            active      1  katib-ui                 0.16/stable      422  a.b.c.d    no       
kfp-api                                             active      1  kfp-api                  2.0/stable      1283  a.b.c.d   no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d  no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       334  a.b.c.d   no       
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1291  a.b.c.d    no       
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable      1315  a.b.c.d  no       
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1302  a.b.c.d   no       
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1285  a.b.c.d   no       
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1317  a.b.c.d  no       
kfp-viz                                             active      1  kfp-viz                  2.0/stable      1235  a.b.c.d   no       
knative-eventing                                    active      1  knative-eventing         1.10/stable      353  a.b.c.d   no       
knative-operator                                    active      1  knative-operator         1.10/stable      328  a.b.c.d    no       
knative-serving                                     active      1  knative-serving          1.10/stable      409  a.b.c.d   no       
kserve-controller                                   active      1  kserve-controller        0.11/stable      523  a.b.c.d  no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       454  a.b.c.d    no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355  a.b.c.d  no       
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187  a.b.c.d    no       
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260  a.b.c.d   no       
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252  a.b.c.d    no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278  a.b.c.d   no       
mlflow-minio               res:oci-image@1755999    active      1  minio                    ckf-1.7/stable   214  a.b.c.d  no       
mlflow-mysql               8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153  a.b.c.d  no       
mlflow-server                                       active      1  mlflow-server            2.1/stable       466  a.b.c.d   no       
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127  a.b.c.d  no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350  a.b.c.d    no       
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30  a.b.c.d  no       
resource-dispatcher                                 active      1  resource-dispatcher      1.0/stable        93  a.b.c.d   no       
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664  a.b.c.d    no       
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257  a.b.c.d   no       
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245  a.b.c.d    no       
training-operator                                   active      1  training-operator        1.7/stable       347  a.b.c.d   no       

Relevant Log Output

$ kubectl get all -n dash
NAME                                                  READY   STATUS      RESTARTS   AGE
pod/condition-v2-bcjmm-1791838033                     0/2     Completed   0          3m40s
pod/condition-v2-bcjmm-2085347550                     0/2     Completed   0          4m
pod/condition-v2-bcjmm-2542164255                     2/2     Running     0          3m30s
pod/ml-pipeline-ui-artifact-6b89ccc469-djz6v          2/2     Running     0          5d5h
pod/ml-pipeline-visualizationserver-955b54775-l9v7p   2/2     Running     0          4d20h
pod/test-dash-2-0                                     2/2     Running     0          9m54s

NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/ml-pipeline-ui-artifact           ClusterIP   10.87.227.78    <none>        80/TCP     5d5h
service/ml-pipeline-visualizationserver   ClusterIP   10.87.207.107   <none>        8888/TCP   5d5h
service/test-dash-2                       ClusterIP   10.87.85.27     <none>        80/TCP     9m54s

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ml-pipeline-ui-artifact           1/1     1            1           5d5h
deployment.apps/ml-pipeline-visualizationserver   1/1     1            1           5d5h

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/ml-pipeline-ui-artifact-6b89ccc469          1         1         1       5d5h
replicaset.apps/ml-pipeline-visualizationserver-955b54775   1         1         1       5d5h

NAME                           READY   AGE
statefulset.apps/test-dash-2   1/1     9m54s

Additional Context

No response

syncronize-issues-to-jira[bot] commented 5 days ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5957.

This message was autogenerated

NohaIhab commented 17 hours ago

Hi @nishant-dash ,

  1. Can you provide the logs you get from the notebook execution? It'd be useful to see which step the notebook fails at, so we can tell which part of it is trying to reach the internet.
  2. Have you tried configuring the pipeline as linked in the example notebook in our how-to guide? If you have a different configuration, please provide us with that as well.
  3. Which Kubernetes distribution and version are you using?

In the meantime, we are prioritizing this issue and will try to reproduce it.
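
For context on point 2, here is a minimal sketch of what passing proxy settings to pipeline steps could look like. It assumes the KFP v2 SDK, where a task object exposes `set_env_variable(name, value)`. The helper names `proxy_env` and `apply_proxy`, the proxy URL, and the `NO_PROXY` entries are hypothetical placeholders and are not taken from the how-to guide.

```python
from typing import Dict


def proxy_env(http_proxy: str, https_proxy: str, no_proxy: str) -> Dict[str, str]:
    """Build proxy variables in both upper and lower case.

    Tools differ in which spelling they read, so both are set.
    """
    values = {
        "HTTP_PROXY": http_proxy,
        "HTTPS_PROXY": https_proxy,
        "NO_PROXY": no_proxy,
    }
    env: Dict[str, str] = {}
    for name, value in values.items():
        env[name] = value
        env[name.lower()] = value
    return env


def apply_proxy(task, env: Dict[str, str]):
    """Set every variable on a task-like object.

    `task` only needs a `set_env_variable(name, value)` method, which is
    what a KFP v2 pipeline task is assumed to provide here.
    """
    for name, value in env.items():
        task.set_env_variable(name=name, value=value)
    return task


# Hypothetical values: replace with the proxy of your environment.
# ".kubeflow" is meant to keep in-cluster endpoints such as
# ml-pipeline.kubeflow and minio-service.kubeflow off the proxy.
EXAMPLE_ENV = proxy_env(
    http_proxy="http://proxy.example:3128",
    https_proxy="http://proxy.example:3128",
    no_proxy=".kubeflow,.svc,.cluster.local,localhost,127.0.0.1",
)
```

Inside a pipeline function, each step would then be wrapped, for example `apply_proxy(some_component(), EXAMPLE_ENV)`, where `some_component` stands for any component in the pipeline. Whether this is sufficient depends on the cluster, since image pulls are handled by the container runtime rather than by the step's environment.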

nishant-dash commented 16 hours ago

for

  1. it fails on the last cell of the kfp v2 integration test notebook
    
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    Cell In[14], line 4
          1 # fetch KFP experiment to ensure it exists
          2 client.get_experiment(experiment_name=EXPERIMENT_NAME)
    ----> 4 assert_run_succeeded(client, run.run_id)

    File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
        334 copy = self.copy()
        335 wrapped_f.statistics = copy.statistics  # type: ignore[attr-defined]
    --> 336 return copy(f, *args, **kw)

    File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:475, in Retrying.__call__(self, fn, *args, **kwargs)
        473 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
        474 while True:
    --> 475     do = self.iter(retry_state=retry_state)
        476     if isinstance(do, DoAttempt):
        477         try:

    File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:376, in BaseRetrying.iter(self, retry_state)
        374 result = None
        375 for action in self.iter_state.actions:
    --> 376     result = action(retry_state)
        377 return result

    File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:418, in BaseRetrying._post_stop_check_actions.<locals>.exc_check(rs)
        416 retry_exc = self.retry_error_cls(fut)
        417 if self.reraise:
    --> 418     raise retry_exc.reraise()
        419 raise retry_exc from fut.exception()

    File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:185, in RetryError.reraise(self)
        183 def reraise(self) -> t.NoReturn:
        184     if self.last_attempt.failed:
    --> 185         raise self.last_attempt.result()
        186     raise self

    File /opt/conda/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
        447     raise CancelledError()
        448 elif self._state == FINISHED:
    --> 449     return self.__get_result()
        451 self._condition.wait(timeout)
        453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

    File /opt/conda/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
        399 if self._exception:
        400     try:
    --> 401         raise self._exception
        402     finally:
        403         # Break a reference cycle with the exception in self._exception
        404         self = None

    File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:478, in Retrying.__call__(self, fn, *args, **kwargs)
        476 if isinstance(do, DoAttempt):
        477     try:
    --> 478         result = fn(*args, **kwargs)
        479     except BaseException:  # noqa: B902
        480         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

    Cell In[13], line 9, in assert_run_succeeded(client, run_id)
          7 """Wait for the run to complete successfully."""
          8 status = client.get_run(run_id).state
    ----> 9 assert status == "SUCCEEDED", f"KFP run in {status} state."

    AssertionError: KFP run in RUNNING state.

2. But isn't this for a cluster-internal service, maybe kfp? (Perhaps the no-proxy setting needs tweaking at the containerd level?)
```console
Network is unreachable')': /simple/kfp/
```

DnPlas commented 14 hours ago

@nishant-dash also, to reproduce the issue, could you please tell us which method you are using to run the UATs? It was not clear from the issue description.

a) From inside a notebook
b) Using the driver