canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

ci(eks): Training-operator UAT fails on EKS k8s 1.26 #910

Closed misohu closed 2 weeks ago

misohu commented 1 month ago

Bug Description

The training-operator UAT starts failing once k8s-version is set to 1.26 on EKS, with the error: Notebook execution failed with RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0. This is the case for CKF 1.8/stable. Unfortunately, we do not have more detailed logs due to a known limitation of how our UATs run: https://github.com/canonical/charmed-kubeflow-uats/issues/4.

To Reproduce

Run the GitHub action called "Create EKS cluster, deploy CKF and run bundle test" from the GitHub UI.

Environment

EKS k8s 1.26 Juju 3.5

juju status for CKF 1.8:

Model     Controller           Cloud/Region      Version  SLA          Timestamp
kubeflow  kubeflow-controller  eks/eu-central-1  3.5.0    unsupported  11:49:03Z

App                        Version                  Status       Scale  Charm                    Channel          Rev  Address         Exposed  Message
admission-webhook                                   active           1  admission-webhook        1.8/stable       301  10.100.115.59   no
argo-controller                                     active           1  argo-controller          3.3.10/stable    424  10.100.175.7    no       
dex-auth                                            active           1  dex-auth                 2.36/stable      422  10.100.19.81    no
envoy                      res:oci-image@cc06b3e    active           1  envoy                    2.0/stable       194  10.100.207.115  no
istio-ingressgateway                                active           1  istio-gateway            1.17/stable      723  10.100.89.199   no       
istio-pilot                                         active           1  istio-pilot              1.17/stable      827  10.100.91.2     no       
jupyter-controller                                  active           1  jupyter-controller       1.8/stable       849  10.100.211.10   no
jupyter-ui                                          active           1  jupyter-ui               1.8/stable       858  10.100.7.15     no       
katib-controller           res:oci-image@b6a6100    active           1  katib-controller         0.16/stable      446  10.100.181.73   no       
katib-db                   8.0.35-0ubuntu0.22.04.1  active           1  mysql-k8s                8.0/stable       127  10.100.250.105  no
katib-db-manager                                    active           1  katib-db-manager         0.16/stable      411  10.100.65.150   no       
katib-ui                                            active           1  katib-ui                 0.16/stable      422  10.100.247.57   no       
kfp-api                                             active           1  kfp-api                  2.0/stable      1283  10.100.154.13   no
kfp-db                     8.0.35-0ubuntu0.22.04.1  active           1  mysql-k8s                8.0/stable       127  10.100.130.176  no       
kfp-metadata-writer                                 active           1  kfp-metadata-writer      2.0/stable       334  10.100.237.178  no       
kfp-persistence                                     active           1  kfp-persistence          2.0/stable      1291  10.100.78.144   no
kfp-profile-controller                              active           1  kfp-profile-controller   2.0/stable      1248  10.100.94.20    no       
kfp-schedwf                                         active           1  kfp-schedwf              2.0/stable      1302  10.100.226.135  no       
kfp-ui                                              active           1  kfp-ui                   2.0/stable      1285  10.100.232.129  no       
kfp-viewer                                          active           1  kfp-viewer               2.0/stable      1317  10.100.190.218  no       
kfp-viz                                             active           1  kfp-viz                  2.0/stable      1235  10.100.155.163  no
knative-eventing                                    active           1  knative-eventing         1.10/stable      353  10.100.100.13   no       
knative-operator                                    active           1  knative-operator         1.10/stable      328  10.100.206.140  no       
knative-serving                                     active           1  knative-serving          1.10/stable      354  10.100.27.99    no       
kserve-controller                                   active           1  kserve-controller        0.11/stable      523  10.100.195.144  no       
kubeflow-dashboard                                  active           1  kubeflow-dashboard       1.8/stable       454  10.100.208.83   no       
kubeflow-profiles                                   active           1  kubeflow-profiles        1.8/stable       355  10.100.250.185  no       
kubeflow-roles                                      active           1  kubeflow-roles           1.8/stable       187  10.100.221.183  no       
kubeflow-volumes           res:oci-image@2261827    active           1  kubeflow-volumes         1.8/stable       260  10.100.202.188  no       
metacontroller-operator                             active           1  metacontroller-operator  3.0/stable       252  10.100.79.149   no       
minio                      res:oci-image@1755999    active           1  minio                    ckf-1.8/stable   278  10.100.168.215  no
mlmd                       res:oci-image@44abc5d    active           1  mlmd                     1.14/stable      127  10.100.74.191   no       
oidc-gatekeeper                                     active           1  oidc-gatekeeper          ckf-1.8/stable   350  10.100.112.203  no       
pvcviewer-operator                                  maintenance      1  pvcviewer-operator       1.8/stable        30  10.100.171.53   no       Reconciling charm: executing component kubernetes:auth-and-crds
seldon-controller-manager                           active           1  seldon-core              1.17/stable      664  10.100.227.113  no       
tensorboard-controller                              active           1  tensorboard-controller   1.8/stable       257  10.100.47.35    no       
tensorboards-web-app                                active           1  tensorboards-web-app     1.8/stable       245  10.100.202.39   no       
training-operator                                   active           1  training-operator        1.7/stable       347  10.100.202.2    no       

Unit                          Workload     Agent  Address         Ports          Message
admission-webhook/0*          active       idle   192.168.37.229                 
argo-controller/0*            active       idle   192.168.36.156                 
dex-auth/0*                   active       idle   192.168.26.171                 
envoy/0*                      active       idle   192.168.8.112   9090,9901/TCP  
istio-ingressgateway/0*       active       idle   192.168.58.32                  
istio-pilot/0*                active       idle   192.168.12.164                 
jupyter-controller/0*         active       idle   192.168.27.57                  
jupyter-ui/0*                 active       idle   192.168.16.191                 
katib-controller/0*           active       idle   192.168.0.240   443,8080/TCP   
katib-db-manager/0*           active       idle   192.168.12.138                 
katib-db/0*                   active       idle   192.168.46.0                   Primary
katib-ui/0*                   active       idle   192.168.16.234                 
kfp-api/0*                    active       idle   192.168.31.63                  
kfp-db/0*                     active       idle   192.168.23.114                 Primary
kfp-metadata-writer/0*        active       idle   192.168.50.121                 
kfp-persistence/0*            active       idle   192.168.54.111                 
kfp-profile-controller/0*     active       idle   192.168.54.143                 
kfp-schedwf/0*                active       idle   192.168.35.18
kfp-ui/0*                     active       idle   192.168.30.99
kfp-viewer/0*                 active       idle   192.168.43.232                 
kfp-viz/0*                    active       idle   192.168.22.39                  
knative-eventing/0*           active       idle   192.168.54.46                  
knative-operator/0*           active       idle   192.168.35.22                  
knative-serving/0*            active       idle   192.168.33.70                  
kserve-controller/0*          active       idle   192.168.57.96                  
kubeflow-dashboard/0*         active       idle   192.168.52.34                  
kubeflow-profiles/0*          active       idle   192.168.55.167                 
kubeflow-roles/0*             active       idle   192.168.51.155                 
kubeflow-volumes/0*           active       idle   192.168.56.236  5000/TCP       
metacontroller-operator/0*    active       idle   192.168.9.94                   
minio/0*                      active       idle   192.168.42.27   9000-9001/TCP  
mlmd/0*                       active       idle   192.168.55.144  8080/TCP       
oidc-gatekeeper/0*            active       idle   192.168.31.159                 
pvcviewer-operator/0*         maintenance  idle   192.168.33.17                  Reconciling charm: executing component kubernetes:auth-and-crds
seldon-controller-manager/0*  active       idle   192.168.4.202
tensorboard-controller/0*     active       idle   192.168.1.246                  
tensorboards-web-app/0*       active       idle   192.168.12.146                 
training-operator/0*          active       idle   192.168.25.46

Relevant Log Output

=================================== FAILURES ===================================
_____________________ test_notebook[training-integration] ______________________

test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/training/training-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
>           output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})

/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/test_notebooks.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:100: in preprocess
    self.preprocess_cell(cell, resources, index)
/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:121: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py:166: in wrapped
    return loop.run_until_complete(inner)
/opt/conda/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
/opt/conda/lib/python3.8/site-packages/nbclient/client.py:1021: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7f7173fad940>
cell = {'cell_type': 'code', 'execution_count': 37, 'metadata': {'execution': {'iopub.status.busy': '2024-05-28T11:45:41.8652...de a master replica type\nprint_training_logs(client, PADDLEJOB_NAME, container=PADDLEJOB_CONTAINER, is_master=False)'}
cell_index = 66
exec_reply = {'buffers': [], 'content': {'ename': 'RuntimeError', 'engine_info': {'engine_id': -1, 'engine_uuid': 'a3a2e056-a7a6-4c...e, 'engine': 'a3a2e056-a7a6-4ca3-8c13-9bc98e526670', 'started': '2024-05-28T11:45:41.865720Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:

        if exec_reply is None:
            return None

        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None

        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           # set is_master to False because this example does not include a master replica type
E           print_training_logs(client, PADDLEJOB_NAME, container=PADDLEJOB_CONTAINER, is_master=False)
E           ------------------
E           
E           ---------------------------------------------------------------------------
E           ApiException                              Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/kubeflow/training/api/training_client.py:574, in TrainingClient.get_job_logs(self, name, namespace, is_master, replica_type, replica_index, container, follow, timeout)
E               573 try:
E           --> 574     pod_logs = self.core_api.read_namespaced_pod_log(
E               575 pod,namespace,container=container
E               576 )
E               577     logging.info("The logs of pod %s:\n %s", pod, pod_logs)
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py:23957, in CoreV1Api.read_namespaced_pod_log(self, name, namespace, **kwargs)
E             23956 kwargs['_return_http_data_only'] = True
E           > 23957 return self.read_namespaced_pod_log_with_http_info(name,namespace,**kwargs)
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py:24076, in CoreV1Api.read_namespaced_pod_log_with_http_info(self, name, namespace, **kwargs)
E             24074 auth_settings = ['BearerToken']  # noqa: E501
E           > 24076 return self.api_client.call_api(
E             24077 '/api/v1/namespaces/{namespace}/pods/{name}/log','GET',
E             24078 path_params,
E             24079 query_params,
E             24080 header_params,
E             24081 body=body_params,
E             24082 post_params=form_params,
E             24083 files=local_var_files,
E             24084 response_type='str',# noqa: E501
E             24085 auth_settings=auth_settings,
E             24086 async_req=local_var_params.get('async_req'),
E             24087 _return_http_data_only=local_var_params.get('_return_http_data_only'),# noqa: E501
E             24088 _preload_content=local_var_params.get('_preload_content',True),
E             24089 _request_timeout=local_var_params.get('_request_timeout'),
E             24090 collection_formats=collection_formats)
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/api_client.py:348, in ApiClient.call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, async_req, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
E               347 if not async_req:
E           --> 348     return self.__call_api(resource_path,method,
E               349 path_params,query_params,header_params,
E               350 body,post_params,files,
E               351 response_type,auth_settings,
E               352 _return_http_data_only,collection_formats,
E               353 _preload_content,_request_timeout,_host)
E               355 return self.pool.apply_async(self.__call_api, (resource_path,
E               356                                                method, path_params,
E               357                                                query_params,
E              (...)
E               365                                                _request_timeout,
E               366                                                _host))
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/api_client.py:180, in ApiClient.__call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
E               179 # perform request and return response
E           --> 180 response_data = self.request(
E               181 method,url,query_params=query_params,headers=header_params,
E               182 post_params=post_params,body=body,
E               183 _preload_content=_preload_content,
E               184 _request_timeout=_request_timeout)
E               186 self.last_response = response_data
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/api_client.py:373, in ApiClient.request(self, method, url, query_params, headers, post_params, body, _preload_content, _request_timeout)
E               372 if method == "GET":
E           --> 373     return self.rest_client.GET(url,
E               374 query_params=query_params,
E               375 _preload_content=_preload_content,
E               376 _request_timeout=_request_timeout,
E               377 headers=headers)
E               378 elif method == "HEAD":
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/rest.py:244, in RESTClientObject.GET(self, url, headers, query_params, _preload_content, _request_timeout)
E               242 def GET(self, url, headers=None, query_params=None, _preload_content=True,
E               243         _request_timeout=None):
E           --> 244     return self.request("GET",url,
E               245 headers=headers,
E               246 _preload_content=_preload_content,
E               247 _request_timeout=_request_timeout,
E               248 query_params=query_params)
E           
E           File /opt/conda/lib/python3.8/site-packages/kubernetes/client/rest.py:238, in RESTClientObject.request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
E               237 if not 200 <= r.status <= 299:
E           --> 238     raise ApiException(http_resp=r)
E               240 return r
E           
E           ApiException: (400)
E           Reason: Bad Request
E           HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f6750967-a265-451a-a4b0-70f2055f0429', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 28 May 2024 11:45:41 GMT', 'Content-Length': '227'})
E           HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"paddle\" in pod \"paddle-simple-cpu-worker-0\" is waiting to start: trying and failing to pull image","reason":"BadRequest","code":400}
E           
E           
E           
E           During handling of the above exception, another exception occurred:
E           
E           RuntimeError                              Traceback (most recent call last)
E           Cell In[37], line 2
E                 1 # set is_master to False because this example does not include a master replica type
E           ----> 2 print_training_logs(client,PADDLEJOB_NAME,container=PADDLEJOB_CONTAINER,is_master=False)
E           
E           Cell In[3], line 2, in print_training_logs(client, job_name, container, is_master)
E                 1 def print_training_logs(client, job_name: str, container: str, is_master: bool = True):
E           ----> 2     logs = client.get_job_logs(name=job_name,container=container,is_master=is_master)
E                 3     print(logs)
E           
E           File /opt/conda/lib/python3.8/site-packages/kubeflow/training/api/training_client.py:579, in TrainingClient.get_job_logs(self, name, namespace, is_master, replica_type, replica_index, container, follow, timeout)
E               577     logging.info("The logs of pod %s:\n %s", pod, pod_logs)
E               578 except Exception:
E           --> 579     raise RuntimeError(
E               580         f"Failed to read logs for pod {namespace}/{pod}"
E               581     )
E           
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0

/opt/conda/lib/python3.8/site-packages/nbclient/client.py:915: CellExecutionError

During handling of the above exception, another exception occurred:

test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/training/training-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
>           pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")
E           Failed: Notebook execution failed with RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0

/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/test_notebooks.py:50: Failed
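
Note that the final RuntimeError hides the underlying cause, which only appears in the HTTP response body of the ApiException further up: the paddle container is waiting to start because its image cannot be pulled. As a minimal sketch of how that reason could be surfaced when triaging similar failures, the snippet below parses a Kubernetes Status body into a one-line summary. The helper name describe_k8s_status is hypothetical and is not part of the kubeflow training SDK, the kubernetes client, or the UATs; the body string is copied from the traceback above.

```python
import json


def describe_k8s_status(body: str) -> str:
    """Summarize a Kubernetes Status JSON body as 'reason (code): message'.

    Hypothetical helper for illustration only; not part of any library.
    """
    try:
        status = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON (or not a string at all): nothing to extract.
        return "unparseable response body"
    if not isinstance(status, dict):
        return "unparseable response body"
    reason = status.get("reason", "Unknown")
    code = status.get("code", "?")
    message = status.get("message", "")
    return f"{reason} ({code}): {message}"


# Response body copied from the ApiException in the traceback above.
BODY = (
    r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    r'"message":"container \"paddle\" in pod \"paddle-simple-cpu-worker-0\" '
    r'is waiting to start: trying and failing to pull image",'
    r'"reason":"BadRequest","code":400}'
)

if __name__ == "__main__":
    print(describe_k8s_status(BODY))
```

With the kubernetes Python client, the same string is available as the body attribute of the caught ApiException, so a wrapper around get_job_logs could log this summary before re-raising.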

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5748.

This message was autogenerated

orfeas-k commented 3 weeks ago

This should be closed by canonical/charmed-kubeflow-uats#68. If we hit this again, feel free to reopen.