canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Test the UATs for the 1.9 release on Microk8s #808

Open ca-scribner opened 5 months ago

ca-scribner commented 5 months ago

Context

The UATs (user acceptance tests) should be run on any new Kubeflow bundle prior to release.

What needs to get done

  1. execute the UATs on the kubeflow 1.9 release

Definition of Done

  1. UATs are passing for kubeflow 1.9 release

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5276.

This message was autogenerated

DnPlas commented 2 weeks ago

UATs in 1.9/beta

Identified issues:

I ran the UATs with tox -ve kubeflow-local. These are the results:

============================================================================ short test summary info =============================================================================
FAILED driver/test_kubeflow_workloads.py::test_kubeflow_workloads - Failed: Something went wrong while running Job test-kubeflow/test-kubeflow. Please inspect the attached logs for more info...
==================================================================== 1 failed, 1 passed in 1988.01s (0:33:08) ====================================================================
kubeflow-local: exit 1 (1989.54 seconds) /home/ubuntu/charmed-kubeflow-uats> pytest -vv --tb native /home/ubuntu/charmed-kubeflow-uats/driver/ -s --filter 'not mlflow' --model kubeflow pid=591250
  kubeflow-local: FAIL code 1 (1989.59=setup[0.05]+cmd[1989.54] seconds)
  evaluation failed :( (1989.68 seconds)
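
For long runs like this one, the failing test IDs can be pulled out of the pytest short summary programmatically. This is only a minimal sketch, assuming the log has been saved as text; the function name failed_tests is hypothetical, and the sample line is abridged from the output above.

```python
import re

# Matches pytest short-summary lines such as
# "FAILED path/to/test.py::test_name - reason".
FAILED_LINE = re.compile(r"^FAILED\s+(\S+)\s+-\s+(.*)$")


def failed_tests(log_text):
    """Return (test_id, reason) pairs found in a pytest short test summary."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            results.append((match.group(1), match.group(2)))
    return results


# Sample input, abridged from the tox output (hypothetical helper usage).
sample = (
    "FAILED driver/test_kubeflow_workloads.py::test_kubeflow_workloads"
    " - Failed: Something went wrong\n"
    "1 failed, 1 passed in 1988.01s (0:33:08)\n"
)
for test_id, reason in failed_tests(sample):
    print(test_id, "->", reason)
```

A bare FAILED progress line without a test ID is ignored, so only the summary entries are returned.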

According to the logs, the Katib integration test is failing:

------------------------------ Captured log call -------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running katib-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[8], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."
AssertionError: Katib Experiment was not successful.
=========================== short test summary info ============================
FAILED test_notebooks.py::test_notebook[katib-integration] - Failed: AssertionError: Katib Experiment was not successful.
============ 1 failed, 4 passed, 4 deselected in 1940.13s (0:32:20) ============
FAILED
------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------
INFO     test_kubeflow_workloads:test_kubeflow_workloads.py:82 Deleting Profile test-kubeflow...
INFO     httpx:_client.py:1013 HTTP Request: DELETE https://172.31.15.25:16443/apis/kubeflow.org/v1/profiles/test-kubeflow "HTTP/1.1 200 OK"
INFO     test_kubeflow_workloads:test_kubeflow_workloads.py:141 Deleting Job test-kubeflow/test-kubeflow...
INFO     httpx:_client.py:1013 HTTP Request: DELETE https://172.31.15.25:16443/apis/batch/v1/namespaces/test-kubeflow/jobs/test-kubeflow "HTTP/1.1 200 OK"
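
To judge whether the Katib Experiment simply ran out of time, it helps to estimate how long the retry loop in the notebook waits before giving up. The sketch below assumes the @retry decorator is tenacity's and that its exponential wait after attempt n is multiplier * 2**(n - 1), clamped to the [min, max] range; both points are assumptions, and the helper names are hypothetical.

```python
def wait_seconds(attempt, multiplier=2, min_wait=1, max_wait=10):
    """Wait after the given 1-based attempt (assumed formula, clamped to [min, max])."""
    raw = multiplier * 2 ** (attempt - 1)
    return max(min_wait, min(raw, max_wait))


def total_wait(attempts=30):
    """Sum of the waits between attempts; no wait follows the final attempt."""
    return sum(wait_seconds(n) for n in range(1, attempts))


print(total_wait())  # 274
```

Under these assumptions the Experiment has roughly 274 seconds (about 4.5 minutes) of waiting, plus the time spent in each status call, to reach a succeeded state before the AssertionError is raised.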

Looking a bit more into the logs, I can see the following:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.8
cachedir: .pytest_cache
rootdir: /tests
configfile: pytest.ini
plugins: anyio-3.6.2
collecting ... collected 9 items / 4 deselected / 5 selected

test_notebooks.py::test_notebook[katib-integration]
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running katib-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[8], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."
AssertionError: Katib Experiment was not successful.
FAILED                                                                   [ 20%]
test_notebooks.py::test_notebook[kfp-v1-integration]
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running kfp-v1-integration.ipynb...
PASSED                                                                   [ 40%]
test_notebooks.py::test_notebook[kfp-v2-integration]
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running kfp-v2-integration.ipynb...
PASSED                                                                   [ 60%]
test_notebooks.py::test_notebook[kserve-integration]
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running kserve-integration.ipynb...
PASSED                                                                   [ 80%]
test_notebooks.py::test_notebook[training-integration]
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
PASSED                                                                   [100%]
=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")

        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.

/tests/test_notebooks.py:59: Failed
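
Since the test wrapper persists the executed notebook for debugging, the recorded error outputs can be listed directly from the saved file. A minimal sketch, assuming the standard nbformat v4 JSON layout (cells, outputs, output_type, ename, evalue); the function name error_outputs and the in-memory sample are hypothetical stand-ins, not taken from the run.

```python
def error_outputs(notebook):
    """Yield (cell_index, ename, evalue) for every error output in a notebook dict."""
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                yield index, output.get("ename"), output.get("evalue")


# Tiny in-memory stand-in for an executed notebook (nbformat v4 layout).
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "intro"},
        {"cell_type": "code", "metadata": {}, "outputs": []},
        {
            "cell_type": "code",
            "metadata": {"tags": ["raises-exception"]},
            "outputs": [
                {
                    "output_type": "error",
                    "ename": "AssertionError",
                    "evalue": "Katib Experiment was not successful.",
                    "traceback": [],
                }
            ],
        },
    ]
}

for index, ename, evalue in error_outputs(sample_notebook):
    print(f"cell {index}: {ename}: {evalue}")
```

For a real file, the notebook can be loaded with json.load and the resulting dict passed to the same function.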

Preliminary tests for beta

  1. Deployed the bundle with juju deploy kubeflow --channel 1.9/beta --trust
  2. Set public-url to http://dex-auth.kubeflow.svc:5556 for both dex-auth and oidc-gatekeeper
  3. Configured dex-auth's static-username and static-password
  4. Waited for about 10 minutes and checked the status of the model:
ubuntu@ip-172-31-15-25:~$ juju status
Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s-343    microk8s/localhost  3.4.4    unsupported  20:23:35Z

App                        Version                  Status  Scale  Charm                    Channel       Rev  Address         Exposed  Message
admission-webhook                                   active      1  admission-webhook        latest/beta   328  10.152.183.124  no
argo-controller                                     active      1  argo-controller          latest/beta   526  10.152.183.183  no
dex-auth                                            active      1  dex-auth                 latest/beta   507  10.152.183.141  no
envoy                                               active      1  envoy                    latest/beta   231  10.152.183.126  no
istio-ingressgateway                                active      1  istio-gateway            latest/beta  1048  10.152.183.69   no
istio-pilot                                         active      1  istio-pilot              latest/beta  1013  10.152.183.23   no
jupyter-controller                                  active      1  jupyter-controller       latest/beta  1002  10.152.183.175  no
jupyter-ui                                          active      1  jupyter-ui               latest/beta   925  10.152.183.41   no
katib-controller                                    active      1  katib-controller         latest/beta   690  10.152.183.35   no
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable    153  10.152.183.129  no
katib-db-manager                                    active      1  katib-db-manager         latest/beta   653  10.152.183.50   no
katib-ui                                            active      1  katib-ui                 latest/beta   657  10.152.183.217  no
kfp-api                                             active      1  kfp-api                  latest/beta  1466  10.152.183.91   no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable    153  10.152.183.80   no
kfp-metadata-writer                                 active      1  kfp-metadata-writer      latest/beta   524  10.152.183.59   no
kfp-persistence                                     active      1  kfp-persistence          latest/beta  1473  10.152.183.42   no
kfp-profile-controller                              active      1  kfp-profile-controller   latest/beta  1431  10.152.183.130  no
kfp-schedwf                                         active      1  kfp-schedwf              latest/beta  1484  10.152.183.180  no
kfp-ui                                              active      1  kfp-ui                   latest/beta  1467  10.152.183.229  no
kfp-viewer                                          active      1  kfp-viewer               latest/beta  1499  10.152.183.77   no
kfp-viz                                             active      1  kfp-viz                  latest/beta  1417  10.152.183.219  no
knative-eventing                                    active      1  knative-eventing         latest/beta   441  10.152.183.111  no
knative-operator                                    active      1  knative-operator         latest/beta   416  10.152.183.134  no
knative-serving                                     active      1  knative-serving          latest/beta   442  10.152.183.75   no
kserve-controller                                   active      1  kserve-controller        latest/beta   397  10.152.183.132  no
kubeflow-dashboard                                  active      1  kubeflow-dashboard       latest/beta   600  10.152.183.32   no
kubeflow-profiles                                   active      1  kubeflow-profiles        latest/beta   393  10.152.183.221  no
kubeflow-roles                                      active      1  kubeflow-roles           latest/beta   225  10.152.183.150  no
kubeflow-volumes                                    active      1  kubeflow-volumes         latest/beta   314  10.152.183.28   no
metacontroller-operator                             active      1  metacontroller-operator  latest/beta   280  10.152.183.61   no
minio                      res:oci-image@5102166    active      1  minio                    latest/beta   334  10.152.183.21   no
mlmd                                                active      1  mlmd                     latest/beta   201  10.152.183.197  no
oidc-gatekeeper                                     active      1  oidc-gatekeeper          latest/beta   396  10.152.183.43   no
pvcviewer-operator                                  active      1  pvcviewer-operator       latest/beta   118  10.152.183.253  no
seldon-controller-manager                           active      1  seldon-core              latest/beta   691  10.152.183.236  no
tensorboard-controller                              active      1  tensorboard-controller   latest/beta   307  10.152.183.18   no
tensorboards-web-app                                active      1  tensorboards-web-app     latest/beta   295  10.152.183.211  no
training-operator                                   active      1  training-operator        latest/beta   483  10.152.183.215  no

  5. Tried logging into the dashboard through the LB.
  6. Was able to log in and navigate the dashboard (all components seem to be working).
  7. Created a notebook, connected to it, and used it: it works.
  8. Created a Pipelines experiment, a run, and a recurring run: it works.
  9. Opened the volumes page and, using the PVC viewer, was able to navigate directories.
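
The fixed 10-minute wait in step 4 could be replaced by a check that every application reports active. The sketch below works on the parsed output of juju status --format=json and assumes each entry under applications carries an application-status object with a current field; treat those key names as an assumption to verify against real output. The function name not_active and the sample data are hypothetical.

```python
def not_active(status):
    """Return {application: status} for applications that are not 'active'."""
    pending = {}
    for name, app in status.get("applications", {}).items():
        current = app.get("application-status", {}).get("current", "unknown")
        if current != "active":
            pending[name] = current
    return pending


# Hypothetical sample shaped like the assumed JSON layout (not from the run above).
sample_status = {
    "applications": {
        "app-a": {"application-status": {"current": "active"}},
        "app-b": {"application-status": {"current": "waiting"}},
    }
}

print(not_active(sample_status))  # {'app-b': 'waiting'}
```

An empty result means every application is active, which matches the state shown in the status table for this deployment.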

Preliminary manual tests indicate that the 1.9/beta bundle works as expected.