canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Selenium tests fail to create a notebook in self-hosted runners #671

Closed orfeas-k closed 1 year ago

orfeas-k commented 1 year ago

Issue

Selenium tests for the 1.7 release fail to create a notebook when run on self-hosted runners. Here are two failed runs: 1 and 2. Both runs failed with:

  File "/home/ubuntu/github-runner/_work/bundle-kubeflow/bundle-kubeflow/tests-bundle/1.7/test_tutorial.py", line 132, in test_create_notebook
    WebDriverWait(driver, 800).until(
  File "/home/ubuntu/github-runner/_work/bundle-kubeflow/bundle-kubeflow/.tox/full_bundle_tests/lib/python3.10/site-packages/selenium/webdriver/support/wait.py", line 95, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 

Line 132 waits for the CONNECT button of the newly created notebook to become clickable.
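The empty "Message:" in the traceback is what Selenium prints when until() is called without a message. Below is a minimal sketch of the same kind of wait with a descriptive message attached. The helper names, the locator argument and the wording of the message are assumptions for illustration, not the code in test_tutorial.py.

```python
def timeout_message(notebook_name: str, timeout_s: int) -> str:
    """Build the text shown in the TimeoutException (wording is illustrative)."""
    return (
        f"CONNECT button for notebook '{notebook_name}' was not clickable "
        f"after {timeout_s}s; check whether the notebook pod was created"
    )


def wait_for_connect_button(driver, locator, notebook_name, timeout_s=800):
    """Wait until the element at `locator` is clickable and return it.

    `locator` is a (By.<strategy>, value) tuple supplied by the caller; the
    real locator used by the test suite is not reproduced here.
    """
    # Imported lazily so timeout_message() can be used without Selenium installed.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout_s).until(
        EC.element_to_be_clickable(locator),
        message=timeout_message(notebook_name, timeout_s),
    )
```

With a message in place, a timeout in CI states what was being waited for instead of printing a blank line.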

Here is the screenshot Selenium took on failure:

[Screenshot: test_create_notebook_08-17_12-22]

Cause

The failure appears to be that the notebook pod is never created. When running the same tests locally and on an AWS EC2 instance (type t3.2xlarge), the notebook is created normally and the tests pass, so this looks like an issue with the cluster on the self-hosted runners.

Reproduce

Run the full bundle tests, including the Selenium tests. We'll remove them from the tests workflow (full-bundle-tests.yaml) as a temporary workaround, so make sure to use a branch that checks out a commit from before the one removing them.

Solution

To debug this, we'll need SSH access to the self-hosted test-runner machine, but this has not been granted yet, so we're blocked for now.

i-chvets commented 1 year ago

@orfeas-k It is best to resolve this with an online debugging session on the VM that handles the self-hosted runners. Please ping the IS DevOps team to set it up.

orfeas-k commented 1 year ago

We set up an online session with the IS DevOps team, where we inspected the cluster to check why the notebooks are not created as expected. Running kubectl describe statefulset test-notebook (no pod had been created yet), we noticed the following in the output events:

Events:
  Type     Reason        Age                   From                    Message
  ----     ------        ----                  ----                    -------
  Warning  FailedCreate  5m10s (x17 over 10m)  statefulset-controller  create Pod test-notebook-0 in StatefulSet test-notebook failed error: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.kubeflow.svc:443/inject?timeout=10s": Forbidden

While trying to get logs for jupyter-controller, we ran into another issue: fetching logs from running pods fails. More specifically, all such requests fail with a Forbidden (403) error. This puzzled us because we can otherwise access the K8s API server (e.g. list pods/nodes).

ubuntu@two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263:~$ microk8s kubectl logs sample-notebook-0
Error from server: Get "https://two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263:10250/containerLogs/default/sample-notebook-0/notebook": Forbidden
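Both errors name the URL that was refused, so pulling out the host and port shows at a glance which endpoints are affected. Here is a small sketch that does this for the two messages quoted in this issue; the function name and the regular expression are my own and only cover this message shape.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# Matches the quoted URL that precedes ": Forbidden" in both error messages.
FORBIDDEN_URL = re.compile(r'(?:Get|Post) "([^"]+)": Forbidden')


def forbidden_endpoints(text: str) -> List[Tuple[Optional[str], Optional[int]]]:
    """Return (host, port) for every URL reported as Forbidden in `text`."""
    endpoints = []
    for url in FORBIDDEN_URL.findall(text):
        parts = urlsplit(url)
        endpoints.append((parts.hostname, parts.port))
    return endpoints


errors = """
failed to call webhook: Post "https://istiod.kubeflow.svc:443/inject?timeout=10s": Forbidden
Error from server: Get "https://two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263:10250/containerLogs/default/sample-notebook-0/notebook": Forbidden
"""

for host, port in forbidden_endpoints(errors):
    print(host, port)
```

One refused request targets an in-cluster service (istiod) and the other targets the node itself on the kubelet port, which suggests the refusal comes from something sitting between cluster components rather than from either endpoint.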

We believe this is also related to why the notebook is never created. To debug it, we tried adding the node's IP to the no_proxy env variable, but that didn't change anything. Talking later to the MicroK8s team, we learned that this is probably because the apiserver tries to talk to kubelet (port 10250) to get the logs, but goes through the proxy because of the global proxy configs. The solution would be (at least to the kubectl logs problem) to:

We'll try the above in the second session tomorrow.

orfeas-k commented 1 year ago

During our second session, we SSHed into the self-hosted runner again and these were our findings:

The above happened because MicroK8s was trying to talk to Kubelet through the proxy and the proxy would not allow it.
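To make the proxy explanation concrete, here is a simplified sketch of how a no_proxy list decides whether a request to a given host skips the proxy. It only does exact and domain-suffix matching. Real clients differ in details and may also handle things like CIDR ranges or ports, so this is not the logic MicroK8s itself runs. The no_proxy strings below are hypothetical examples, not the values that were applied to the runners.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Simplified no_proxy check: exact host match or domain-suffix match."""
    host = host.lower()
    for entry in no_proxy.split(","):
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*" or host == entry or host.endswith("." + entry):
            return True
    return False


# Node hostname taken from the kubectl logs error above.
node = "two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263"

# Hypothetical no_proxy values, for illustration only.
print(bypasses_proxy(node, "localhost,127.0.0.1"))            # False
print(bypasses_proxy(node, "localhost,127.0.0.1," + node))    # True
print(bypasses_proxy("istiod.kubeflow.svc", ".svc"))          # True
```

In the first case the request to the node would be sent to the proxy, which matches the behaviour described above.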

orfeas-k commented 1 year ago

Regarding the Selenium tests and notebook creation: with the above issue resolved, creating a notebook in the admin namespace through the CLI also worked (previously it didn't). Now we're waiting for a self-hosted runner with those values in its default no_proxy variable to be spun up, so we can run our tests there and verify that they run error-free.

Once we confirm this, we can expect those changes to be implemented in all self-hosted runners within the next week or two.

orfeas-k commented 1 year ago

It looks like this resolved the issue: the Selenium tests managed to create the notebook pod. The tests may still have failed, but the issue described here has been resolved.