Closed orfeas-k closed 1 year ago
@orfeas-k It is best to resolve this with an online debugging session on the VM that handles the self-hosted runners. Please ping the IS DevOps team to set it up.
We set up an online session with the IS DevOps team, where we inspected the cluster to check why the notebooks were not being created as expected.
Running kubectl describe statefulset test-notebook (no pod had been created yet), we noticed the following in the output events.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 5m10s (x17 over 10m) statefulset-controller create Pod test-notebook-0 in StatefulSet test-notebook failed error: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.kubeflow.svc:443/inject?timeout=10s": Forbidden
Trying to get logs for jupyter-controller, we bumped into an issue when attempting to fetch logs from the running pods. More specifically, all requests failed with a Forbidden (403) error. This puzzled us because we could still access the K8s API server (e.g. list pods/nodes).
ubuntu@two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263:~$ microk8s kubectl logs sample-notebook-0
Error from server: Get "https://two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263:10250/containerLogs/default/sample-notebook-0/notebook": Forbidden
We believe that this is also related to the reason the notebook is never created. To debug this, we tried adding the node's IP to the no_proxy env variable, but that didn't change anything. Talking later to the MicroK8s team, we learned that this is probably because the apiserver tries to talk to the kubelet (port 10250) to get the logs, but goes through the proxy because of the global proxy configs. The solution (at least to the kubectl logs problem) would be to:
- add two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263 to the no_proxy var
- set http_proxy, https_proxy, etc. in /var/snap/microk8s/current/args/containerd-env instead of /etc/environment.
We'll try the above in the second session tomorrow.
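A minimal sketch of adding an entry to no_proxy in an environment-style file, for illustration only. The helper name add_no_proxy_entry is hypothetical, and the sketch assumes one unquoted KEY=value pair per line with a comma-separated no_proxy value. It is meant to be tried on a copy of the file first, not as the exact procedure used on the runners.

```shell
# Hypothetical helper: add ENTRY to the no_proxy= line of FILE unless it is
# already listed. Assumes unquoted values and a comma-separated list.
add_no_proxy_entry() {
    file=$1
    entry=$2

    # Current value of the first no_proxy= line (empty if there is none).
    current=$(sed -n 's/^no_proxy=//p' "$file" | head -n 1)

    # Nothing to do when the entry is already in the list.
    case ",$current," in
        *",$entry,"*) return 0 ;;
    esac

    # Append to the existing line, or add a new line if none exists.
    awk -v e="$entry" '
        /^no_proxy=/ && !seen {
            seen = 1
            if ($0 == "no_proxy=") { $0 = $0 e } else { $0 = $0 "," e }
        }
        { print }
        END { if (!seen) print "no_proxy=" e }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

For example, add_no_proxy_entry ./environment.copy two-xlarge-38-d088fc92-5f6b-4195-b515-78d4dd627263 would extend the list in a local copy. Note that mv replaces the file, so ownership and permissions should be checked before copying the result back.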
During our second session, we SSHed into the self-hosted runner again and these were our findings:
- We added the .svc value to the no_proxy environment variable.
- Changing the no_proxy environment variable does not mean running export no_proxy=... in the shell. We need to modify the environment variables in the place from which they propagate to MicroK8s (in this case /etc/environment) and then restart MicroK8s (we used sudo snap restart microk8s.daemon-kubelite to avoid restarting and waiting for the whole cluster). In the spirit of documentation, we used the following commands to check the environment variables propagated to MicroK8s:
ps -fea | grep kubelite
cat /proc/<pid>/environ | tr '\0' '\n'
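The two commands above can be narrowed down to just the proxy settings. The following is a small sketch, assuming a Linux /proc filesystem; the helper name show_proxy_env is hypothetical, and it takes the path to a NUL-separated environ file so it can also be tried on a saved copy.

```shell
# Hypothetical helper: print only the proxy-related variables from a
# NUL-separated environ file such as /proc/<pid>/environ.
show_proxy_env() {
    # Turn NUL separators into newlines, then keep *_proxy variables
    # (case-insensitive). "|| true" keeps the exit status 0 on no match.
    tr '\0' '\n' < "$1" | grep -i -E '^(no|https?|ftp|all)_proxy=' || true
}

# Example (reading another process's environ usually needs root):
#   show_proxy_env "/proc/<pid>/environ"
```

Empty output means the process was started without any proxy variables, which is a quick way to confirm whether a change in the source file reached the restarted service.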
The Forbidden errors happened because MicroK8s was trying to talk to the kubelet through the proxy, and the proxy would not allow it.
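To reason about which hosts skip the proxy, here is a rough sketch of a no_proxy check. The helper name bypasses_proxy is hypothetical and the matching rule is a simplification (an assumption, not the exact rule of any particular client): an entry matches the exact host name or any name ending in ".<entry>", and a leading dot on the entry is ignored. CIDR ranges, ports, wildcards and spaces around entries are not handled.

```shell
# Hypothetical helper: exit 0 if HOST is covered by the comma-separated
# NO_PROXY list, 1 otherwise. Runs in a subshell so IFS and globbing
# changes do not leak into the caller.
bypasses_proxy() (
    host=$1
    IFS=,
    set -f                      # no filename expansion while splitting
    for entry in $2; do
        entry=${entry#.}        # ".svc" and "svc" are treated the same here
        [ -n "$entry" ] || continue
        case "$host" in
            "$entry" | *."$entry") exit 0 ;;
        esac
    done
    exit 1
)
```

Under this simplified rule, bypasses_proxy istiod.kubeflow.svc "localhost,127.0.0.1" fails, while the same call with ".svc" added to the list succeeds, which is consistent with the webhook call to istiod.kubeflow.svc being sent through the proxy before the change.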
Regarding the Selenium tests and notebook creation: with the above issue resolved, creating a notebook in the admin namespace through the CLI also worked (before, it didn't). Now we're waiting for a self-hosted runner with those values in its default no_proxy variable to be spun up, so that we can run our tests there and verify that they run error-free.
Once we confirm this, we can expect those changes to be implemented in all self-hosted runners in the next week or two.
It looks like this resolved the issue: the Selenium tests managed to create the notebook pod. The tests may still have failed, but the issue described here has been resolved.
Issue
Selenium tests for the 1.7 release fail to create a notebook when running on self-hosted runners. Here are two failed runs: 1 and 2. Both of the runs failed with
Line 132 is waiting for the CONNECT button of the notebook just created to be clickable. Here is the screenshot Selenium took on failure:
Cause
It seems like the failure is that the notebook pod is not being created. Running the same tests locally and in an AWS EC2 instance (type t3.2xlarge), the notebook is created normally and the tests pass successfully, so this looks like an issue with the cluster.
Reproduce
Run the full bundle tests including the Selenium tests. We'll remove them from the tests workflow (full-bundle-tests.yaml) as a temporary workaround, so make sure to use a branch checked out from before the commit that removes them.
Solution
In order to debug this, we’ll need SSH access to the self-hosted test-runner machine, but this has not been granted yet, so we’re blocked for now.