andreeamun closed this issue 3 months ago
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6045.
This message was autogenerated
After inspecting the VM a bit more, we saw that MicroK8s was down:
sudo microk8s.kubectl get pods
E0724 08:29:37.769858 730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.770393 730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.771855 730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.772197 730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.773290 730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?
The snap services output also shows that several MicroK8s daemons are inactive:
sudo snap services microk8s
Service                           Startup  Current   Notes
microk8s.daemon-apiserver-kicker  enabled  active    -
microk8s.daemon-apiserver-proxy   enabled  inactive  -
microk8s.daemon-cluster-agent     enabled  active    -
microk8s.daemon-containerd        enabled  active    -
microk8s.daemon-etcd              enabled  inactive  -
microk8s.daemon-flanneld          enabled  inactive  -
microk8s.daemon-k8s-dqlite        enabled  active    -
microk8s.daemon-kubelite          enabled  inactive  -
However, the error messages from the CLI are not helpful in this case. Some examples:
dss status
[ERROR] Failed to retrieve status: [Errno 111] Connection refused.
dss list
[ERROR] Failed to list notebooks: [Errno 111] Connection refused.
We will need to improve the CLI so that it gives better error messages when it can't talk to MicroK8s.
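As a rough illustration of what a clearer message could look like, here is a minimal sketch. It is not the actual DSS implementation: `is_connection_refused` and `friendly_error` are hypothetical helpers, and the sketch assumes the client surfaces the failure either as an `OSError` with `ECONNREFUSED` somewhere in the exception chain or as an exception whose text contains "Connection refused" (as in the messages above). The suggested follow-up commands are the ones used elsewhere in this thread.

```python
import errno


def is_connection_refused(exc: BaseException) -> bool:
    """Return True if exc, or anything in its cause chain, looks like ECONNREFUSED."""
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if isinstance(current, OSError) and current.errno == errno.ECONNREFUSED:
            return True
        # Heuristic fallback (assumption): some clients wrap the socket error
        # and only keep its text, e.g. "[Errno 111] Connection refused".
        if "Connection refused" in str(current):
            return True
        current = current.__cause__ or current.__context__
    return False


def friendly_error(action: str, exc: BaseException) -> str:
    """Build a user-facing message for a failed cluster call (hypothetical helper)."""
    if is_connection_refused(exc):
        return (
            f"Failed to {action}: could not connect to the Kubernetes API server. "
            "MicroK8s may be stopped or unhealthy. "
            "Check it with `sudo snap services microk8s` and `sudo microk8s inspect`."
        )
    # Anything else is passed through unchanged.
    return f"Failed to {action}: {exc}"


if __name__ == "__main__":
    err = ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")
    print(friendly_error("list notebooks", err))
```

The point of the design is only that the message names the likely cause (MicroK8s being down) and a next step, instead of echoing the raw socket error.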
After some further inspection with microk8s inspect, we saw the following:
w9rn5_dss(849505fe-024c-48fd-85b2-743fbbb2ef9c): ErrImagePull: failed to pull and unpack image "docker.io/kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0": failed to copy: write /var/snap/microk8s/common/var/lib/containerd/io.containerd.content.v1.content/ingest/6a9341c65a7370df374dfad89239a94a207d1b40abfa04209649783692e23b93/data: no space left on device
So the issue in this case is that the machine ran out of disk space. Closing this, although we should work on improving the CLI messages when it can't connect to MicroK8s.
@kimwnasptd I think the tutorial needs to be updated to recommend a reasonable disk allocation.
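Until the tutorial states a recommended disk size, a quick free-space check before pulling large notebook images can catch this earlier. The sketch below is only an illustration: `has_free_space` is a hypothetical helper, the path is taken from the containerd error above, and the 20 GiB threshold is an arbitrary assumption, not a documented DSS or MicroK8s requirement.

```python
import shutil


def has_free_space(path: str, min_free_gib: float) -> bool:
    """Return True if the filesystem holding `path` has at least min_free_gib free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_gib * 1024**3


if __name__ == "__main__":
    # Path from the error message above; the threshold is an assumed value.
    microk8s_data = "/var/snap/microk8s/common"
    try:
        ok = has_free_space(microk8s_data, min_free_gib=20)
    except FileNotFoundError:
        print(f"{microk8s_data} not found; is MicroK8s installed?")
    else:
        print("Enough free space" if ok else "Low disk space: image pulls may fail")
```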
Bug Description
I created an instance on AWS and followed the tutorial. It all worked well, but when I tried to create a notebook, I received an error.
Command I ran:
dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0
Error I got after a long wait (more than 5 minutes):
[ERROR] Failed to create notebook my-tensorflow-notebook: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for '172.31.42.205'. (_ssl.c:1007).
However, I then saw that the notebook had been created. Does this mean it works as expected, and if so, why did we receive the error?
To Reproduce
Follow the published tutorial: https://documentation.ubuntu.com/data-science-stack/en/latest/tutorial/getting-started/
Environment
AWS t3.medium instance with DSS deployed
Relevant Log Output
I ran
sudo microk8s.kubectl get pods -n dss
and it gave me the output below. The notebooks seem stuck on
Additional Context
n/a