canonical / data-science-stack

Stack with machine learning tools needed for local development.
Apache License 2.0

Cannot create Notebook from AWS instance #151

Closed andreeamun closed 3 months ago

andreeamun commented 3 months ago

Bug Description

I created an instance on AWS and followed the tutorial. It all worked well, but when I tried to create a notebook, I received an error.

Command I ran: dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0

Error I got after a long wait (more than 5 min): [ERROR] Failed to create notebook my-tensorflow-notebook: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for '172.31.42.205'. (_ssl.c:1007).

However, I then saw that the Notebook had been created. Does that mean it works as expected, and if so, why did we receive the error?

To Reproduce

Follow the tutorial published: https://documentation.ubuntu.com/data-science-stack/en/latest/tutorial/getting-started/

Environment

AWS t3.medium instance with DSS deployed

Relevant Log Output

I ran sudo microk8s.kubectl get pods -n dss and it gave me the output below. The notebooks seem stuck on

Screenshot 2024-07-24 at 00 42 01

Additional Context

n/a

syncronize-issues-to-jira[bot] commented 3 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6045.

This message was autogenerated

kimwnasptd commented 3 months ago

After inspecting the VM a bit more, we saw that MicroK8s was down:

sudo microk8s.kubectl get pods
E0724 08:29:37.769858  730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.770393  730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.771855  730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.772197  730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
E0724 08:29:37.773290  730501 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": dial tcp 127.0.0.1:16443: connect: connection refused
The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?

And also:

sudo snap services microk8s
Service                           Startup  Current   Notes
microk8s.daemon-apiserver-kicker  enabled  active    -
microk8s.daemon-apiserver-proxy   enabled  inactive  -
microk8s.daemon-cluster-agent     enabled  active    -
microk8s.daemon-containerd        enabled  active    -
microk8s.daemon-etcd              enabled  inactive  -
microk8s.daemon-flanneld          enabled  inactive  -
microk8s.daemon-k8s-dqlite        enabled  active    -
microk8s.daemon-kubelite          enabled  inactive  -
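
For anyone who wants to check this from a script, here is a minimal Python sketch that picks the inactive services out of the `snap services` table above. The function name `inactive_services` and the `SAMPLE` constant are made up for illustration, and which inactive services actually matter depends on how MicroK8s is configured, so treat the result as a hint rather than a diagnosis.

```python
def inactive_services(snap_services_output: str) -> list[str]:
    """Return the names of services whose Current column is 'inactive'."""
    inactive = []
    for line in snap_services_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any row with fewer than three columns.
        if len(fields) < 3 or fields[0] == "Service":
            continue
        name, current = fields[0], fields[2]
        if current == "inactive":
            inactive.append(name)
    return inactive


# Output copied from the `sudo snap services microk8s` run above.
SAMPLE = """\
Service                           Startup  Current   Notes
microk8s.daemon-apiserver-kicker  enabled  active    -
microk8s.daemon-apiserver-proxy   enabled  inactive  -
microk8s.daemon-cluster-agent     enabled  active    -
microk8s.daemon-containerd        enabled  active    -
microk8s.daemon-etcd              enabled  inactive  -
microk8s.daemon-flanneld          enabled  inactive  -
microk8s.daemon-k8s-dqlite        enabled  active    -
microk8s.daemon-kubelite          enabled  inactive  -
"""

for service in inactive_services(SAMPLE):
    print(service)
```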

kimwnasptd commented 3 months ago

However, the messages from the CLI are not helpful in this case. Some examples:

dss status
[ERROR] Failed to retrieve status: [Errno 111] Connection refused.
dss list
[ERROR] Failed to list notebooks: [Errno 111] Connection refused.

We will need to improve the CLI so that it gives better error messages when it can't talk to MicroK8s.
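
As a rough idea of what a friendlier message could look like, here is a minimal Python sketch. It is not the DSS implementation: `describe_cluster_error` is a hypothetical helper, and it assumes the failure reaches the CLI as an `OSError` with `ECONNREFUSED`, whereas the real Kubernetes client may wrap it in its own exception type.

```python
import errno


def describe_cluster_error(action: str, exc: Exception) -> str:
    """Turn a low-level connection failure into a more actionable message.

    Hypothetical helper for illustration only; not part of the dss CLI.
    """
    # Assumption: the refusal surfaces as an OSError carrying ECONNREFUSED.
    refused = isinstance(exc, OSError) and exc.errno == errno.ECONNREFUSED
    if refused:
        return (
            f"[ERROR] Failed to {action}: could not reach the Kubernetes API "
            "server (connection refused). MicroK8s may be stopped or unhealthy. "
            "Check `sudo snap services microk8s` and `sudo microk8s inspect`."
        )
    # Anything else keeps the current message format.
    return f"[ERROR] Failed to {action}: {exc}."


try:
    # Simulate the failure seen in `dss list` above.
    raise ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")
except OSError as exc:
    print(describe_cluster_error("list notebooks", exc))
```

The point is only that the message names the likely cause and the next command to run, instead of the bare `[Errno 111]`.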

kimwnasptd commented 3 months ago

So after some further inspection with microk8s inspect, we saw the following:

w9rn5_dss(849505fe-024c-48fd-85b2-743fbbb2ef9c): ErrImagePull: failed to pull and unpack image "docker.io/kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0": failed to copy: write /var/snap/microk8s/common/var/lib/containerd/io.containerd.content.v1.content/ingest/6a9341c65a7370df374dfad89239a94a207d1b40abfa04209649783692e23b93/data: no space left on device

So the issue in this case is that the machine ran out of disk space. Closing this, although we should work on improving the CLI messages when it can't connect to MicroK8s.
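
A disk-space check before pulling a large image would catch this earlier. Here is a minimal Python sketch using the standard library's `shutil.disk_usage`. The helper `has_free_space` and the 20 GiB threshold are assumptions picked for illustration, not measured requirements for this image, and the path to check would need to be the filesystem that holds the containerd data (under /var/snap/microk8s/common in the log above).

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Hypothetical threshold; the real requirement depends on the image being pulled.
REQUIRED = 20 * GIB
# "." is used so the sketch runs anywhere; point it at the containerd volume in practice.
CHECK_PATH = "."

if not has_free_space(CHECK_PATH, REQUIRED):
    free_gib = shutil.disk_usage(CHECK_PATH).free / GIB
    print(f"[WARNING] Only {free_gib:.1f} GiB free; the image pull may fail.")
```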

beliaev-maksim commented 3 months ago

@kimwnasptd I think the tutorial should be updated to recommend a reasonable disk allocation.
