canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Kubeflow Microk8s deployment issue #454

Closed gtrkiller closed 2 years ago

gtrkiller commented 2 years ago

Hello, I am using Ubuntu 20.04, MicroK8s 1.21 and the latest stable version of Juju. This installation is being done on my local machine.

Whenever I follow the Kubeflow quickstart tutorial (https://charmed-kubeflow.io/docs/quickstart), I can execute all the commands without any issues. The problem is that when I watch juju status, some charms never get deployed because of an error pulling their image. I have tried tinkering with many things, but the one thing that made a real difference was changing MicroK8s' DNS servers. The DNS server that has given me the fewest errors so far is Cloudflare, with Google DNS (for example) being a lot more problematic. I am attaching screenshots of my machine's resolv.conf, my environment file, an example pod description (they all have the same error) and juju status output (with Cloudflare and with Google DNS) so you can see the difference. Every Kubeflow installation I tried also had different charms failing, even with the same configuration. For example, the first Cloudflare juju status screenshot shows 4 charms with errors, but the second Cloudflare screenshot (a different installation from scratch) shows only two.

Resolv.conf and environment files:

Screenshot from 2022-04-22 11-22-58 Screenshot from 2022-04-22 11-23-13

Juju status (google DNS installation):

Screenshot from 2022-04-21 11-44-36

Juju status (1st cloudflare installation):

Screenshot from 2022-04-22 11-11-20

Juju status (2nd cloudflare installation):

Screenshot from 2022-04-21 18-55-14

Example describe pod screenshot:

Screenshot from 2022-04-22 11-15-31

All the juju status screenshots were taken once the installation had gotten stuck, after 120+ minutes. I have also tried configuring a proxy, but that didn't work. Note that the image pull back-off errors you will see in the screenshots are always triggered by a failed size verification (this can be seen in the describe pod screenshot). I have tried several installations of all 3 Kubeflow bundles, and they all have the same problem for me.
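Since the screenshots cannot be searched, here is a minimal Python sketch of one way to list the affected pods in text form. It assumes the pods live in a namespace named kubeflow (matching the Juju model name) and that `microk8s kubectl` is available. The helper names `pods_with_pull_errors` and `fetch_pods` are made up for this example. It only looks at the standard Kubernetes waiting reasons ErrImagePull and ImagePullBackOff, so it will not show the size verification detail from the pod events.

```python
import json
import subprocess

# Waiting reasons the kubelet reports when an image cannot be pulled.
PULL_ERROR_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def pods_with_pull_errors(pod_list):
    """Return (pod, container, image, reason) for each container stuck on a pull."""
    stuck = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})
        # Check init containers as well as regular containers.
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in PULL_ERROR_REASONS:
                stuck.append((pod_name, cs.get("name"), cs.get("image"), reason))
    return stuck


def fetch_pods(namespace="kubeflow"):
    """Assumption: MicroK8s is installed and the namespace is 'kubeflow'."""
    result = subprocess.run(
        ["microk8s", "kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    for pod, container, image, reason in pods_with_pull_errors(fetch_pods()):
        print(f"{pod}/{container}: {reason} ({image})")
```

Running this after each fresh installation would make it easier to compare which charms fail from one attempt to the next.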

Juju-crashdump does not detect any machines on the kubeflow model.

Result of mtr --report --tcp --port 443 registry.jujucharms.com command:

Screenshot from 2022-04-26 16-01-08
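For anyone who wants a text-only check alongside the mtr report, the following Python sketch times a few plain TCP connections to the registry on port 443. It is not equivalent to mtr (there is no per-hop tracing and no packet loss statistics), and the function name `tcp_connect_times` and its defaults are just choices made for this example. Each measurement includes the DNS lookup, so a slow or failing resolver shows up here too.

```python
import socket
import time


def tcp_connect_times(host, port=443, attempts=5, timeout=5.0):
    """Return connect times in seconds, one per attempt; None marks a failure."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection resolves the name and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            # Covers DNS failures, refused connections and timeouts.
            results.append(None)
    return results


if __name__ == "__main__":
    times = tcp_connect_times("registry.jujucharms.com")
    ok = [t for t in times if t is not None]
    print(f"{len(ok)}/{len(times)} connections succeeded")
    if ok:
        print(f"average connect time: {sum(ok) / len(ok) * 1000:.1f} ms")
```

Running it once per DNS configuration gives a rough way to compare them, although a successful connection does not guarantee that an image pull will succeed.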

gtrkiller commented 2 years ago

Contents of the juju crashdump: https://drive.google.com/file/d/1EFVbF9xyoxfhoyOgE0ROD6Ki01ap1E9H/view?usp=sharing

ca-scribner commented 2 years ago

Transferred this to bundle-kubeflow, but I don't know if this is the right place either. This feels more general than Kubeflow, but I'm not sure where to file it.

ca-scribner commented 2 years ago

We believe this issue has been fixed on the image repo server side, so I'm closing this. But if this comes up again, please reopen the issue. Thanks!