canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

microk8s.enable kubeflow hangs #88

Closed ycheng closed 4 years ago

ycheng commented 5 years ago

I installed microk8s via:

sudo snap install --classic microk8s --channel=1.15/edge/kubeflow

I got rev 802 of microk8s. Running "microk8s.enable kubeflow" hangs for more than 10 minutes with the following output:


Fetching latest public cloud list...
Updated your list of public clouds with 8 cloud regions added:
    added cloud region:
        - aws-gov/us-gov-east-1
        - aws/ap-east-1
        - aws/ap-northeast-3
        - aws/eu-north-1
        - aws/me-south-1
        - azure/francesouth
        - azure/southafricanorth
        - azure/southafricawest
Creating Juju controller "uk8s" on microk8s/localhost
Creating k8s resources for controller "controller-uk8s"

knkski commented 5 years ago

Hmm, not sure what might be causing this issue. I don't think we currently have a way of passing the --debug flag down to the Juju commands from microk8s.enable kubeflow, which would help debug this (though @ktsakalozos may correct me there), so I'll need to work on getting that in there. In the meantime, can you try running juju --debug deploy kubeflow, to see if that outputs anything useful?

ycheng commented 5 years ago

I retested today, and now microk8s.enable kubeflow finishes running.

However, when I try to create a notebook, creation fails.

BTW, kubeflow 0.6.2 is released, and 1.15/edge/kubeflow currently still uses kubeflow v0.5. The steps in https://ubuntu.com/kubeflow/install properly install kubeflow 0.6.

knkski commented 5 years ago

@ycheng: Can you list the steps you took to create the notebook, and how it failed? Additionally, can you attach output from these commands?

microk8s.kubectl logs --tail 1000 --all-containers -l juju-app=jupyter-controller
microk8s.kubectl logs --tail 1000 --all-containers -l juju-app=jupyter-web

ycheng commented 5 years ago

@knkski:

Installation:
snap core: r7396
microk8s: v1.15.3, r802, channel: 1.15/edge/kubeflow

Steps:
1. microk8s.reset
2. microk8s.enable kubeflow
3. microk8s.kubectl get po -n kubeflow => make sure all pods are Running.
4. kubectl get svc -n kubeflow | grep ambassador => get the IP of ambassador, then open http://ip/ in a browser to reach the main UI.
5. Choose Notebooks from the left-side menu.
6. Click New Server.
7. Fill in the server name, nothing else, and click "Spawn" at the bottom of the page.
8. The newly created server shows "No Status Available".

Both commands ("microk8s.kubectl logs ....") output nothing.

So I went to /var/log/pods/kubeflow_jupyter-controller-operator-0_d4811256-5f33-4613-9068-4792c179c3ae/juju-operator/ and took 0.log as jupyter-controller.log, and to /var/log/pods/kubeflow_jupyter-web-7979d96ff9-2z58r_c39c2161-4c9f-4aac-b933-4d560bbfc978/jupyterhub/ and took 0.log as jupyter-web.log.

jupyter-web.log jupyter-controller.log 2019-09-20 21-01-52_screenshot
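
The "make sure all pods are Running" step above can be checked mechanically rather than by eye. A minimal sketch, assuming the default column layout of kubectl get po (NAME, READY, STATUS, ...); the helper name not_running_pods is made up for illustration:

```shell
# Hypothetical helper: reads `kubectl get po --no-headers` output on stdin
# and prints the name of every pod whose STATUS column is neither
# Running nor Completed. Assumes STATUS is the third column.
not_running_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage (requires a running microk8s, so it is left commented out):
#   microk8s.kubectl get po -n kubeflow --no-headers | not_running_pods
```

An empty result means every pod reports Running or Completed.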

knkski commented 5 years ago

@ycheng: can you try switching microk8s to the 1.16/edge/kubeflow channel and trying again?

ycheng commented 5 years ago

@knkski, I just tried today. microk8s is r946. It needs a username and password to log in. Do you know what the default one is?

knkski commented 5 years ago

@ycheng: you can find the username / password to log into the kubeflow dashboard with these two commands:

juju config ambassador-auth username
juju config ambassador-auth password

knkski commented 5 years ago

@ycheng: Is this working for you then? Or is it still hanging when you run microk8s.enable kubeflow? If it is still an issue for you, can you switch to the latest version of microk8s edge with sudo snap switch microk8s --channel edge && sudo snap refresh microk8s, and then post the output from KUBEFLOW_DEBUG=true microk8s.enable kubeflow?

ycheng commented 5 years ago

@knkski I can log in now. When I try to create a notebook, it shows an error message:

Warning! notebooks.kubeflow.org is forbidden: User "system:serviceaccount:kubeflow:default" cannot list resource "notebooks" in API group "kubeflow.org" in the namespace "kubeflow"

knkski commented 5 years ago

@ycheng: Sorry about that. I've got a fix in the edge bundle, but in the meantime, you could try running microk8s.disable rbac, which should fix that issue.
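
If turning RBAC off entirely is undesirable, a narrower workaround might be to grant the denied permission directly. This is only a sketch and was not suggested in the thread: it assumes the single missing permission is "list" on "notebooks" in the "kubeflow.org" API group, exactly as the error message states, and the name notebook-lister is invented here. Further verbs may turn out to be needed once listing works.

```shell
# Sketch: grant the kubeflow "default" service account permission to list
# notebooks. KUBECTL can be overridden; it defaults to microk8s.kubectl.
# The role/binding name "notebook-lister" is a hypothetical choice.
grant_notebook_list() {
    kubectl_cmd=${KUBECTL:-microk8s.kubectl}
    "$kubectl_cmd" create role notebook-lister \
        --verb=list \
        --resource=notebooks.kubeflow.org \
        -n kubeflow &&
    "$kubectl_cmd" create rolebinding notebook-lister \
        --role=notebook-lister \
        --serviceaccount=kubeflow:default \
        -n kubeflow
}

# Usage (needs a cluster, so it is left commented out):
#   grant_notebook_list
```

Setting KUBECTL=echo prints the two commands instead of running them, which is a cheap way to review them first.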

ycheng commented 4 years ago

@knkski hi, I reinstalled microk8s from edge and got microk8s r1056 + core r8038.

microk8s.enable kubeflow failed; the log is attached.

microk8s-edge-1056.log

ktsakalozos commented 4 years ago

@ycheng we recently (yesterday) pushed a patch [1] to address this. Could you try reinstalling from edge?

[1] https://github.com/ubuntu/microk8s/pull/793

ycheng commented 4 years ago

microk8s r1071:

03:42:21 INFO juju.util.exec exec.go:209 run result: exit status 1
ERROR The microk8s user group is created during the microk8s snap installation. Users in that group are granted access to microk8s commands and this is needed for Juju to be able to interact with microk8s.

Add yourself to that group before trying again:
sudo usermod -a -G microk8s root

03:42:21 DEBUG cmd supercommand.go:519 error stack:
/build/juju/parts/juju/go/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:337: The microk8s user group is created during the microk8s snap installation. Users in that group are granted access to microk8s commands and this is needed for Juju to be able to interact with microk8s.

Add yourself to that group before trying again:
sudo usermod -a -G microk8s root

/build/juju/parts/juju/go/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:286:
/build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/commands/bootstrap.go:996:
/build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/commands/bootstrap.go:575:

Command '('microk8s-juju.wrapper', '--debug', 'bootstrap', 'microk8s', 'uk8s')' returned non-zero exit status 1
Failed to enable kubeflow
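
The group membership that the error message asks for can be checked before bootstrapping. A minimal sketch; has_group is a made-up helper name, and the usage swaps the root user from the error message for the current user:

```shell
# Hypothetical helper: succeeds if the space-separated group list in $1
# contains the group named in $2 as a whole word.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (modifies the system, so it is left commented out):
#   if ! has_group "$(id -nG)" microk8s; then
#       sudo usermod -a -G microk8s "$USER"
#   fi
```

New group membership only applies to sessions started after the change, so a fresh login is needed before retrying.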

ktsakalozos commented 4 years ago

@wallyworld has already a fix for this issue and it should be available soon.

ycheng commented 4 years ago

It seems microk8s r1077 still fails with the same error. Did it pass your tests?

ktsakalozos commented 4 years ago

@ycheng the error you see is from the juju client. For now, the microk8s.enable kubeflow addon uses the juju client from the snap edge channel (https://github.com/ubuntu/microk8s/blob/master/microk8s-resources/actions/enable.juju.sh#L13). @wallyworld may know more about when the fix will land there or whether we should be using a different channel. Thanks.

knkski commented 4 years ago

@ycheng: Are you still running into this issue?

ricpet commented 4 years ago

hi all, I am getting this error:

Revoked:false Label:admin Invalid:false InvalidReason:}]}
18:45:28 INFO  juju.util.exec exec.go:209 run result: exit status 1
ERROR microk8s:
  running: false

18:45:28 DEBUG cmd supercommand.go:519 error stack:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:384: microk8s:
  running: false

/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:349:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:286:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/cmd/juju/commands/bootstrap.go:996:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/cmd/juju/commands/bootstrap.go:575:

Command '('microk8s-juju.wrapper', '--debug', 'bootstrap', 'microk8s', 'uk8s')' returned non-zero exit status 1
Failed to enable kubeflow

anyone fixed this?

knkski commented 4 years ago

@ricpet: This may be a race condition. Can you try running microk8s.disable kubeflow; microk8s.enable kubeflow to see if you run into the same error?
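
If the failure really is a race, the disable/enable cycle may need more than one attempt. A minimal sketch of a retry wrapper; the function name, attempt count, and delay are arbitrary choices, not something provided by microk8s:

```shell
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# sleeping DELAY_SECONDS between attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Only sleep if another attempt is still to come.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Usage (needs microk8s, so it is left commented out):
#   retry 3 30 sh -c 'microk8s.disable kubeflow; microk8s.enable kubeflow'
```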

ricpet commented 4 years ago

@knkski thanks for your reply. I tried but nothing, still the same issue.

ricpet commented 4 years ago

I actually managed to fix that issue (there was some conflict with an old installation), however right now, I get this:

Kubeflow could not be enabled:
Creating Juju controller "uk8s" on microk8s/localhost
Creating k8s resources for controller "controller-uk8s"
ERROR failed to bootstrap model: creating controller stack for controller: creating statefulset for controller: timed out waiting for controller pod: pending:  -

Command '('microk8s-juju.wrapper', 'bootstrap', 'microk8s', 'uk8s')' returned non-zero exit status 1
Failed to enable kubeflow

knkski commented 4 years ago

@ricpet: What was the fix involved in the previous error you were running into?

I haven't seen this new error before. Can you try again with the KUBEFLOW_DEBUG=true environment variable set? Offhand, it looks like a networking issue, which you might have if you're running behind a proxy/firewall/etc.

ricpet commented 4 years ago

hi @knkski the error is actually the same as before:

19:09:21 INFO  juju.util.exec exec.go:209 run result: exit status 1
ERROR microk8s:
  running: false

19:09:21 DEBUG cmd supercommand.go:519 error stack:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:384: microk8s:
  running: false

/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:349:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/caas/kubernetes/provider/cloud.go:286:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/cmd/juju/commands/bootstrap.go:996:
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/cmd/juju/commands/bootstrap.go:575:

Command '('microk8s-juju.wrapper', '--debug', 'bootstrap', 'microk8s', 'uk8s')' returned non-zero exit status 1

ricpet commented 4 years ago

just to add more context... I am using Ubuntu 18.04 (desktop) and I installed microk8s following this link https://microk8s.io/ and kubeflow using (https://www.kubeflow.org/docs/other-guides/virtual-dev/getting-started-multipass/). When I enable kubeflow I get:

Enabling dns...
[sudo] password for USER:
Enabling storage...
Enabling dashboard...
Enabling ingress...
Enabling rbac...
Enabling juju...
Deploying Kubeflow...
Kubeflow could not be enabled:
ERROR microk8s:
  running: false

Command '('microk8s-juju.wrapper', 'bootstrap', 'microk8s', 'uk8s')' returned non-zero exit status 1
Failed to enable kubeflow

ktsakalozos commented 4 years ago

@ricpet it is possible that the machine you are deploying kubeflow on is running out of memory and the OS killed the apiserver while kubeflow was coming up. What are the specs of the machine (virtual or not) that MicroK8s runs on? The microk8s.inspect tarball has information we would need to debug this case. Thanks.

ricpet commented 4 years ago

hi @ktsakalozos thanks for your help. The specs of my machine are:

the output of microk8s.inspect is:

Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-flanneld is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-apiserver is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Service snap.microk8s.daemon-etcd is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

 WARNING:  IPtables FORWARD policy is DROP. Consider enabling traffic forwarding with: sudo iptables -P FORWARD ACCEPT
The change can be made persistent with: sudo apt-get install iptables-persistent
WARNING:  Docker is installed.
File "/etc/docker/daemon.json" does not exist.
You should create it and add the following lines:
{
    "insecure-registries" : ["localhost:32000"]
}
and then restart docker with: sudo systemctl restart docker
Building the report tarball
  Report tarball is at /var/snap/microk8s/1173/inspection-report-20200226_104633.tar.gz

knkski commented 4 years ago

@ricpet: Can you attach the tarball that microk8s.inspect generated? It looks like it put it at /var/snap/microk8s/1173/inspection-report-20200226_104633.tar.gz

ricpet commented 4 years ago

@knkski here you go https://www.dropbox.com/s/haby9tzpudx4l8m/inspection-report-20200226_104633.tar.gz?dl=0

mikejmills commented 4 years ago

Same issue here

Ubuntu 18

Revoked:false Label:admin Invalid:false InvalidReason:}]}
13:52:10 INFO juju.util.exec exec.go:209 run result: exit status 1
ERROR microk8s:
  running: false

knkski commented 4 years ago

@ricpet, @mikejmills: If you switch to the edge version of microk8s, it includes a fix for this error:

# If you don't have it installed
sudo snap install microk8s --classic --edge

# If you have it installed
sudo snap switch microk8s --channel edge
sudo snap refresh microk8s

Note that with the edge version, you'll have to use the edge kubeflow bundle, so you'll need to enable kubeflow like this:

KUBEFLOW_CHANNEL=edge microk8s.enable kubeflow

That requirement should disappear when microk8s 1.18 hits stable, which is targeted for this Thursday (March 26th).