canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Cannot connect to dashboard (403: forbidden) after deployment #383

Closed ca-scribner closed 3 years ago

ca-scribner commented 3 years ago

After running `microk8s enable kubeflow` to deploy the bundle on a local machine (I tried both the full and lite bundles), the Kubeflow web portal was inaccessible, returning a 403 (forbidden), and some Juju applications weren't working as expected. It looks like both microk8s and Kubeflow bundle problems are involved. I'll open a separate issue for the microk8s-specific ones, but I think I worked around them and only the Kubeflow bundle problems remain.

During the deployment, I hit a few snags:

  1. When deploying istio-ingressgateway, the app was stuck waiting with the status `Waiting for Istio Pilot information`. I resolved this (I think) by following step 7 here and then waiting ~10 minutes.

  2. When all applications were up, microk8s could not identify the hostname automatically, stating `WARNING: Unable to determine hostname, defaulting to localhost` and `The dashboard is available at http://localhost`. I manually retrieved the istio-ingressgateway IP and tried logging into the dashboard there (INGRESSIP.nip.io), but received a 403 error.

  3. After doing step 1, I left the deployment overnight to settle. When I returned in the morning, the oidc-gatekeeper/0 unit appeared stuck in a crash loop, unable to get a pod. I'm not sure whether this happened immediately after deployment or sometime overnight. `juju status` shows:

    oidc-gatekeeper/0*            terminated  executing  10.1.182.113  8080/TCP                                                                                               (leader-settings-changed) unit stopped by the cloud
    oidc-gatekeeper/1             waiting     idle       10.1.182.119  8080/TCP                                                                                               Waiting for leadership

    At other times, this has shown the agent being lost (see full juju status below)

  4. I think that, as a result of 2, dex-auth.public-url and oidc-gatekeeper.public-url were both set to http://localhost, which is likely wrong (see step 6). I updated both (with `juju config`) to INGRESSIP.nip.io. I think dex-auth accepted this and restarted, but oidc-gatekeeper did not appear to update properly.

Other resources:

* `juju status`

    App                        Version                    Status  Scale  Charm                 Store       Channel  Rev  OS          Address         Message
    admission-webhook          res:oci-image@1abb127      active  1      admission-webhook     charmstore  stable   10   kubernetes  10.152.183.19
    argo-controller            res:oci-image@c1746ae      active  1      argo-controller       charmstore  stable   51   kubernetes
    dex-auth                   res:oci-image@af9c1b3      active  1      dex-auth              charmstore  stable   60   kubernetes  10.152.183.13
    istio-ingressgateway       res:oci-image@89b5fe2      active  1      istio-ingressgateway  charmstore  stable   20   kubernetes  10.64.140.43
    istio-pilot                res:oci-image@e3e03b3      active  1      istio-pilot           charmstore  stable   20   kubernetes  10.152.183.233
    jupyter-controller         res:oci-image@8c7be42      active  1      jupyter-controller    charmstore  stable   55   kubernetes
    jupyter-ui                 res:oci-image@af3b8ce      active  1      jupyter-ui            charmstore  stable   9    kubernetes  10.152.183.133
    kfp-api                    res:oci-image@8e60840      active  1      kfp-api               charmstore  stable   10   kubernetes  10.152.183.220
    kfp-db                     mariadb/server:10.3        active  1      mariadb-k8s           charmstore  stable   35   kubernetes  10.152.183.239
    kfp-persistence            res:oci-image@9338d08      active  1      kfp-persistence       charmstore  stable   7    kubernetes
    kfp-schedwf                res:oci-image@4ab6488      active  1      kfp-schedwf           charmstore  stable   7    kubernetes
    kfp-ui                     res:oci-image@04a4348      active  1      kfp-ui                charmstore  stable   9    kubernetes  10.152.183.226
    kfp-viewer                 res:oci-image@bae62bf      active  1      kfp-viewer            charmstore  stable   7    kubernetes
    kfp-viz                    res:oci-image@c90a581      active  1      kfp-viz               charmstore  stable   6    kubernetes  10.152.183.24
    kubeflow-dashboard         res:oci-image@126c9a9      active  1      kubeflow-dashboard    charmstore  stable   56   kubernetes  10.152.183.213
    kubeflow-profiles          res:profile-image@582b8eb  active  1      kubeflow-profiles     charmstore  stable   52   kubernetes  10.152.183.108
    minio                      res:oci-image@4707912      active  1      minio                 charmstore  stable   55   kubernetes  10.152.183.118
    mlmd                       res:oci-image@78eb66d      active  1      mlmd                  charmstore  stable   5    kubernetes  10.152.183.214
    oidc-gatekeeper            res:oci-image@9bb01f7      active  0/1    oidc-gatekeeper       charmstore  stable   53   kubernetes  10.152.183.144
    pytorch-operator           res:oci-image@08c3373      active  1      pytorch-operator      charmstore  stable   53   kubernetes
    seldon-controller-manager  res:oci-image@82fd029      active  1      seldon-core           charmstore  stable   50   kubernetes  10.152.183.69
    tfjob-operator             res:oci-image@3fabaf3      active  1      tfjob-operator        charmstore  stable   1    kubernetes

    Unit                          Workload  Agent  Address       Ports     Message
    admission-webhook/0           active    idle   10.1.182.77   443/TCP
    argo-controller/0             active    idle   10.1.182.114
    dex-auth/6                    active    idle   10.1.182.124  5556/TCP
    istio-ingressgateway/0        active    idle   10.1.182.121  15020/TCP,80/TCP,443/TCP,15029/TCP,15030/TCP,15031/TCP,15032/TCP,15443/TCP,15011/TCP,8060/TCP,853/TCP
    istio-pilot/0                 active    idle   10.1.182.90   8080/TCP,15010/TCP,15012/TCP,15017/TCP
    jupyter-controller/0          active    idle   10.1.182.88
    jupyter-ui/0                  active    idle   10.1.182.83   5000/TCP
    kfp-api/0                     active    idle   10.1.182.116  8888/TCP,8887/TCP
    kfp-db/0                      active    idle   10.1.182.89   3306/TCP  ready
    kfp-persistence/0             active    idle   10.1.182.115
    kfp-schedwf/0                 active    idle   10.1.182.104
    kfp-ui/0                      active    idle   10.1.182.117  3000/TCP
    kfp-viewer/0                  active    idle   10.1.182.111
    kfp-viz/0                     active    idle   10.1.182.106  8888/TCP
    kubeflow-dashboard/0          active    idle   10.1.182.112  8082/TCP
    kubeflow-profiles/0           active    idle   10.1.182.105  8080/TCP,8081/TCP
    minio/0                       active    idle   10.1.182.108  9000/TCP
    mlmd/0                        active    idle   10.1.182.110  8080/TCP
    oidc-gatekeeper/0             unknown   lost   10.1.182.113  8080/TCP  agent lost, see 'juju show-status-log oidc-gatekeeper/0'
    oidc-gatekeeper/1             unknown   lost   10.1.182.119  8080/TCP  agent lost, see 'juju show-status-log oidc-gatekeeper/1'
    pytorch-operator/0            active    idle   10.1.182.107  8443/TCP
    seldon-controller-manager/0*  active    idle   10.1.182.100  8080/TCP,4443/TCP
    tfjob-operator/0              active    idle   10.1.182.109  8443/TCP

* `microk8s kubectl get all --all-namespaces`

    NAMESPACE        NAME                                            READY  STATUS            RESTARTS  AGE
    metallb-system   pod/speaker-hv4ll                               1/1    Running           0         18h
    metallb-system   pod/controller-559b68bfd8-zvhsk                 1/1    Running           0         18h
    kube-system      pod/hostpath-provisioner-5c65fbdb4f-6cz2m       1/1    Running           0         18h
    controller-uk8s  pod/modeloperator-649944bf89-swqsg              1/1    Running           0         18h
    kubeflow         pod/modeloperator-658b4b6c58-gn5fc              1/1    Running           0         18h
    kubeflow         pod/admission-webhook-operator-0                1/1    Running           0         18h
    kubeflow         pod/argo-controller-operator-0                  1/1    Running           0         18h
    kubeflow         pod/dex-auth-operator-0                         1/1    Running           0         18h
    kubeflow         pod/admission-webhook-795c896784-n5bcc          1/1    Running           0         18h
    kubeflow         pod/jupyter-ui-operator-0                       1/1    Running           0         18h
    kubeflow         pod/istio-ingressgateway-operator-0             1/1    Running           0         18h
    kubeflow         pod/istio-pilot-operator-0                      1/1    Running           0         18h
    kubeflow         pod/jupyter-controller-operator-0               1/1    Running           0         18h
    kubeflow         pod/kfp-db-operator-0                           1/1    Running           0         18h
    kubeflow         pod/kfp-api-operator-0                          1/1    Running           0         18h
    kubeflow         pod/kfp-persistence-operator-0                  1/1    Running           0         18h
    kubeflow         pod/seldon-controller-manager-operator-0        1/1    Running           0         18h
    kubeflow         pod/kubeflow-profiles-operator-0                1/1    Running           0         18h
    kubeflow         pod/kfp-ui-operator-0                           1/1    Running           0         18h
    kubeflow         pod/kfp-schedwf-operator-0                      1/1    Running           0         18h
    kubeflow         pod/kfp-viz-operator-0                          1/1    Running           0         18h
    kubeflow         pod/minio-operator-0                            1/1    Running           0         18h
    kubeflow         pod/oidc-gatekeeper-operator-0                  1/1    Running           0         18h
    kubeflow         pod/pytorch-operator-operator-0                 1/1    Running           0         18h
    kubeflow         pod/tfjob-operator-operator-0                   1/1    Running           0         18h
    kubeflow         pod/kfp-viewer-operator-0                       1/1    Running           0         18h
    kubeflow         pod/kubeflow-dashboard-operator-0               1/1    Running           0         18h
    kubeflow         pod/mlmd-operator-0                             1/1    Running           0         18h
    kubeflow         pod/jupyter-ui-6d87f8dc8-4j4nz                  1/1    Running           0         18h
    kubeflow         pod/jupyter-controller-5b9b44fdc4-brtxs         1/1    Running           0         18h
    kubeflow         pod/kfp-schedwf-bc96bbbd6-dvspd                 1/1    Running           0         18h
    kubeflow         pod/minio-0                                     1/1    Running           0         18h
    kubeflow         pod/kfp-viewer-6d96fbf466-khqq4                 1/1    Running           0         18h
    kubeflow         pod/kubeflow-dashboard-694d66ffb6-nxl2n         1/1    Running           0         17h
    kubeflow         pod/kubeflow-profiles-7d54db8b75-qnnvf          2/2    Running           0         18h
    kubeflow         pod/oidc-gatekeeper-748687b564-429zg            1/1    Running           0         17h
    kubeflow         pod/kfp-persistence-648f685479-jbcwn            1/1    Running           2         17h
    kubeflow         pod/istio-ingressgateway-957447478-cd8tr        1/1    Running           0         17h
    kubeflow         pod/kfp-viz-67775f7888-z7zp9                    1/1    Running           0         18h
    kubeflow         pod/kfp-api-54dd7dc858-2xrgp                    1/1    Running           0         17h
    kubeflow         pod/kfp-ui-869494c98c-8m8wj                     1/1    Running           0         17h
    kube-system      pod/calico-kube-controllers-f7868dd95-47v6p     1/1    Running           0         18h
    kubeflow         pod/kfp-db-0                                    1/1    Running           0         18h
    kubeflow         pod/istio-pilot-7bfdbc474b-prtlk                1/1    Running           0         18h
    ingress          pod/nginx-ingress-microk8s-controller-59dhj     1/1    Running           0         18h
    kube-system      pod/coredns-7f9c69c78c-fffsj                    1/1    Running           0         18h
    controller-uk8s  pod/controller-0                                2/2    Running           1         18h
    kubeflow         pod/tfjob-operator-965d5c769-7gltp              1/1    Running           1         18h
    kubeflow         pod/mlmd-0                                      1/1    Running           0         18h
    kube-system      pod/calico-node-jn6gn                           1/1    Running           0         18h
    kubeflow         pod/pytorch-operator-568d56c769-pf2pj           1/1    Running           1         18h
    kubeflow         pod/dex-auth-5f68f57bc9-jcb8w                   1/1    Running           0         141m
    kubeflow         pod/seldon-controller-manager-5c8fbffc67-hfhgt  0/1    CrashLoopBackOff  148       18h
    kubeflow         pod/argo-controller-84468669d4-h4x6g            0/1    CrashLoopBackOff  148       17h

    NAMESPACE        NAME                                        TYPE          CLUSTER-IP      EXTERNAL-IP   PORT(S)                                 AGE
    default          service/kubernetes                          ClusterIP     10.152.183.1    <none>        443/TCP                                 18h
    kube-system      service/kube-dns                            ClusterIP     10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP                  18h
    controller-uk8s  service/controller-service                  ClusterIP     10.152.183.135  <none>        17070/TCP                               18h
    controller-uk8s  service/modeloperator                       ClusterIP     10.152.183.229  <none>        17071/TCP                               18h
    kubeflow         service/modeloperator                       ClusterIP     10.152.183.254  <none>        17071/TCP                               18h
    kubeflow         service/admission-webhook-operator          ClusterIP     10.152.183.191  <none>        30666/TCP                               18h
    kubeflow         service/argo-controller-operator            ClusterIP     10.152.183.143  <none>        30666/TCP                               18h
    kubeflow         service/dex-auth-operator                   ClusterIP     10.152.183.140  <none>        30666/TCP                               18h
    kubeflow         service/istio-ingressgateway-operator       ClusterIP     10.152.183.129  <none>        30666/TCP                               18h
    kubeflow         service/admission-webhook                   ClusterIP     10.152.183.19   <none>        443/TCP                                 18h
    kubeflow         service/istio-pilot-operator                ClusterIP     10.152.183.212  <none>        30666/TCP                               18h
    kubeflow         service/jupyter-controller-operator         ClusterIP     10.152.183.3    <none>        30666/TCP                               18h
    kubeflow         service/jupyter-ui-operator                 ClusterIP     10.152.183.227  <none>        30666/TCP                               18h
    kubeflow         service/dex-auth                            ClusterIP     10.152.183.13   <none>        5556/TCP                                18h
    kubeflow         service/kfp-api-operator                    ClusterIP     10.152.183.63   <none>        30666/TCP                               18h
    kubeflow         service/kfp-db-operator                     ClusterIP     10.152.183.219  <none>        30666/TCP                               18h
    kubeflow         service/jupyter-ui                          ClusterIP     10.152.183.133  <none>        5000/TCP                                18h
    kubeflow         service/kfp-persistence-operator            ClusterIP     10.152.183.246  <none>        30666/TCP                               18h
    kubeflow         service/kfp-db                              ClusterIP     10.152.183.239  <none>        3306/TCP                                18h
    kubeflow         service/kfp-db-endpoints                    ClusterIP     None            <none>        <none>                                  18h
    kubeflow         service/istio-pilot                         ClusterIP     10.152.183.233  <none>        8080/TCP,15010/TCP,15012/TCP,15017/TCP  18h
    kubeflow         service/seldon-controller-manager-operator  ClusterIP     10.152.183.36   <none>        30666/TCP                               18h
    kubeflow         service/kfp-schedwf-operator                ClusterIP     10.152.183.117  <none>        30666/TCP                               18h
    kubeflow         service/kfp-ui-operator                     ClusterIP     10.152.183.238  <none>        30666/TCP                               18h
    kubeflow         service/kfp-viz-operator                    ClusterIP     10.152.183.119  <none>        30666/TCP                               18h
    kubeflow         service/kubeflow-profiles-operator          ClusterIP     10.152.183.247  <none>        30666/TCP                               18h
    kubeflow         service/minio-operator                      ClusterIP     10.152.183.81   <none>        30666/TCP                               18h
    kubeflow         service/oidc-gatekeeper-operator            ClusterIP     10.152.183.109  <none>        30666/TCP                               18h
    kubeflow         service/pytorch-operator-operator           ClusterIP     10.152.183.244  <none>        30666/TCP                               18h
    kubeflow         service/tfjob-operator-operator             ClusterIP     10.152.183.134  <none>        30666/TCP                               18h
    kubeflow         service/kfp-viewer-operator                 ClusterIP     10.152.183.151  <none>        30666/TCP                               18h
    kubeflow         service/seldon-controller-manager           ClusterIP     10.152.183.69   <none>        8080/TCP,4443/TCP                       18h
    kubeflow         service/kubeflow-dashboard-operator         ClusterIP     10.152.183.130  <none>        30666/TCP                               18h
    kubeflow         service/mlmd-operator                       ClusterIP     10.152.183.160  <none>        30666/TCP                               18h
    kubeflow         service/kubeflow-profiles                   ClusterIP     10.152.183.108  <none>        8080/TCP,8081/TCP                       18h
    kubeflow         service/kfp-viz                             ClusterIP     10.152.183.24   <none>        8888/TCP                                18h
    kubeflow         service/minio                               ClusterIP     10.152.183.118  <none>        9000/TCP                                18h
    kubeflow         service/minio-endpoints                     ClusterIP     None            <none>        <none>                                  18h
    kubeflow         service/mlmd                                ClusterIP     10.152.183.214  <none>        8080/TCP                                18h
    kubeflow         service/mlmd-endpoints                      ClusterIP     None            <none>        <none>                                  18h
    kubeflow         service/ml-pipeline                         ClusterIP     10.152.183.220  <none>        8887/TCP,8888/TCP                       18h
    kubeflow         service/kubeflow-dashboard                  ClusterIP     10.152.183.213  <none>        8082/TCP                                18h
    kubeflow         service/oidc-gatekeeper                     ClusterIP     10.152.183.144  <none>        8080/TCP                                17h
    kubeflow         service/kfp-api                             ClusterIP     10.152.183.104  <none>        8888/TCP,8887/TCP                       17h
    kubeflow         service/kfp-ui                              ClusterIP     10.152.183.226  <none>        3000/TCP                                17h
    kubeflow         service/istio-ingressgateway                LoadBalancer  10.152.183.131  10.64.140.43  15020:30975/TCP,80:30867/TCP,443:30879/TCP,15029:30529/TCP,15030:31280/TCP,15031:32546/TCP,15032:30608/TCP,15443:31540/TCP,15011:30785/TCP,8060:32290/TCP,853:30859/TCP  17h
    kubeflow         service/seldon-webhook-service              ClusterIP     10.152.183.9    <none>        4443/TCP                                17h

    NAMESPACE       NAME                                              DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                AGE
    kube-system     daemonset.apps/calico-node                        1        1        1      1           1          kubernetes.io/os=linux       18h
    metallb-system  daemonset.apps/speaker                            1        1        1      1           1          beta.kubernetes.io/os=linux  18h
    ingress         daemonset.apps/nginx-ingress-microk8s-controller  1        1        1      1           1          <none>                       18h

    NAMESPACE        NAME                                       READY  UP-TO-DATE  AVAILABLE  AGE
    kube-system      deployment.apps/calico-kube-controllers    1/1    1           1          18h
    kube-system      deployment.apps/coredns                    1/1    1           1          18h
    metallb-system   deployment.apps/controller                 1/1    1           1          18h
    kube-system      deployment.apps/hostpath-provisioner       1/1    1           1          18h
    controller-uk8s  deployment.apps/modeloperator              1/1    1           1          18h
    kubeflow         deployment.apps/modeloperator              1/1    1           1          18h
    kubeflow         deployment.apps/jupyter-controller         1/1    1           1          18h
    kubeflow         deployment.apps/admission-webhook          1/1    1           1          18h
    kubeflow         deployment.apps/jupyter-ui                 1/1    1           1          18h
    kubeflow         deployment.apps/kfp-schedwf                1/1    1           1          18h
    kubeflow         deployment.apps/istio-pilot                1/1    1           1          18h
    kubeflow         deployment.apps/kfp-viz                    1/1    1           1          18h
    kubeflow         deployment.apps/pytorch-operator           1/1    1           1          18h
    kubeflow         deployment.apps/tfjob-operator             1/1    1           1          18h
    kubeflow         deployment.apps/kfp-viewer                 1/1    1           1          18h
    kubeflow         deployment.apps/kubeflow-dashboard         1/1    1           1          17h
    kubeflow         deployment.apps/kfp-ui                     1/1    1           1          17h
    kubeflow         deployment.apps/istio-ingressgateway       1/1    1           1          17h
    kubeflow         deployment.apps/kfp-api                    1/1    1           1          17h
    kubeflow         deployment.apps/kubeflow-profiles          1/1    1           1          18h
    kubeflow         deployment.apps/oidc-gatekeeper            1/1    1           1          17h
    kubeflow         deployment.apps/kfp-persistence            1/1    1           1          17h
    kubeflow         deployment.apps/dex-auth                   1/1    1           1          18h
    kubeflow         deployment.apps/seldon-controller-manager  0/1    1           0          18h
    kubeflow         deployment.apps/argo-controller            0/1    1           0          17h

    NAMESPACE        NAME                                                  DESIRED  CURRENT  READY  AGE
    kube-system      replicaset.apps/calico-kube-controllers-f7868dd95     1        1        1      18h
    kube-system      replicaset.apps/coredns-7f9c69c78c                    1        1        1      18h
    metallb-system   replicaset.apps/controller-559b68bfd8                 1        1        1      18h
    kube-system      replicaset.apps/hostpath-provisioner-5c65fbdb4f       1        1        1      18h
    controller-uk8s  replicaset.apps/modeloperator-649944bf89              1        1        1      18h
    kubeflow         replicaset.apps/modeloperator-658b4b6c58              1        1        1      18h
    kubeflow         replicaset.apps/admission-webhook-795c896784          1        1        1      18h
    kubeflow         replicaset.apps/jupyter-ui-6d87f8dc8                  1        1        1      18h
    kubeflow         replicaset.apps/jupyter-controller-5b9b44fdc4         1        1        1      18h
    kubeflow         replicaset.apps/istio-pilot-7bfdbc474b                1        1        1      18h
    kubeflow         replicaset.apps/kfp-schedwf-bc96bbbd6                 1        1        1      18h
    kubeflow         replicaset.apps/kfp-viz-67775f7888                    1        1        1      18h
    kubeflow         replicaset.apps/pytorch-operator-568d56c769           1        1        1      18h
    kubeflow         replicaset.apps/tfjob-operator-965d5c769              1        1        1      18h
    kubeflow         replicaset.apps/kfp-viewer-6d96fbf466                 1        1        1      18h
    kubeflow         replicaset.apps/kubeflow-dashboard-694d66ffb6         1        1        1      17h
    kubeflow         replicaset.apps/kfp-ui-869494c98c                     1        1        1      17h
    kubeflow         replicaset.apps/istio-ingressgateway-957447478        1        1        1      17h
    kubeflow         replicaset.apps/kfp-api-54dd7dc858                    1        1        1      17h
    kubeflow         replicaset.apps/kubeflow-profiles-7d54db8b75          1        1        1      18h
    kubeflow         replicaset.apps/oidc-gatekeeper-748687b564            1        1        1      17h
    kubeflow         replicaset.apps/kfp-persistence-648f685479            1        1        1      17h
    kubeflow         replicaset.apps/dex-auth-5f68f57bc9                   1        1        1      141m
    kubeflow         replicaset.apps/seldon-controller-manager-5c8fbffc67  1        1        0      18h
    kubeflow         replicaset.apps/argo-controller-84468669d4            1        1        0      17h

    NAMESPACE        NAME                                                 READY  AGE
    controller-uk8s  statefulset.apps/controller                          1/1    18h
    kubeflow         statefulset.apps/admission-webhook-operator          1/1    18h
    kubeflow         statefulset.apps/argo-controller-operator            1/1    18h
    kubeflow         statefulset.apps/dex-auth-operator                   1/1    18h
    kubeflow         statefulset.apps/jupyter-ui-operator                 1/1    18h
    kubeflow         statefulset.apps/istio-ingressgateway-operator       1/1    18h
    kubeflow         statefulset.apps/istio-pilot-operator                1/1    18h
    kubeflow         statefulset.apps/jupyter-controller-operator         1/1    18h
    kubeflow         statefulset.apps/kfp-db-operator                     1/1    18h
    kubeflow         statefulset.apps/kfp-api-operator                    1/1    18h
    kubeflow         statefulset.apps/kfp-persistence-operator            1/1    18h
    kubeflow         statefulset.apps/seldon-controller-manager-operator  1/1    18h
    kubeflow         statefulset.apps/kubeflow-profiles-operator          1/1    18h
    kubeflow         statefulset.apps/kfp-ui-operator                     1/1    18h
    kubeflow         statefulset.apps/kfp-schedwf-operator                1/1    18h
    kubeflow         statefulset.apps/kfp-viz-operator                    1/1    18h
    kubeflow         statefulset.apps/minio-operator                      1/1    18h
    kubeflow         statefulset.apps/oidc-gatekeeper-operator            1/1    18h
    kubeflow         statefulset.apps/pytorch-operator-operator           1/1    18h
    kubeflow         statefulset.apps/tfjob-operator-operator             1/1    18h
    kubeflow         statefulset.apps/kfp-viewer-operator                 1/1    18h
    kubeflow         statefulset.apps/kubeflow-dashboard-operator         1/1    18h
    kubeflow         statefulset.apps/mlmd-operator                       1/1    18h
    kubeflow         statefulset.apps/kfp-db                              1/1    18h
    kubeflow         statefulset.apps/minio                               1/1    18h
    kubeflow         statefulset.apps/mlmd                                1/1    18h
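
With this much output it is easy to miss the two pods in CrashLoopBackOff. As a reading aid, here is a minimal shell sketch that narrows a pod listing down to the rows that look unhealthy. The function name `unhealthy_pods` is made up for illustration, and the field positions assume the default six-column output of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). It is not part of the bundle or of microk8s.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on
# stdin and prints pods that are not Running/Completed, plus Running pods
# whose READY count is incomplete (for example 0/1).
unhealthy_pods() {
  awk 'NR == 1 { next }   # skip the header row
       {
         split($3, ready, "/")
         bad_status = ($4 != "Running" && $4 != "Completed")
         not_ready  = ($4 == "Running" && ready[1] != ready[2])
         if (bad_status || not_ready)
           print $1 "/" $2, $4, "restarts=" $5
       }'
}

# Usage (assumes microk8s is installed and the cluster is reachable):
#   microk8s kubectl get pods --all-namespaces | unhealthy_pods
```

Filtering on the STATUS column rather than on the pod phase is deliberate: a pod in CrashLoopBackOff can still report the phase `Running`, so a `--field-selector` on `status.phase` alone may not catch it.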

knkski commented 3 years ago

I think the root issue you're seeing here is that istio-ingressgateway wasn't set up correctly. Although it eventually came up, at the time this code ran, service/istio-ingressgateway didn't have a loadbalancer IP set up by metallb yet.

You should be able to fix this by working out the right hostname and running these commands again. In your case, that would be:

juju config dex-auth public-url=10.64.140.43.nip.io
juju config oidc-gatekeeper public-url=10.64.140.43.nip.io
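
For anyone who would rather not copy the IP by hand, the same steps can be sketched as a small shell script. This is only an illustration under a few assumptions: the function names are made up, the service is assumed to be `istio-ingressgateway` in the `kubeflow` namespace (as in the `get all` output above), and the `http://` prefix follows the values that eventually worked later in this thread rather than the bare hostname shown here.

```shell
# Build the nip.io public URL for a given ingress IP (pure string handling).
public_url_for_ip() {
  printf 'http://%s.nip.io\n' "$1"
}

# Read the LoadBalancer IP that MetalLB assigned to the ingress gateway.
# Prints nothing if no IP has been assigned yet.
ingress_ip() {
  microk8s kubectl get service istio-ingressgateway -n kubeflow \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Point dex-auth and oidc-gatekeeper at the ingress hostname.
reconfigure_public_url() {
  ip="$(ingress_ip)"
  if [ -z "$ip" ]; then
    echo "istio-ingressgateway has no LoadBalancer IP yet" >&2
    return 1
  fi
  url="$(public_url_for_ip "$ip")"
  microk8s juju config dex-auth public-url="$url"
  microk8s juju config oidc-gatekeeper public-url="$url"
}

# Usage (assumes microk8s with the kubeflow model already deployed):
#   reconfigure_public_url
```

The empty-IP check matters here because the original failure was exactly this case: the service existed, but MetalLB had not yet assigned it an address.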

As far as why istio-ingressgateway had issues, it's a little hard to diagnose what exactly went wrong. That charm relies on reading a configmap generated by istio-pilot. That code only runs when Juju executes the operator code, and the default update-status-hook-interval is 5m for a model. So if istio-ingressgateway can't read that configmap in the first few hooks that run, it only checks every 5 minutes by default. We should probably import this code over to microk8s enable kubeflow to handle that situation a little better.
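
One way to handle that situation is to poll until the ingress gateway actually has an IP before configuring anything that depends on it. The following is a rough sketch of that idea, not the code referred to above: `retry_until` and `has_ingress_ip` are hypothetical names, and the attempt count and delay are arbitrary.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Succeeds once the istio-ingressgateway service reports a LoadBalancer IP.
# Assumes the service lives in the kubeflow namespace.
has_ingress_ip() {
  [ -n "$(microk8s kubectl get service istio-ingressgateway -n kubeflow \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" ]
}

# Usage: wait up to 30 x 10s for the IP.
#   retry_until 30 10 has_ingress_ip
# Optionally, shorten the hook interval so charms re-check sooner:
#   juju model-config update-status-hook-interval=1m
```

Lowering `update-status-hook-interval` makes every charm in the model run its update-status hook more often, so it is probably best treated as a temporary setting while the deployment settles.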

ca-scribner commented 3 years ago

After more debugging I could not get the oidc-gatekeeper application to behave, so I tried:

juju remove-application dex-auth oidc-gatekeeper --force

microk8s juju deploy dex-auth 
microk8s juju deploy oidc-gatekeeper

microk8s juju relate dex-auth:oidc-client oidc-gatekeeper:oidc-client
microk8s juju relate istio-pilot:ingress oidc-gatekeeper:ingress
microk8s juju relate istio-pilot:ingress-auth oidc-gatekeeper:ingress-auth

microk8s juju config dex-auth static-username=admin
microk8s juju config dex-auth static-password=admin
microk8s juju config dex-auth public-url=http://10.64.140.43.nip.io
microk8s juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io

After waiting for everything to settle, I successfully connected to the Kubeflow dashboard. I'm not sure why the oidc-gatekeeper charm became unrecoverable. It may be related to the incorrect public-url=http://localhost, or something else may have gone wrong that the bad URL just made harder to deal with. I'm also not sure what to do about it.