canonical / notebook-operators

Charmed Jupyter Notebooks

Cannot start notebook after cluster restart. #61

Closed (Barteus closed this issue 1 year ago)

Barteus commented 2 years ago

Reproduce:

  1. Install CKF using "Quick start".
  2. Create a notebook.
  3. Restart the machine/evict the notebook from the node.
  4. Try starting the notebook -> the notebook always stays in the status "No Pods are currently running for this Notebook Server".

Expected: Notebook starts

Environment:

  - OS: Ubuntu 20.04
  - MicroK8s: 1.22
  - Kubeflow: 1.6
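
For reference, a rough sketch of checking the symptom from the CLI after the restart (the admin namespace, the notebook name, and the notebook-name pod label are assumptions based on upstream Kubeflow, not details from this report): the Notebook resource still exists, but no pod is running for it, which is what the dashboard reports as "No Pods are currently running for this Notebook Server".

$ microk8s.kubectl -n admin get notebooks
$ microk8s.kubectl -n admin get pods -l notebook-name=<notebook-name>    # expected: no pods returned
$ microk8s.kubectl -n admin describe statefulset <notebook-name>         # look for events explaining why no pod is created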

ca-scribner commented 1 year ago

This bug appears to be the same and has some good additional details.

i-chvets commented 1 year ago

I managed to reproduce the issue. It looks like something is not correct with how the notebook server is shut down when the VM is stopped. There is a workaround.

i-chvets commented 1 year ago

More investigation: when there are two or more notebook servers, if just one of them is properly stopped (e.g. via the UI) before the VM restart, then all notebook servers can be restarted and connected to after the VM is restarted.
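
As a side note, a sketch of doing that "proper stop" from the CLI rather than the UI: upstream Kubeflow's notebook-controller treats a Notebook carrying the kubeflow-resource-stopped annotation as stopped (annotation name taken from upstream Kubeflow, not from this thread; the namespace and server name are placeholders).

$ microk8s.kubectl -n admin annotate notebook <server-name> kubeflow-resource-stopped="$(date -u +%FT%TZ)" --overwrite
$ microk8s.kubectl -n admin annotate notebook <server-name> kubeflow-resource-stopped-    # remove the annotation to start the server again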

i-chvets commented 1 year ago

Looking at https://github.com/canonical/bundle-kubeflow/issues/515 I see that the OS disk is 64GB, which should be enough for the Kubeflow deployment and the OS. However, when additional volumes are created they add to the total required. There are two volumes added to the notebook server, 10GB each. I was wondering whether that could cause an issue on startup of the notebook server pods when the VM is restarted.

Can we confirm what disk sizes were used when this issue occurred? We need to make sure that the mount points for those notebook volumes are not eating into Kubeflow's storage requirements.
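
A quick sketch of the checks that would answer this, assuming the default MicroK8s hostpath-storage path (/var/snap/microk8s/common/default-storage):

$ df -h /                                              # OS disk usage on the VM
$ microk8s.kubectl get pvc -A                          # notebook workspace/data volumes and their requested sizes
$ du -sh /var/snap/microk8s/common/default-storage     # space actually consumed by the hostpath PVs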

i-chvets commented 1 year ago

After letting the VM sit in a stopped state for a couple of days, I was able to reproduce the issue. The notebook server named new-test was left running before the VM shutdown. After the restart, all pods in the kubeflow and admin namespaces are in Running state, but the notebook server pod new-test does not exist (usually it was available and in Running state, and that is when the notebook server was accessible after the VM restart).
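
For context, that state can be listed with something like the following (the user namespace is admin, as in the report):

$ microk8s.kubectl -n admin get notebooks
$ microk8s.kubectl -n admin get statefulsets,pods | grep new-test    # check whether the StatefulSet for new-test still exists and whether it has a pod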

This is the info gathered after the restart.

$ microk8s.kubectl -n kubeflow describe pod jupyter-controller-operator-0
Name:         jupyter-controller-operator-0
Namespace:    kubeflow
Priority:     0
Node:         kf-test/10.128.0.14
Start Time:   Mon, 28 Nov 2022 15:01:36 +0000
Labels:       controller-revision-hash=jupyter-controller-operator-68fc77d85d
              operator.juju.is/name=jupyter-controller
              operator.juju.is/target=application
              statefulset.kubernetes.io/pod-name=jupyter-controller-operator-0
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              cni.projectcalico.org/podIP: 10.1.211.219/32
              cni.projectcalico.org/podIPs: 10.1.211.219/32
              controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
              juju.is/version: 2.9.34
              model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.211.219
IPs:
  IP:           10.1.211.219
Controlled By:  StatefulSet/jupyter-controller-operator
Containers:
  juju-operator:
    Container ID:  containerd://749cd9a3b114b2cb9d0142e3d9e6eafdbd7685b6f8853d2dd36e51d34f5ab09e
    Image:         jujusolutions/jujud-operator:2.9.34
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:3b46568ca590857dfa053ea84eea457a3389de34dd8775f0b32bfb2c0a55f700
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud

      $JUJU_TOOLS_DIR/jujud caasoperator --application-name=jupyter-controller --debug

    State:          Running
      Started:      Wed, 30 Nov 2022 19:11:18 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 28 Nov 2022 15:01:39 +0000
      Finished:     Wed, 30 Nov 2022 19:06:02 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      JUJU_APPLICATION:          jupyter-controller
      JUJU_OPERATOR_SERVICE_IP:  10.152.183.234
      JUJU_OPERATOR_POD_IP:       (v1:status.podIP)
      JUJU_OPERATOR_NAMESPACE:   kubeflow (v1:metadata.namespace)
    Mounts:
      /var/lib/juju/agents/application-jupyter-controller/operator.yaml from jupyter-controller-operator-config (rw,path="operator.yaml")
      /var/lib/juju/agents/application-jupyter-controller/template-agent.conf from jupyter-controller-operator-config (rw,path="template-agent.conf")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q4kkf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jupyter-controller-operator-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      jupyter-controller-operator-config
    Optional:  false
  kube-api-access-q4kkf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age                From     Message
  ----    ------          ----               ----     -------
  Normal  SandboxChanged  36m (x3 over 40m)  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal  Pulled          35m                kubelet  Container image "jujusolutions/jujud-operator:2.9.34" already present on machine
  Normal  Created         35m                kubelet  Created container juju-operator
  Normal  Started         35m                kubelet  Started container juju-operator
$ microk8s.kubectl -n kubeflow describe pod jupyter-controller-5d4949ddd7-ng4fp
Name:         jupyter-controller-5d4949ddd7-ng4fp
Namespace:    kubeflow
Priority:     0
Node:         kf-test/10.128.0.14
Start Time:   Mon, 28 Nov 2022 15:02:28 +0000
Labels:       app.kubernetes.io/name=jupyter-controller
              pod-template-hash=5d4949ddd7
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              charm.juju.is/modified-version: 0
              cni.projectcalico.org/podIP: 10.1.212.40/32
              cni.projectcalico.org/podIPs: 10.1.212.40/32
              controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
              model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
              seccomp.security.beta.kubernetes.io/pod: docker/default
              unit.juju.is/id: jupyter-controller/0
Status:       Running
IP:           10.1.212.40
IPs:
  IP:           10.1.212.40
Controlled By:  ReplicaSet/jupyter-controller-5d4949ddd7
Init Containers:
  juju-pod-init:
    Container ID:  containerd://fcb5a989f05f8a18a5acb2715e0575aa58f90e9f49a04d19c8963f20ea36555b
    Image:         jujusolutions/jujud-operator:2.9.34
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:3b46568ca590857dfa053ea84eea457a3389de34dd8775f0b32bfb2c0a55f700
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud

      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Nov 2022 19:11:51 +0000
      Finished:     Wed, 30 Nov 2022 19:16:27 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dmpn2 (ro)
Containers:
  jupyter-controller:
    Container ID:  containerd://570b2722a9f13f7be7f05b0fa6c6db5c944672551c49f19ae9674dfeafbc0771
    Image:         registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1
    Image ID:      registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1
    Port:          <none>
    Host Port:     <none>
    Command:
      ./manager
    State:          Running
      Started:      Wed, 30 Nov 2022 19:16:34 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 28 Nov 2022 15:03:23 +0000
      Finished:     Wed, 30 Nov 2022 19:06:02 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      ENABLE_CULLING:  true
      ISTIO_GATEWAY:   kubeflow/kubeflow-gateway
      USE_ISTIO:       true
    Mounts:
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dmpn2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-dmpn2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From     Message
  ----     ------                  ----               ----     -------
  Warning  FailedCreatePodSandBox  36m                kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2fe236223fa8f0dae56d0bb130386e980e4c70c9e9d68796b70809330df9597a": Get "https://[10.152.183.1]:443/apis/crd.projectcalico.org/v1/ipamconfigs/default": context deadline exceeded
  Normal   SandboxChanged          36m (x4 over 41m)  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  36m                kubelet  Container image "jujusolutions/jujud-operator:2.9.34" already present on machine
  Normal   Created                 36m                kubelet  Created container juju-pod-init
  Normal   Started                 36m                kubelet  Started container juju-pod-init
  Normal   Pulled                  31m                kubelet  Container image "registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1" already present on machine
  Normal   Created                 31m                kubelet  Created container jupyter-controller
  Normal   Started                 31m                kubelet  Started container jupyter-controller

Logs from the Jupyter pods are attached: jupyter-controller-pod.log, jupyter-controller-operator-pod.log.
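
Since the controller pods themselves recover, a reasonable next step (not shown in this thread) would be to look at why the notebook's own StatefulSet never brings its pod back, e.g.:

$ microk8s.kubectl -n admin get statefulset new-test -o yaml          # compare spec.replicas with status
$ microk8s.kubectl -n admin describe statefulset new-test             # events that might explain the missing pod
$ microk8s.kubectl -n admin get events --sort-by=.lastTimestamp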

i-chvets commented 1 year ago

A newly created notebook server had access to previous runs that were done in the notebook server that did not restart (new-test).

i-chvets commented 1 year ago

Leaving the VM in a stopped state overnight usually results in notebook servers that do not start up. Waiting 4-6 hours does not always result in the same behaviour.

i-chvets commented 1 year ago

Verified with Kubeflow 1.6 on K8s 1.24. After a restart of the VM with the MicroK8s cluster, all services came up and a connection to the notebook server could be made. Details of the deployment:

$ microk8s status
addons:
  dns
  ha-cluster
  hostpath-storage
  ingress
  metallb
  storage

$ juju status
App Version Charm Channel Rev
admission-webhook res:oci-image@129fe92 admission-webhook 1.6/stable 60
argo-controller res:oci-image@669ebd5 argo-controller 3.3/stable 99
argo-server res:oci-image@576d038 argo-server 3.3/stable 45
dex-auth   dex-auth 2.31/stable 129
istio-ingressgateway   istio-gateway 1.11/stable 114
istio-pilot   istio-pilot 1.11/stable 131
jupyter-controller res:oci-image@e05857e jupyter-controller 1.6/stable 163
jupyter-ui res:oci-image@d55c600 jupyter-ui 1.6/stable 124
katib-controller res:oci-image@03d47fb katib-controller 0.14/stable 92
katib-db mariadb/server:10.3 charmed-osm-mariadb-k8s latest/stable 35
katib-db-manager res:oci-image@16b33a5 katib-db-manager 0.14/stable 66
katib-ui res:oci-image@c7dc04a katib-ui 0.14/stable 90
kfp-api res:oci-image@bf747d5 kfp-api 2.0/stable 144
kfp-db mariadb/server:10.3 charmed-osm-mariadb-k8s latest/stable 35
kfp-persistence res:oci-image@abcf971 kfp-persistence 2.0/stable 141
kfp-profile-controller res:oci-image@b4de878 kfp-profile-controller 2.0/stable 125
kfp-schedwf res:oci-image@9c9f710 kfp-schedwf 2.0/stable 155
kfp-ui res:oci-image@47864af kfp-ui 2.0/stable 144
kfp-viewer res:oci-image@94754c0 kfp-viewer 2.0/stable 152
kfp-viz res:oci-image@23ab9b9 kfp-viz 2.0/stable 134
kubeflow-dashboard res:oci-image@6fe6eec kubeflow-dashboard 1.6/stable 183
kubeflow-profiles res:profile-image@cfd6935 kubeflow-profiles 1.6/stable 94
kubeflow-roles   kubeflow-roles 1.6/stable 49
kubeflow-volumes res:oci-image@fdb4a5d kubeflow-volumes 1.6/stable 84
metacontroller-operator   metacontroller-operator 2.0/stable 48
minio res:oci-image@1755999 minio ckf-1.6/stable 99
mlflow-db mariadb/server:10.3 charmed-osm-mariadb-k8s stable 35
mlflow-server res:oci-image@bba33cd mlflow-server stable 77
oidc-gatekeeper res:oci-image@32de216 oidc-gatekeeper ckf-1.6/stable 76
seldon-controller-manager res:oci-image@eb811b6 seldon-core 1.14/stable 92
tensorboard-controller res:oci-image@51058f7 tensorboard-controller 1.6/stable 69
tensorboards-web-app res:oci-image@eef68a5 tensorboards-web-app 1.6/stable 71
training-operator   training-operator 1.5/stable 65

i-chvets commented 1 year ago

If the problem occurs again, a new issue will be opened. Closing.