Barteus closed this issue 1 year ago.
This bug appears to be the same and has some good additional details.
I managed to reproduce the issue. It looks like the notebook server is not shut down correctly when the VM is stopped. There is a workaround.
More investigation: when there are two or more notebook servers and just one of them is properly stopped (e.g. via the UI) before the VM restart, all notebook servers can be restarted and connected to after the VM is restarted.
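The workaround above (stopping one notebook server before the VM restart) could also be scripted instead of done through the UI. The sketch below only builds the command and does not run it. It assumes the upstream notebook controller convention of treating a `kubeflow-resource-stopped` annotation on the Notebook resource as a stop request; the name `new-test` and namespace `admin` are the ones mentioned in this thread, and `build_stop_command` is a hypothetical helper. Whether stopping a server this way has the same effect as stopping it via the UI has not been verified here.

```python
from datetime import datetime, timezone


def build_stop_command(notebook, namespace, kubectl="microk8s.kubectl"):
    """Return the argv that annotates a Notebook resource as stopped.

    Assumption: the notebook controller scales the server down when the
    `kubeflow-resource-stopped` annotation is present on the resource.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        kubectl, "-n", namespace,
        "annotate", "notebooks.kubeflow.org", notebook,
        f"kubeflow-resource-stopped={stamp}",
        "--overwrite",
    ]


if __name__ == "__main__":
    # Print the command rather than executing it, so it can be reviewed first.
    print(" ".join(build_stop_command("new-test", "admin")))
```

Printing the command first keeps the sketch safe to run on a machine without a cluster; pass the list to `subprocess.run` only after confirming the annotation behaves as assumed in your deployment.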
Looking at https://github.com/canonical/bundle-kubeflow/issues/515, I see that the OS disk is 64GB, which should be enough for the Kubeflow deployment and the OS. However, each additional volume adds to the total required. Two volumes of 10GB each are added to the notebook server. I was wondering whether that could cause an issue at startup of the notebook server pods when the VM is restarted.
Can we confirm what disk sizes were used when this issue occurred? We need to make sure that the mount points for those notebook volumes are not eating into Kubeflow's storage requirements.
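To make the disk concern concrete, here is a small arithmetic sketch. It assumes the notebook volumes are backed by the same 64GB OS disk (as would be the case with a hostpath-style storage class) and that each volume could fill to its requested size. The helper name `remaining_disk_gb` and the three-server case are illustrative only.

```python
def remaining_disk_gb(os_disk_gb, volume_sizes_gb):
    """Disk left for the OS and Kubeflow if every notebook volume fills up."""
    return os_disk_gb - sum(volume_sizes_gb)


if __name__ == "__main__":
    # From the thread: 64 GB OS disk, one notebook server with two 10 GB volumes.
    print(remaining_disk_gb(64, [10, 10]))      # 44
    # Hypothetical: three notebook servers with the same volume layout.
    print(remaining_disk_gb(64, [10, 10] * 3))  # 4
```

With a single notebook server the headroom looks comfortable, but under these assumptions it shrinks by 20GB per additional server, which is why the actual disk sizes matter.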
After letting the VM sit in a stopped state for a couple of days, I was able to reproduce the issue.
The notebook server named `new-test` was left running before the VM shutdown.
After the restart, all pods in the `kubeflow` and `admin` namespaces are in the `Running` state.
The notebook server pod `new-test` does not exist (usually it was available and in the `Running` state, and that is when the notebook server was accessible after a VM restart).
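The check described above (are all pods `Running`, and does a pod for the notebook server exist?) can be sketched as a small parser over the JSON printed by `microk8s.kubectl -n admin get pods -o json`. The helper `pod_phases` is hypothetical, and the sample input below is illustrative rather than captured from the affected VM; it also assumes the notebook pod name starts with the notebook server name.

```python
import json


def pod_phases(pods_json, name_prefix=""):
    """Map pod name -> phase for pods whose name starts with name_prefix."""
    items = json.loads(pods_json).get("items", [])
    return {
        pod["metadata"]["name"]: pod.get("status", {}).get("phase", "Unknown")
        for pod in items
        if pod["metadata"]["name"].startswith(name_prefix)
    }


if __name__ == "__main__":
    # Illustrative input only; replace with real `get pods -o json` output.
    sample = json.dumps({"items": [
        {"metadata": {"name": "some-other-pod"}, "status": {"phase": "Running"}},
    ]})
    phases = pod_phases(sample, "new-test")
    # An empty result corresponds to the failure described in this report.
    print(phases or "no pod found for new-test")
```

An empty mapping for the `new-test` prefix while every other pod reports `Running` matches the state observed after the restart.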
This is the info gathered after the restart.
$ microk8s.kubectl -n kubeflow describe pod jupyter-controller-operator-0
Name: jupyter-controller-operator-0
Namespace: kubeflow
Priority: 0
Node: kf-test/10.128.0.14
Start Time: Mon, 28 Nov 2022 15:01:36 +0000
Labels: controller-revision-hash=jupyter-controller-operator-68fc77d85d
operator.juju.is/name=jupyter-controller
operator.juju.is/target=application
statefulset.kubernetes.io/pod-name=jupyter-controller-operator-0
Annotations: apparmor.security.beta.kubernetes.io/pod: runtime/default
cni.projectcalico.org/podIP: 10.1.211.219/32
cni.projectcalico.org/podIPs: 10.1.211.219/32
controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
juju.is/version: 2.9.34
model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
seccomp.security.beta.kubernetes.io/pod: docker/default
Status: Running
IP: 10.1.211.219
IPs:
IP: 10.1.211.219
Controlled By: StatefulSet/jupyter-controller-operator
Containers:
juju-operator:
Container ID: containerd://749cd9a3b114b2cb9d0142e3d9e6eafdbd7685b6f8853d2dd36e51d34f5ab09e
Image: jujusolutions/jujud-operator:2.9.34
Image ID: docker.io/jujusolutions/jujud-operator@sha256:3b46568ca590857dfa053ea84eea457a3389de34dd8775f0b32bfb2c0a55f700
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
export JUJU_DATA_DIR=/var/lib/juju
export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools
mkdir -p $JUJU_TOOLS_DIR
cp /opt/jujud $JUJU_TOOLS_DIR/jujud
$JUJU_TOOLS_DIR/jujud caasoperator --application-name=jupyter-controller --debug
State: Running
Started: Wed, 30 Nov 2022 19:11:18 +0000
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 28 Nov 2022 15:01:39 +0000
Finished: Wed, 30 Nov 2022 19:06:02 +0000
Ready: True
Restart Count: 1
Environment:
JUJU_APPLICATION: jupyter-controller
JUJU_OPERATOR_SERVICE_IP: 10.152.183.234
JUJU_OPERATOR_POD_IP: (v1:status.podIP)
JUJU_OPERATOR_NAMESPACE: kubeflow (v1:metadata.namespace)
Mounts:
/var/lib/juju/agents/application-jupyter-controller/operator.yaml from jupyter-controller-operator-config (rw,path="operator.yaml")
/var/lib/juju/agents/application-jupyter-controller/template-agent.conf from jupyter-controller-operator-config (rw,path="template-agent.conf")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q4kkf (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
jupyter-controller-operator-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: jupyter-controller-operator-config
Optional: false
kube-api-access-q4kkf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 36m (x3 over 40m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 35m kubelet Container image "jujusolutions/jujud-operator:2.9.34" already present on machine
Normal Created 35m kubelet Created container juju-operator
Normal Started 35m kubelet Started container juju-operator
$ microk8s.kubectl -n kubeflow describe pod jupyter-controller-5d4949ddd7-ng4fp
Name: jupyter-controller-5d4949ddd7-ng4fp
Namespace: kubeflow
Priority: 0
Node: kf-test/10.128.0.14
Start Time: Mon, 28 Nov 2022 15:02:28 +0000
Labels: app.kubernetes.io/name=jupyter-controller
pod-template-hash=5d4949ddd7
Annotations: apparmor.security.beta.kubernetes.io/pod: runtime/default
charm.juju.is/modified-version: 0
cni.projectcalico.org/podIP: 10.1.212.40/32
cni.projectcalico.org/podIPs: 10.1.212.40/32
controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
seccomp.security.beta.kubernetes.io/pod: docker/default
unit.juju.is/id: jupyter-controller/0
Status: Running
IP: 10.1.212.40
IPs:
IP: 10.1.212.40
Controlled By: ReplicaSet/jupyter-controller-5d4949ddd7
Init Containers:
juju-pod-init:
Container ID: containerd://fcb5a989f05f8a18a5acb2715e0575aa58f90e9f49a04d19c8963f20ea36555b
Image: jujusolutions/jujud-operator:2.9.34
Image ID: docker.io/jujusolutions/jujud-operator@sha256:3b46568ca590857dfa053ea84eea457a3389de34dd8775f0b32bfb2c0a55f700
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
export JUJU_DATA_DIR=/var/lib/juju
export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools
mkdir -p $JUJU_TOOLS_DIR
cp /opt/jujud $JUJU_TOOLS_DIR/jujud
initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
if test -n "$initCmd"; then
$JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
else
exit 0
fi
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 30 Nov 2022 19:11:51 +0000
Finished: Wed, 30 Nov 2022 19:16:27 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/lib/juju from juju-data-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dmpn2 (ro)
Containers:
jupyter-controller:
Container ID: containerd://570b2722a9f13f7be7f05b0fa6c6db5c944672551c49f19ae9674dfeafbc0771
Image: registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1
Image ID: registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1
Port: <none>
Host Port: <none>
Command:
./manager
State: Running
Started: Wed, 30 Nov 2022 19:16:34 +0000
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 28 Nov 2022 15:03:23 +0000
Finished: Wed, 30 Nov 2022 19:06:02 +0000
Ready: True
Restart Count: 1
Environment:
ENABLE_CULLING: true
ISTIO_GATEWAY: kubeflow/kubeflow-gateway
USE_ISTIO: true
Mounts:
/usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
/var/lib/juju from juju-data-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dmpn2 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
juju-data-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-dmpn2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 36m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2fe236223fa8f0dae56d0bb130386e980e4c70c9e9d68796b70809330df9597a": Get "https://[10.152.183.1]:443/apis/crd.projectcalico.org/v1/ipamconfigs/default": context deadline exceeded
Normal SandboxChanged 36m (x4 over 41m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 36m kubelet Container image "jujusolutions/jujud-operator:2.9.34" already present on machine
Normal Created 36m kubelet Created container juju-pod-init
Normal Started 36m kubelet Started container juju-pod-init
Normal Pulled 31m kubelet Container image "registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1" already present on machine
Normal Created 31m kubelet Created container jupyter-controller
Normal Started 31m kubelet Started container jupyter-controller
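The timestamps in the two outputs above give a rough picture of how long the controller took to come back after the restart. The sketch below computes two intervals from values copied from the `jupyter-controller` pod description; `seconds_between` is a hypothetical helper, and the numbers are an observation, not a diagnosis of the root cause.

```python
from email.utils import parsedate_to_datetime


def seconds_between(start, end):
    """Elapsed seconds between two timestamps as printed by `kubectl describe`."""
    delta = parsedate_to_datetime(end) - parsedate_to_datetime(start)
    return delta.total_seconds()


if __name__ == "__main__":
    # juju-pod-init init container: Started -> Finished.
    print(seconds_between("Wed, 30 Nov 2022 19:11:51 +0000",
                          "Wed, 30 Nov 2022 19:16:27 +0000"))  # 276.0
    # jupyter-controller: previous container Finished -> new container Started.
    print(seconds_between("Wed, 30 Nov 2022 19:06:02 +0000",
                          "Wed, 30 Nov 2022 19:16:34 +0000"))  # 632.0
```

So the init container alone ran for about four and a half minutes, and the controller container started roughly ten and a half minutes after its previous instance was recorded as terminated.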
Logs from the Jupyter pods are attached: jupyter-controller-pod.log, jupyter-controller-operator-pod.log
A newly created notebook server had access to the previous runs made in the notebook server that did not restart (`new-test`).
Leaving the VM in a stopped state overnight usually results in notebook servers that do not start up. Waiting 4-6 hours does not always produce the same behaviour.
Verified with Kubeflow 1.6 on K8s 1.24. After restarting the VM with the MicroK8s cluster, all services came up and a connection to the notebook server could be made. Details of the deployment:
microk8s status:
dns, ha-cluster, hostpath-storage, ingress, metallb, storage

juju status:
App | Version | Charm | Channel | Rev |
---|---|---|---|---|
admission-webhook | res:oci-image@129fe92 | admission-webhook | 1.6/stable | 60 |
argo-controller | res:oci-image@669ebd5 | argo-controller | 3.3/stable | 99 |
argo-server | res:oci-image@576d038 | argo-server | 3.3/stable | 45 |
dex-auth | | dex-auth | 2.31/stable | 129 |
istio-ingressgateway | | istio-gateway | 1.11/stable | 114 |
istio-pilot | | istio-pilot | 1.11/stable | 131 |
jupyter-controller | res:oci-image@e05857e | jupyter-controller | 1.6/stable | 163 |
jupyter-ui | res:oci-image@d55c600 | jupyter-ui | 1.6/stable | 124 |
katib-controller | res:oci-image@03d47fb | katib-controller | 0.14/stable | 92 |
katib-db | mariadb/server:10.3 | charmed-osm-mariadb-k8s | latest/stable | 35 |
katib-db-manager | res:oci-image@16b33a5 | katib-db-manager | 0.14/stable | 66 |
katib-ui | res:oci-image@c7dc04a | katib-ui | 0.14/stable | 90 |
kfp-api | res:oci-image@bf747d5 | kfp-api | 2.0/stable | 144 |
kfp-db | mariadb/server:10.3 | charmed-osm-mariadb-k8s | latest/stable | 35 |
kfp-persistence | res:oci-image@abcf971 | kfp-persistence | 2.0/stable | 141 |
kfp-profile-controller | res:oci-image@b4de878 | kfp-profile-controller | 2.0/stable | 125 |
kfp-schedwf | res:oci-image@9c9f710 | kfp-schedwf | 2.0/stable | 155 |
kfp-ui | res:oci-image@47864af | kfp-ui | 2.0/stable | 144 |
kfp-viewer | res:oci-image@94754c0 | kfp-viewer | 2.0/stable | 152 |
kfp-viz | res:oci-image@23ab9b9 | kfp-viz | 2.0/stable | 134 |
kubeflow-dashboard | res:oci-image@6fe6eec | kubeflow-dashboard | 1.6/stable | 183 |
kubeflow-profiles | res:profile-image@cfd6935 | kubeflow-profiles | 1.6/stable | 94 |
kubeflow-roles | | kubeflow-roles | 1.6/stable | 49 |
kubeflow-volumes | res:oci-image@fdb4a5d | kubeflow-volumes | 1.6/stable | 84 |
metacontroller-operator | | metacontroller-operator | 2.0/stable | 48 |
minio | res:oci-image@1755999 | minio | ckf-1.6/stable | 99 |
mlflow-db | mariadb/server:10.3 | charmed-osm-mariadb-k8s | stable | 35 |
mlflow-server | res:oci-image@bba33cd | mlflow-server | stable | 77 |
oidc-gatekeeper | res:oci-image@32de216 | oidc-gatekeeper | ckf-1.6/stable | 76 |
seldon-controller-manager | res:oci-image@eb811b6 | seldon-core | 1.14/stable | 92 |
tensorboard-controller | res:oci-image@51058f7 | tensorboard-controller | 1.6/stable | 69 |
tensorboards-web-app | res:oci-image@eef68a5 | tensorboards-web-app | 1.6/stable | 71 |
training-operator | | training-operator | 1.5/stable | 65 |
If the problem occurs again, a new issue will be opened. Closing.
Reproduce:
Expected: Notebook starts
Environment: OS - Ubuntu 20.04, microk8s - 1.22, Kubeflow - 1.6