Closed: Heegreis closed this issue 2 years ago
Hi @Heegreis,
This sounds like an issue with Redis not starting up - it's perfectly normal for the clearml-apiserver to keep trying to connect (returning an error when it gives up, assuming the pod will be restarted).
@valeriano-manassero do you have any idea how this can happen?
Hi, are you using KinD as Kubernetes cluster?
Can you please post the output of the following commands?
kubectl get po -A
kubectl -n clearml logs clearml-redis-master-0
Thank you.
I used kubeadm to create my cluster.
$ helm install clearml-server allegroai/clearml -n clearml --create-namespace
I changed the release name to clearml-server, as in the doc Kubernetes Using Helm | ClearML.
$ kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
clearml clearml-elastic-master-0 0/1 Pending 0 3m51s
clearml clearml-id-a04501281f0f444aa6f845d0f34f1783 0/1 Completed 0 4h5m
clearml clearml-server-agent-group-cpu-agent-765cf89496-5ssvt 1/1 Running 0 3m51s
clearml clearml-server-apiserver-7c7fb756ff-hzkp7 0/1 CrashLoopBackOff 5 (43s ago) 3m51s
clearml clearml-server-fileserver-d69698bf6-rwqsd 0/1 Pending 0 3m51s
clearml clearml-server-mongodb-86648c4756-bxfg9 0/1 Pending 0 3m51s
clearml clearml-server-redis-master-0 0/1 Pending 0 3m51s
clearml clearml-server-webserver-767cbb9b9d-tn6gx 1/1 Running 0 3m51s
kube-system calico-kube-controllers-75f8f6cc59-h49lx 1/1 Running 1 (8d ago) 33d
kube-system calico-node-c69zt 1/1 Running 1 (9h ago) 10h
kube-system calico-node-td9mr 1/1 Running 3 (34h ago) 33d
kube-system calico-node-vl475 1/1 Running 1 (8d ago) 33d
kube-system coredns-78fcd69978-9vnkl 1/1 Running 1 (8d ago) 33d
kube-system coredns-78fcd69978-f7gjp 1/1 Running 1 (8d ago) 33d
kube-system etcd-iris-k8s-master 1/1 Running 1 (8d ago) 33d
kube-system kube-apiserver-iris-k8s-master 1/1 Running 1 (8d ago) 33d
kube-system kube-controller-manager-iris-k8s-master 1/1 Running 1 (8d ago) 33d
kube-system kube-proxy-7r47z 1/1 Running 1 (9h ago) 10h
kube-system kube-proxy-twv8b 1/1 Running 3 (34h ago) 33d
kube-system kube-proxy-xthh8 1/1 Running 1 (8d ago) 33d
kube-system kube-scheduler-iris-k8s-master 1/1 Running 1 (8d ago) 33d
kube-system node-shell-04fc4f64-7ac2-41f7-b851-af26e20eb918 0/1 Completed 0 12h
kube-system node-shell-078404aa-7470-4fae-a099-5f90df3b64a9 0/1 Completed 0 28h
kube-system node-shell-0a7a3acc-38a0-4a66-8f21-43dad1c011ee 0/1 Completed 0 13h
kube-system node-shell-1dafd029-7a0a-4cd4-b41d-f7f5aea9edff 1/1 Running 0 8m2s
kube-system node-shell-2c2266d9-3f26-4216-b319-7641fcbc67e3 1/1 Running 0 6m30s
kube-system node-shell-40e8ac18-ee39-4529-aa88-0cac95b249d8 0/1 Completed 0 5h11m
kube-system node-shell-564a0a78-6f63-40f2-8105-c6fc047b1ce6 0/1 Completed 0 11h
kube-system node-shell-65222930-9d76-450e-a084-974960c0b26b 0/1 Completed 0 7h7m
kube-system node-shell-6692bdcd-43ab-45fe-bab3-1b1fad504bbe 0/1 Completed 0 13h
kube-system node-shell-7360be2e-60b6-4114-8845-275bf87f79ad 0/1 Completed 0 7h6m
kube-system node-shell-74a86d34-4cd9-46fd-932c-5b234f8dd0c7 0/1 Completed 0 6h21m
kube-system node-shell-8cac9960-3f60-4127-961a-24b884192d47 0/1 Completed 0 21h
kube-system node-shell-91153319-9746-47c3-9f51-f077722763a3 0/1 Completed 0 28h
kube-system node-shell-c32383cc-635b-42bd-8d9c-7ed60551715e 0/1 Completed 0 5h44m
kube-system node-shell-fa6ced43-58a4-425c-8fb8-7a3c1975b90a 0/1 Completed 0 6h20m
kube-system node-shell-fdb395be-a839-4bab-9c94-98d0568dc51c 0/1 Completed 0 21h
kube-system nvidia-device-plugin-jmzwt 1/1 Running 0 7h58m
lens-metrics kube-state-metrics-78596b555-ttdn6 1/1 Running 0 11h
lens-metrics node-exporter-hnp9s 1/1 Running 2 (34h ago) 33d
lens-metrics node-exporter-htghf 1/1 Running 56 (9m11s ago) 10h
lens-metrics node-exporter-k8tvc 1/1 Running 1 (8d ago) 33d
lens-metrics prometheus-0 1/1 Running 0 11h
$ kubectl -n clearml logs clearml-server-redis-master-0
$
The output is empty.
@Heegreis as the information box at the top of the Kubernetes Using Helm | ClearML page says, this documentation is being updated - instead, you should use the instructions available in this repository (https://github.com/allegroai/clearml-helm-charts)
@jkhenning I know. In fact, I mainly refer to the instructions in that repository, and the doc (Kubernetes Using Helm | ClearML) is used as an aid. Thanks for the reminder.
I see Redis, Elasticsearch, and MongoDB in Pending state; usually this is due to the resources available, but let's check.
Can you please post the results of the following commands?
kubectl -n clearml describe po clearml-server-redis-master-0
kubectl -n clearml describe po clearml-elastic-master-0
kubectl -n clearml describe po clearml-server-mongodb-86648c4756-bxfg9
Ty.
@valeriano-manassero Hi, below is the output.
$ kubectl -n clearml describe po clearml-server-redis-master-0
Name: clearml-server-redis-master-0
Namespace: clearml
Priority: 0
Node: <none>
Labels: app=redis
chart=redis-10.9.0
controller-revision-hash=clearml-server-redis-master-7b549db9bd
release=clearml-server
role=master
statefulset.kubernetes.io/pod-name=clearml-server-redis-master-0
Annotations: checksum/configmap: 8ae44a85458a357715b5a5ea9ec94c775b8d98f0bb8ee0ad50289a9b57338fb1
checksum/health: b30bd6fdb77ce2c8622ddcdc3263f0eb49f2190fae0b407b365ecf35eca603a7
checksum/secret: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/clearml-server-redis-master
Containers:
redis:
Image: docker.io/bitnami/redis:6.0.8-debian-10-r0
Port: 6379/TCP
Host Port: 0/TCP
Command:
/bin/bash
-c
if [[ -n $REDIS_PASSWORD_FILE ]]; then
password_aux=`cat ${REDIS_PASSWORD_FILE}`
export REDIS_PASSWORD=$password_aux
fi
if [[ ! -f /opt/bitnami/redis/etc/master.conf ]];then
cp /opt/bitnami/redis/mounted-etc/master.conf /opt/bitnami/redis/etc/master.conf
fi
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-27tx9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 116s (x6685 over 4d19h) default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
$ kubectl -n clearml describe po clearml-elastic-master-0
Name: clearml-elastic-master-0
Namespace: clearml
Priority: 0
Node: <none>
Labels: app=clearml-elastic-master
chart=elasticsearch
controller-revision-hash=clearml-elastic-master-f5d695b86
release=clearml-server
statefulset.kubernetes.io/pod-name=clearml-elastic-master-0
Annotations: configchecksum: 74bf3a32b86b711225b81f59050eb46d9c7e332399326f6fd4ee8627b4febfa
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/clearml-elastic-master
Init Containers:
configure-sysctl:
Image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
Port: <none>
Host Port: <none>
Command:
sysctl
-w
vm.max_map_count=262144
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-svrqd (ro)
Containers:
elasticsearch:
Image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
Limits:
cpu: 1
memory: 4Gi
Requests:
cpu: 1
memory: 4Gi
Readiness: exec [sh -c #!/usr/bin/env bash -e
# If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=yellow&timeout=1s" )
# Once it has started only check that the node itself is responding
START_FILE=/tmp/.es_start_file
# Disable nss cache to avoid filling dentry cache when calling curl
# This is required with Elasticsearch Docker using nss < 3.52
export NSS_SDB_USE_CACHE=no
http () {
local path="${1}"
local args="${2}"
set -- -XGET -s
if [ "$args" != "" ]; then
set -- "$@" $args
fi
if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
set -- "$@" -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
fi
curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
}
if [ -f "${START_FILE}" ]; then
echo 'Elasticsearch is already running, lets check the node is healthy'
HTTP_CODE=$(http "/" "-w %{http_code}")
RC=$?
if [[ ${RC} -ne 0 ]]; then
echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
exit ${RC}
fi
# ready if HTTP code 200, 503 is tolerable if ES version is 6.x
if [[ ${HTTP_CODE} == "200" ]]; then
exit 0
elif [[ ${HTTP_CODE} == "503" && "7" == "6" ]]; then
exit 0
else
echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
exit 1
fi
else
echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=yellow&timeout=1s" )'
if http "/_cluster/health?wait_for_status=yellow&timeout=1s" "--fail" ; then
touch ${START_FILE}
exit 0
else
echo 'Cluster is not yet ready (request params: "wait_for_status=yellow&timeout=1s" )'
exit 1
fi
fi
] delay=10s timeout=5s period=10s #success=3 #failure=3
Environment:
node.name: clearml-elastic-master-0 (v1:metadata.name)
cluster.initial_master_nodes: clearml-elastic-master-0,
discovery.seed_hosts: clearml-elastic-master-headless
cluster.name: clearml-elastic
network.host: 0.0.0.0
ES_JAVA_OPTS: -Xmx2g -Xms2g
node.data: true
node.ingest: true
node.master: true
node.remote_cluster_client: true
bootstrap.memory_lock: false
cluster.routing.allocation.node_initial_primaries_recoveries: 500
cluster.routing.allocation.disk.watermark.low: 500mb
cluster.routing.allocation.disk.watermark.high: 500mb
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
http.compression_level: 7
reindex.remote.whitelist: *.*
xpack.monitoring.enabled: false
xpack.security.enabled: false
Mounts:
/usr/share/elasticsearch/config/elasticsearch.yml from esconfig (rw,path="elasticsearch.yml")
/usr/share/elasticsearch/data from clearml-elastic-master (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-svrqd (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
clearml-elastic-master:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: clearml-elastic-master-clearml-elastic-master-0
ReadOnly: false
esconfig:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: clearml-elastic-master-config
Optional: false
kube-api-access-svrqd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m48s (x6685 over 4d19h) default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
$ kubectl -n clearml describe po clearml-server-mongodb-86648c4756-bxfg9
Name: clearml-server-mongodb-86648c4756-bxfg9
Namespace: clearml
Priority: 0
Node: <none>
Labels: app.kubernetes.io/component=mongodb
app.kubernetes.io/instance=clearml-server
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=mongodb
helm.sh/chart=mongodb-10.3.4
pod-template-hash=86648c4756
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/clearml-server-mongodb-86648c4756
Containers:
mongodb:
Image: docker.io/bitnami/mongodb:4.4.3-debian-10-r0
Port: 27017/TCP
Host Port: 0/TCP
Liveness: exec [mongo --eval db.adminCommand('ping')] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [mongo --eval db.adminCommand('ping')] delay=5s timeout=5s period=10s #success=1 #failure=6
Environment:
BITNAMI_DEBUG: false
ALLOW_EMPTY_PASSWORD: yes
MONGODB_SYSTEM_LOG_VERBOSITY: 0
MONGODB_DISABLE_SYSTEM_LOG: no
MONGODB_ENABLE_IPV6: no
MONGODB_ENABLE_DIRECTORY_PER_DB: no
Mounts:
/bitnami/mongodb from datadir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bhvbw (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
datadir:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: clearml-server-mongodb
ReadOnly: false
kube-api-access-bhvbw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 57s (x6694 over 4d19h) default-scheduler 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
The PVCs are not ready; I guess the issue is with the StorageClass used. By default, standard is set in the chart values, but you need to check that a dynamic storage provisioner is up and running in your cluster and set it as the default. In clusters created by KinD, everything is set up by default.
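As a quick check (a sketch, assuming kubectl access to the cluster; the class name "standard" below is only an example — substitute whatever `kubectl get storageclass` actually reports), you can list the PVCs and StorageClasses and, if a provisioner's class exists, mark it as the default:

```shell
# List the PVCs in the clearml namespace -- they will stay Pending
# until some StorageClass can provision their volumes
kubectl -n clearml get pvc

# List available StorageClasses; "(default)" marks the default class
kubectl get storageclass

# If a dynamic provisioner's class exists ("standard" is an example name),
# mark it as the cluster default so unbound PVCs can use it
kubectl patch storageclass standard \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```

If `kubectl get storageclass` returns nothing, there is no provisioner installed at all, and one has to be deployed first.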
Thanks for replying.
I roughly understand the cause, but I am new to k8s and I am currently stuck setting up the provisioner.
Because I am not using cloud services, I may also need to set up NFS or a local provisioner, etc.
I have successfully used docker-compose to build ClearML Server together with the clearml-agent k8s glue, so I will not set up ClearML Server through Helm for now.
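For a bare-metal kubeadm cluster like the one above, one common option is Rancher's local-path provisioner (a sketch, not the only way — check the project's README for the current manifest URL before applying, as the path below is an assumption based on the repository layout):

```shell
# Install the local-path dynamic provisioner
# (verify the manifest URL against the rancher/local-path-provisioner README)
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

# Make its StorageClass the cluster default so pending PVCs can bind
kubectl patch storageclass local-path \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'

# The Redis / Elastic / MongoDB PVCs should then get provisioned
kubectl -n clearml get pvc
```

NFS with an external provisioner works too; local-path is just the smallest setup for a single-purpose on-prem cluster.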
Did you ever figure out this error? I'm also facing the exact same issue with PersistentVolumeClaims. Not sure how to resolve this, as I'm not using KinD either.
I installed the ClearML Helm chart, version 3.0.6:
helm install clearml allegroai/clearml --create-namespace --namespace=clearml
But the clearml-apiserver status is CrashLoopBackOff.
Pod logs show:
clearml-redis-master:6379. Connection refused
clearml-elastic-master, clearml-fileserver, clearml-mongodb, and clearml-redis are all stuck in Pending.
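Based on the earlier posts in this thread, the CrashLoopBackOff is likely just a symptom: the apiserver cannot reach Redis because Redis (and the other stateful pods) never got their volumes. A quick way to confirm it is the same unbound-PVC problem (a sketch, assuming the clearml namespace and default pod names from the 3.x chart):

```shell
# If these PVCs are Pending, the root cause is storage provisioning,
# not the apiserver itself
kubectl -n clearml get pvc

# The events at the end of the output should show
# "pod has unbound immediate PersistentVolumeClaims"
kubectl -n clearml describe po clearml-redis-master-0 | tail -n 5
```

If that matches, the StorageClass / dynamic provisioner fix discussed above applies here as well.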