allegroai / clearml-helm-charts

Helm chart repository for the new unified way to deploy ClearML on Kubernetes. ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

apiserver error: clearml-redis-master:6379. Connection refused #34

Closed. Heegreis closed this issue 2 years ago.

Heegreis commented 2 years ago

I used chart version 3.0.6 to install ClearML with Helm:

helm install clearml allegroai/clearml --create-namespace --namespace=clearml

But the clearml-apiserver pod status is CrashLoopBackOff.
The pod logs show: clearml-redis-master:6379. Connection refused

clearml-elastic-master, clearml-fileserver, clearml-mongodb and clearml-redis are all stuck in Pending.

jkhenning commented 2 years ago

Hi @Heegreis,

This sounds like an issue with Redis not starting up - it's perfectly normal for the clearml-apiserver to keep trying to connect (it returns an error when it gives up, on the assumption that the pod will be restarted).

@valeriano-manassero do you have any idea how this can happen?

valeriano-manassero commented 2 years ago

Hi, are you using KinD as your Kubernetes cluster?

Can you please post the output of the following commands?

kubectl get po -A

kubectl -n clearml logs clearml-redis-master-0

Thank you.

Heegreis commented 2 years ago

I used kubeadm to create my cluster.

$ helm install clearml-server allegroai/clearml -n clearml --create-namespace

I changed the release name to clearml-server as in the doc Kubernetes Using Helm | ClearML.

$ kubectl get po -A
NAMESPACE      NAME                                                    READY   STATUS             RESTARTS         AGE
clearml        clearml-elastic-master-0                                0/1     Pending            0                3m51s
clearml        clearml-id-a04501281f0f444aa6f845d0f34f1783             0/1     Completed          0                4h5m
clearml        clearml-server-agent-group-cpu-agent-765cf89496-5ssvt   1/1     Running            0                3m51s
clearml        clearml-server-apiserver-7c7fb756ff-hzkp7               0/1     CrashLoopBackOff   5 (43s ago)      3m51s
clearml        clearml-server-fileserver-d69698bf6-rwqsd               0/1     Pending            0                3m51s
clearml        clearml-server-mongodb-86648c4756-bxfg9                 0/1     Pending            0                3m51s
clearml        clearml-server-redis-master-0                           0/1     Pending            0                3m51s
clearml        clearml-server-webserver-767cbb9b9d-tn6gx               1/1     Running            0                3m51s
kube-system    calico-kube-controllers-75f8f6cc59-h49lx                1/1     Running            1 (8d ago)       33d
kube-system    calico-node-c69zt                                       1/1     Running            1 (9h ago)       10h
kube-system    calico-node-td9mr                                       1/1     Running            3 (34h ago)      33d
kube-system    calico-node-vl475                                       1/1     Running            1 (8d ago)       33d
kube-system    coredns-78fcd69978-9vnkl                                1/1     Running            1 (8d ago)       33d
kube-system    coredns-78fcd69978-f7gjp                                1/1     Running            1 (8d ago)       33d
kube-system    etcd-iris-k8s-master                                    1/1     Running            1 (8d ago)       33d
kube-system    kube-apiserver-iris-k8s-master                          1/1     Running            1 (8d ago)       33d
kube-system    kube-controller-manager-iris-k8s-master                 1/1     Running            1 (8d ago)       33d
kube-system    kube-proxy-7r47z                                        1/1     Running            1 (9h ago)       10h
kube-system    kube-proxy-twv8b                                        1/1     Running            3 (34h ago)      33d
kube-system    kube-proxy-xthh8                                        1/1     Running            1 (8d ago)       33d
kube-system    kube-scheduler-iris-k8s-master                          1/1     Running            1 (8d ago)       33d
kube-system    node-shell-04fc4f64-7ac2-41f7-b851-af26e20eb918         0/1     Completed          0                12h
kube-system    node-shell-078404aa-7470-4fae-a099-5f90df3b64a9         0/1     Completed          0                28h
kube-system    node-shell-0a7a3acc-38a0-4a66-8f21-43dad1c011ee         0/1     Completed          0                13h
kube-system    node-shell-1dafd029-7a0a-4cd4-b41d-f7f5aea9edff         1/1     Running            0                8m2s
kube-system    node-shell-2c2266d9-3f26-4216-b319-7641fcbc67e3         1/1     Running            0                6m30s
kube-system    node-shell-40e8ac18-ee39-4529-aa88-0cac95b249d8         0/1     Completed          0                5h11m
kube-system    node-shell-564a0a78-6f63-40f2-8105-c6fc047b1ce6         0/1     Completed          0                11h
kube-system    node-shell-65222930-9d76-450e-a084-974960c0b26b         0/1     Completed          0                7h7m
kube-system    node-shell-6692bdcd-43ab-45fe-bab3-1b1fad504bbe         0/1     Completed          0                13h
kube-system    node-shell-7360be2e-60b6-4114-8845-275bf87f79ad         0/1     Completed          0                7h6m
kube-system    node-shell-74a86d34-4cd9-46fd-932c-5b234f8dd0c7         0/1     Completed          0                6h21m
kube-system    node-shell-8cac9960-3f60-4127-961a-24b884192d47         0/1     Completed          0                21h
kube-system    node-shell-91153319-9746-47c3-9f51-f077722763a3         0/1     Completed          0                28h
kube-system    node-shell-c32383cc-635b-42bd-8d9c-7ed60551715e         0/1     Completed          0                5h44m
kube-system    node-shell-fa6ced43-58a4-425c-8fb8-7a3c1975b90a         0/1     Completed          0                6h20m
kube-system    node-shell-fdb395be-a839-4bab-9c94-98d0568dc51c         0/1     Completed          0                21h
kube-system    nvidia-device-plugin-jmzwt                              1/1     Running            0                7h58m
lens-metrics   kube-state-metrics-78596b555-ttdn6                      1/1     Running            0                11h
lens-metrics   node-exporter-hnp9s                                     1/1     Running            2 (34h ago)      33d
lens-metrics   node-exporter-htghf                                     1/1     Running            56 (9m11s ago)   10h
lens-metrics   node-exporter-k8tvc                                     1/1     Running            1 (8d ago)       33d
lens-metrics   prometheus-0                                            1/1     Running            0                11h
$ kubectl -n clearml logs clearml-server-redis-master-0
$

The output is empty.

jkhenning commented 2 years ago

@Heegreis as the information box at the top of the Kubernetes Using Helm | ClearML page says, that documentation is being updated - instead, you should use the instructions available in this repository (https://github.com/allegroai/clearml-helm-charts)

Heegreis commented 2 years ago

@jkhenning I know. In fact, I mainly follow the instructions in this repository and use the doc (Kubernetes Using Helm | ClearML) as an aid. Thanks for the reminder.

valeriano-manassero commented 2 years ago

I see Redis, Elastic and MongoDB in Pending state; usually this is due to available resources, but let's check. Can you please post the results of the following commands?

kubectl -n clearml describe po clearml-server-redis-master-0
kubectl -n clearml describe po clearml-elastic-master-0
kubectl -n clearml describe po clearml-server-mongodb-86648c4756-bxfg9

Ty.

Heegreis commented 2 years ago

@valeriano-manassero Hi, below is the output.

$ kubectl -n clearml describe po clearml-server-redis-master-0
Name:           clearml-server-redis-master-0
Namespace:      clearml
Priority:       0
Node:           <none>
Labels:         app=redis
                chart=redis-10.9.0
                controller-revision-hash=clearml-server-redis-master-7b549db9bd
                release=clearml-server
                role=master
                statefulset.kubernetes.io/pod-name=clearml-server-redis-master-0
Annotations:    checksum/configmap: 8ae44a85458a357715b5a5ea9ec94c775b8d98f0bb8ee0ad50289a9b57338fb1
                checksum/health: b30bd6fdb77ce2c8622ddcdc3263f0eb49f2190fae0b407b365ecf35eca603a7
                checksum/secret: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/clearml-server-redis-master
Containers:
  redis:
    Image:      docker.io/bitnami/redis:6.0.8-debian-10-r0
    Port:       6379/TCP
    Host Port:  0/TCP
    Command:
      /bin/bash
      -c
      if [[ -n $REDIS_PASSWORD_FILE ]]; then
        password_aux=`cat ${REDIS_PASSWORD_FILE}`
        export REDIS_PASSWORD=$password_aux
      fi
      if [[ ! -f /opt/bitnami/redis/etc/master.conf ]];then
        cp /opt/bitnami/redis/mounted-etc/master.conf /opt/bitnami/redis/etc/master.conf
      fi
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-27tx9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                      From               Message
  ----     ------            ----                     ----               -------
  Warning  FailedScheduling  116s (x6685 over 4d19h)  default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
$ kubectl -n clearml describe po clearml-elastic-master-0
Name:           clearml-elastic-master-0
Namespace:      clearml
Priority:       0
Node:           <none>
Labels:         app=clearml-elastic-master
                chart=elasticsearch
                controller-revision-hash=clearml-elastic-master-f5d695b86
                release=clearml-server
                statefulset.kubernetes.io/pod-name=clearml-elastic-master-0
Annotations:    configchecksum: 74bf3a32b86b711225b81f59050eb46d9c7e332399326f6fd4ee8627b4febfa
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/clearml-elastic-master
Init Containers:
  configure-sysctl:
    Image:      docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    Port:       <none>
    Host Port:  <none>
    Command:
      sysctl
      -w
      vm.max_map_count=262144
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-svrqd (ro)
Containers:
  elasticsearch:
    Image:       docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    Ports:       9200/TCP, 9300/TCP
    Host Ports:  0/TCP, 0/TCP
    Limits:
      cpu:     1
      memory:  4Gi
    Requests:
      cpu:      1
      memory:   4Gi
    Readiness:  exec [sh -c #!/usr/bin/env bash -e
# If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=yellow&timeout=1s" )
# Once it has started only check that the node itself is responding
START_FILE=/tmp/.es_start_file

# Disable nss cache to avoid filling dentry cache when calling curl
# This is required with Elasticsearch Docker using nss < 3.52
export NSS_SDB_USE_CACHE=no

http () {
  local path="${1}"
  local args="${2}"
  set -- -XGET -s

  if [ "$args" != "" ]; then
    set -- "$@" $args
  fi

  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
    set -- "$@" -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
  fi

  curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
}

if [ -f "${START_FILE}" ]; then
  echo 'Elasticsearch is already running, lets check the node is healthy'
  HTTP_CODE=$(http "/" "-w %{http_code}")
  RC=$?
  if [[ ${RC} -ne 0 ]]; then
    echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
    exit ${RC}
  fi
  # ready if HTTP code 200, 503 is tolerable if ES version is 6.x
  if [[ ${HTTP_CODE} == "200" ]]; then
    exit 0
  elif [[ ${HTTP_CODE} == "503" && "7" == "6" ]]; then
    exit 0
  else
    echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
    exit 1
  fi

else
  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=yellow&timeout=1s" )'
  if http "/_cluster/health?wait_for_status=yellow&timeout=1s" "--fail" ; then
    touch ${START_FILE}
    exit 0
  else
    echo 'Cluster is not yet ready (request params: "wait_for_status=yellow&timeout=1s" )'
    exit 1
  fi
fi
] delay=10s timeout=5s period=10s #success=3 #failure=3
    Environment:
      node.name:                                                     clearml-elastic-master-0 (v1:metadata.name)
      cluster.initial_master_nodes:                                  clearml-elastic-master-0,
      discovery.seed_hosts:                                          clearml-elastic-master-headless
      cluster.name:                                                  clearml-elastic
      network.host:                                                  0.0.0.0
      ES_JAVA_OPTS:                                                  -Xmx2g -Xms2g
      node.data:                                                     true
      node.ingest:                                                   true
      node.master:                                                   true
      node.remote_cluster_client:                                    true
      bootstrap.memory_lock:                                         false
      cluster.routing.allocation.node_initial_primaries_recoveries:  500
      cluster.routing.allocation.disk.watermark.low:                 500mb
      cluster.routing.allocation.disk.watermark.high:                500mb
      cluster.routing.allocation.disk.watermark.flood_stage:         500mb
      http.compression_level:                                        7
      reindex.remote.whitelist:                                      *.*
      xpack.monitoring.enabled:                                      false
      xpack.security.enabled:                                        false
    Mounts:
      /usr/share/elasticsearch/config/elasticsearch.yml from esconfig (rw,path="elasticsearch.yml")
      /usr/share/elasticsearch/data from clearml-elastic-master (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-svrqd (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  clearml-elastic-master:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clearml-elastic-master-clearml-elastic-master-0
    ReadOnly:   false
  esconfig:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      clearml-elastic-master-config
    Optional:  false
  kube-api-access-svrqd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                       From               Message
  ----     ------            ----                      ----               -------
  Warning  FailedScheduling  4m48s (x6685 over 4d19h)  default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
$ kubectl -n clearml describe po clearml-server-mongodb-86648c4756-bxfg9
Name:           clearml-server-mongodb-86648c4756-bxfg9
Namespace:      clearml
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/component=mongodb
                app.kubernetes.io/instance=clearml-server
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=mongodb
                helm.sh/chart=mongodb-10.3.4
                pod-template-hash=86648c4756
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/clearml-server-mongodb-86648c4756
Containers:
  mongodb:
    Image:      docker.io/bitnami/mongodb:4.4.3-debian-10-r0
    Port:       27017/TCP
    Host Port:  0/TCP
    Liveness:   exec [mongo --eval db.adminCommand('ping')] delay=30s timeout=5s period=10s #success=1 #failure=6
    Readiness:  exec [mongo --eval db.adminCommand('ping')] delay=5s timeout=5s period=10s #success=1 #failure=6
    Environment:
      BITNAMI_DEBUG:                    false
      ALLOW_EMPTY_PASSWORD:             yes
      MONGODB_SYSTEM_LOG_VERBOSITY:     0
      MONGODB_DISABLE_SYSTEM_LOG:       no
      MONGODB_ENABLE_IPV6:              no
      MONGODB_ENABLE_DIRECTORY_PER_DB:  no
    Mounts:
      /bitnami/mongodb from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bhvbw (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clearml-server-mongodb
    ReadOnly:   false
  kube-api-access-bhvbw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  57s (x6694 over 4d19h)  default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
valeriano-manassero commented 2 years ago
0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.

The PVCs are not ready; I guess the issue is with the StorageClass being used. By default, standard is set in the chart values, but you need to check that a dynamic storage provisioner is up and running in your cluster and that it is set as the default. On clusters created with KinD, all of this is set up by default.
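For example, checking the current PVC and StorageClass state and, as one option for a bare-metal cluster (assuming hostPath-backed storage from the Rancher local-path provisioner is acceptable), installing a dynamic provisioner and marking it as the default could look roughly like this:

# check whether the PVCs are bound and whether a default StorageClass exists
kubectl -n clearml get pvc
kubectl get storageclass

# install the Rancher local-path provisioner (hostPath-backed dynamic storage)
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

# mark its StorageClass as the cluster default
kubectl patch storageclass local-path -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'

Note that if the chart values pin an explicit StorageClass name (for example standard), that class must actually exist, or the persistence storageClass values must be overridden to match the provisioner you install. Once the PVCs bind, the Pending pods should schedule, and the apiserver should stop crash-looping as soon as Redis is up.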

Heegreis commented 2 years ago

Thanks for replying.

I roughly understand the cause, but I am new to k8s and I am currently stuck on setting up the provisioner.
Because I am not using a cloud service, I may also need to set up NFS or a local provisioner, etc.

I have successfully set up ClearML Server with docker-compose together with the clearml-agent k8s glue, so I will not set up ClearML Server through Helm for now.

amirhmk commented 1 year ago

Did you ever figure out this error? I'm also facing the exact same issue with PersistentVolumeClaims. Not sure how to resolve this as I'm not using kind either.
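For what it's worth, the root cause described above applies here too: without a dynamic storage provisioner and a usable StorageClass, the chart's PVCs stay unbound and Redis, MongoDB and Elasticsearch never schedule. On bare metal, an NFS-backed provisioner is one common option; a minimal sketch using the nfs-subdir-external-provisioner chart, assuming an existing NFS export (the server address and path below are placeholders):

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=192.168.1.100 \
  --set nfs.path=/exported/clearml \
  --set storageClass.defaultClass=true

With the resulting StorageClass set as the default (or referenced from the chart's persistence values), the ClearML PVCs should bind and the Pending pods should schedule.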