allegroai / clearml-helm-charts

Helm chart repository for the new unified way to deploy ClearML on Kubernetes. ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

apiserver error: clearml-redis-master:6379. Connection refused #34

Closed. Heegreis closed this issue 2 years ago.

Heegreis commented 2 years ago

I used chart version 3.0.6 to install ClearML with Helm:

helm install clearml allegroai/clearml --create-namespace --namespace=clearml

But the clearml-apiserver pod status is CrashLoopBackOff.
The pod logs show: clearml-redis-master:6379. Connection refused

clearml-elastic-master, clearml-fileserver, clearml-mongodb and clearml-redis are all stuck in Pending.

jkhenning commented 2 years ago

Hi @Heegreis,

This sounds like an issue with Redis not starting up - it's perfectly normal for the clearml-apiserver to keep trying to connect (it returns an error when it gives up, on the assumption that the pod will be restarted).

@valeriano-manassero do you have any idea how this can happen?

valeriano-manassero commented 2 years ago

Hi, are you using KinD as your Kubernetes cluster?

Can you please post the output of the following commands?

kubectl get po -A

kubectl -n clearml logs clearml-redis-master-0

Thank you.

Heegreis commented 2 years ago

I used kubeadm to create my cluster.

$ helm install clearml-server allegroai/clearml -n clearml --create-namespace

I changed the release name to clearml-server as in the doc Kubernetes Using Helm | ClearML.

$ kubectl get po -A
NAMESPACE      NAME                                                    READY   STATUS             RESTARTS         AGE
clearml        clearml-elastic-master-0                                0/1     Pending            0                3m51s
clearml        clearml-id-a04501281f0f444aa6f845d0f34f1783             0/1     Completed          0                4h5m
clearml        clearml-server-agent-group-cpu-agent-765cf89496-5ssvt   1/1     Running            0                3m51s
clearml        clearml-server-apiserver-7c7fb756ff-hzkp7               0/1     CrashLoopBackOff   5 (43s ago)      3m51s
clearml        clearml-server-fileserver-d69698bf6-rwqsd               0/1     Pending            0                3m51s
clearml        clearml-server-mongodb-86648c4756-bxfg9                 0/1     Pending            0                3m51s
clearml        clearml-server-redis-master-0                           0/1     Pending            0                3m51s
clearml        clearml-server-webserver-767cbb9b9d-tn6gx               1/1     Running            0                3m51s
kube-system    calico-kube-controllers-75f8f6cc59-h49lx                1/1     Running            1 (8d ago)       33d
kube-system    calico-node-c69zt                                       1/1     Running            1 (9h ago)       10h
kube-system    calico-node-td9mr                                       1/1     Running            3 (34h ago)      33d
kube-system    calico-node-vl475                                       1/1     Running            1 (8d ago)       33d
kube-system    coredns-78fcd69978-9vnkl                                1/1     Running            1 (8d ago)       33d
kube-system    coredns-78fcd69978-f7gjp                                1/1     Running            1 (8d ago)       33d
kube-system    etcd-iris-k8s-master                                    1/1     Running            1 (8d ago)       33d
kube-system    kube-apiserver-iris-k8s-master                          1/1     Running            1 (8d ago)       33d
kube-system    kube-controller-manager-iris-k8s-master                 1/1     Running            1 (8d ago)       33d
kube-system    kube-proxy-7r47z                                        1/1     Running            1 (9h ago)       10h
kube-system    kube-proxy-twv8b                                        1/1     Running            3 (34h ago)      33d
kube-system    kube-proxy-xthh8                                        1/1     Running            1 (8d ago)       33d
kube-system    kube-scheduler-iris-k8s-master                          1/1     Running            1 (8d ago)       33d
kube-system    node-shell-04fc4f64-7ac2-41f7-b851-af26e20eb918         0/1     Completed          0                12h
kube-system    node-shell-078404aa-7470-4fae-a099-5f90df3b64a9         0/1     Completed          0                28h
kube-system    node-shell-0a7a3acc-38a0-4a66-8f21-43dad1c011ee         0/1     Completed          0                13h
kube-system    node-shell-1dafd029-7a0a-4cd4-b41d-f7f5aea9edff         1/1     Running            0                8m2s
kube-system    node-shell-2c2266d9-3f26-4216-b319-7641fcbc67e3         1/1     Running            0                6m30s
kube-system    node-shell-40e8ac18-ee39-4529-aa88-0cac95b249d8         0/1     Completed          0                5h11m
kube-system    node-shell-564a0a78-6f63-40f2-8105-c6fc047b1ce6         0/1     Completed          0                11h
kube-system    node-shell-65222930-9d76-450e-a084-974960c0b26b         0/1     Completed          0                7h7m
kube-system    node-shell-6692bdcd-43ab-45fe-bab3-1b1fad504bbe         0/1     Completed          0                13h
kube-system    node-shell-7360be2e-60b6-4114-8845-275bf87f79ad         0/1     Completed          0                7h6m
kube-system    node-shell-74a86d34-4cd9-46fd-932c-5b234f8dd0c7         0/1     Completed          0                6h21m
kube-system    node-shell-8cac9960-3f60-4127-961a-24b884192d47         0/1     Completed          0                21h
kube-system    node-shell-91153319-9746-47c3-9f51-f077722763a3         0/1     Completed          0                28h
kube-system    node-shell-c32383cc-635b-42bd-8d9c-7ed60551715e         0/1     Completed          0                5h44m
kube-system    node-shell-fa6ced43-58a4-425c-8fb8-7a3c1975b90a         0/1     Completed          0                6h20m
kube-system    node-shell-fdb395be-a839-4bab-9c94-98d0568dc51c         0/1     Completed          0                21h
kube-system    nvidia-device-plugin-jmzwt                              1/1     Running            0                7h58m
lens-metrics   kube-state-metrics-78596b555-ttdn6                      1/1     Running            0                11h
lens-metrics   node-exporter-hnp9s                                     1/1     Running            2 (34h ago)      33d
lens-metrics   node-exporter-htghf                                     1/1     Running            56 (9m11s ago)   10h
lens-metrics   node-exporter-k8tvc                                     1/1     Running            1 (8d ago)       33d
lens-metrics   prometheus-0                                            1/1     Running            0                11h
$ kubectl -n clearml logs clearml-server-redis-master-0
$

The output is empty.

jkhenning commented 2 years ago

@Heegreis as the information box at the top of the Kubernetes Using Helm | ClearML page says, that documentation is being updated - instead, you should use the instructions available in this repository (https://github.com/allegroai/clearml-helm-charts)

Heegreis commented 2 years ago

@jkhenning I know. In fact, I mainly follow the instructions in this repository and use the doc (Kubernetes Using Helm | ClearML) as an aid. Thanks for the reminder.

valeriano-manassero commented 2 years ago

I see Redis, Elastic and MongoDB in Pending state; usually this is due to available resources, but let's check. Can you please post the results of the following commands?

kubectl -n clearml describe po clearml-server-redis-master-0
kubectl -n clearml describe po clearml-elastic-master-0
kubectl -n clearml describe po clearml-server-mongodb-86648c4756-bxfg9

Ty.

Heegreis commented 2 years ago

@valeriano-manassero Hi, below is the output.

$ kubectl -n clearml describe po clearml-server-redis-master-0
Name:           clearml-server-redis-master-0
Namespace:      clearml
Priority:       0
Node:           <none>
Labels:         app=redis
                chart=redis-10.9.0
                controller-revision-hash=clearml-server-redis-master-7b549db9bd
                release=clearml-server
                role=master
                statefulset.kubernetes.io/pod-name=clearml-server-redis-master-0
Annotations:    checksum/configmap: 8ae44a85458a357715b5a5ea9ec94c775b8d98f0bb8ee0ad50289a9b57338fb1
                checksum/health: b30bd6fdb77ce2c8622ddcdc3263f0eb49f2190fae0b407b365ecf35eca603a7
                checksum/secret: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/clearml-server-redis-master
Containers:
  redis:
    Image:      docker.io/bitnami/redis:6.0.8-debian-10-r0
    Port:       6379/TCP
    Host Port:  0/TCP
    Command:
      /bin/bash
      -c
      if [[ -n $REDIS_PASSWORD_FILE ]]; then
        password_aux=`cat ${REDIS_PASSWORD_FILE}`
        export REDIS_PASSWORD=$password_aux
      fi
      if [[ ! -f /opt/bitnami/redis/etc/master.conf ]];then
        cp /opt/bitnami/redis/mounted-etc/master.conf /opt/bitnami/redis/etc/master.conf
      fi
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-27tx9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                      From               Message
  ----     ------            ----                     ----               -------
  Warning  FailedScheduling  116s (x6685 over 4d19h)  default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
$ kubectl -n clearml describe po clearml-elastic-master-0
Name:           clearml-elastic-master-0
Namespace:      clearml
Priority:       0
Node:           <none>
Labels:         app=clearml-elastic-master
                chart=elasticsearch
                controller-revision-hash=clearml-elastic-master-f5d695b86
                release=clearml-server
                statefulset.kubernetes.io/pod-name=clearml-elastic-master-0
Annotations:    configchecksum: 74bf3a32b86b711225b81f59050eb46d9c7e332399326f6fd4ee8627b4febfa
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/clearml-elastic-master
Init Containers:
  configure-sysctl:
    Image:      docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    Port:       <none>
    Host Port:  <none>
    Command:
      sysctl
      -w
      vm.max_map_count=262144
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-svrqd (ro)
Containers:
  elasticsearch:
    Image:       docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    Ports:       9200/TCP, 9300/TCP
    Host Ports:  0/TCP, 0/TCP
    Limits:
      cpu:     1
      memory:  4Gi
    Requests:
      cpu:      1
      memory:   4Gi
    Readiness:  exec [sh -c #!/usr/bin/env bash -e
# If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=yellow&timeout=1s" )
# Once it has started only check that the node itself is responding
START_FILE=/tmp/.es_start_file

# Disable nss cache to avoid filling dentry cache when calling curl
# This is required with Elasticsearch Docker using nss < 3.52
export NSS_SDB_USE_CACHE=no

http () {
  local path="${1}"
  local args="${2}"
  set -- -XGET -s

  if [ "$args" != "" ]; then
    set -- "$@" $args
  fi

  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
    set -- "$@" -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
  fi

  curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
}

if [ -f "${START_FILE}" ]; then
  echo 'Elasticsearch is already running, lets check the node is healthy'
  HTTP_CODE=$(http "/" "-w %{http_code}")
  RC=$?
  if [[ ${RC} -ne 0 ]]; then
    echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
    exit ${RC}
  fi
  # ready if HTTP code 200, 503 is tolerable if ES version is 6.x
  if [[ ${HTTP_CODE} == "200" ]]; then
    exit 0
  elif [[ ${HTTP_CODE} == "503" && "7" == "6" ]]; then
    exit 0
  else
    echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
    exit 1
  fi

else
  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=yellow&timeout=1s" )'
  if http "/_cluster/health?wait_for_status=yellow&timeout=1s" "--fail" ; then
    touch ${START_FILE}
    exit 0
  else
    echo 'Cluster is not yet ready (request params: "wait_for_status=yellow&timeout=1s" )'
    exit 1
  fi
fi
] delay=10s timeout=5s period=10s #success=3 #failure=3
    Environment:
      node.name:                                                     clearml-elastic-master-0 (v1:metadata.name)
      cluster.initial_master_nodes:                                  clearml-elastic-master-0,
      discovery.seed_hosts:                                          clearml-elastic-master-headless
      cluster.name:                                                  clearml-elastic
      network.host:                                                  0.0.0.0
      ES_JAVA_OPTS:                                                  -Xmx2g -Xms2g
      node.data:                                                     true
      node.ingest:                                                   true
      node.master:                                                   true
      node.remote_cluster_client:                                    true
      bootstrap.memory_lock:                                         false
      cluster.routing.allocation.node_initial_primaries_recoveries:  500
      cluster.routing.allocation.disk.watermark.low:                 500mb
      cluster.routing.allocation.disk.watermark.high:                500mb
      cluster.routing.allocation.disk.watermark.flood_stage:         500mb
      http.compression_level:                                        7
      reindex.remote.whitelist:                                      *.*
      xpack.monitoring.enabled:                                      false
      xpack.security.enabled:                                        false
    Mounts:
      /usr/share/elasticsearch/config/elasticsearch.yml from esconfig (rw,path="elasticsearch.yml")
      /usr/share/elasticsearch/data from clearml-elastic-master (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-svrqd (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  clearml-elastic-master:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clearml-elastic-master-clearml-elastic-master-0
    ReadOnly:   false
  esconfig:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      clearml-elastic-master-config
    Optional:  false
  kube-api-access-svrqd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                       From               Message
  ----     ------            ----                      ----               -------
  Warning  FailedScheduling  4m48s (x6685 over 4d19h)  default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
$ kubectl -n clearml describe po clearml-server-mongodb-86648c4756-bxfg9
Name:           clearml-server-mongodb-86648c4756-bxfg9
Namespace:      clearml
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/component=mongodb
                app.kubernetes.io/instance=clearml-server
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=mongodb
                helm.sh/chart=mongodb-10.3.4
                pod-template-hash=86648c4756
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/clearml-server-mongodb-86648c4756
Containers:
  mongodb:
    Image:      docker.io/bitnami/mongodb:4.4.3-debian-10-r0
    Port:       27017/TCP
    Host Port:  0/TCP
    Liveness:   exec [mongo --eval db.adminCommand('ping')] delay=30s timeout=5s period=10s #success=1 #failure=6
    Readiness:  exec [mongo --eval db.adminCommand('ping')] delay=5s timeout=5s period=10s #success=1 #failure=6
    Environment:
      BITNAMI_DEBUG:                    false
      ALLOW_EMPTY_PASSWORD:             yes
      MONGODB_SYSTEM_LOG_VERBOSITY:     0
      MONGODB_DISABLE_SYSTEM_LOG:       no
      MONGODB_ENABLE_IPV6:              no
      MONGODB_ENABLE_DIRECTORY_PER_DB:  no
    Mounts:
      /bitnami/mongodb from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bhvbw (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clearml-server-mongodb
    ReadOnly:   false
  kube-api-access-bhvbw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  57s (x6694 over 4d19h)  default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
valeriano-manassero commented 2 years ago
0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.

The PVCs are not ready; I guess the issue is with the StorageClass being used. By default, standard is set in the chart values, but you need to check that a dynamic storage provisioner is up and running in your cluster and that it is set as the default. On clusters created with KinD, all of this is set up by default.
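For example, checking the current PVC and StorageClass state and, as one option for a bare-metal cluster (assuming hostPath-backed storage from the Rancher local-path provisioner is acceptable), installing a dynamic provisioner and marking it as the default could look roughly like this:

# check whether the PVCs are bound and whether a default StorageClass exists
kubectl -n clearml get pvc
kubectl get storageclass

# install the Rancher local-path provisioner (hostPath-backed dynamic storage)
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

# mark its StorageClass as the cluster default
kubectl patch storageclass local-path -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'

Note that if the chart values pin an explicit StorageClass name (for example standard), that class must actually exist, or the persistence storageClass values must be overridden to match the provisioner you install. Once the PVCs bind, the Pending pods should schedule, and the apiserver should stop crash-looping as soon as Redis is up.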

Heegreis commented 2 years ago

Thanks for replying.

I roughly understand the cause, but I am new to k8s and I am currently stuck on setting up the provisioner.
Because I am not using a cloud service, I may also need to set up NFS or a local provisioner, etc.

I have successfully set up ClearML Server with docker-compose together with the clearml-agent k8s glue, so I will not set up ClearML Server through Helm for now.

amirhmk commented 1 year ago

Did you ever figure out this error? I'm also facing the exact same issue with PersistentVolumeClaims. Not sure how to resolve this as I'm not using kind either.
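For what it's worth, the root cause described above applies here too: without a dynamic storage provisioner and a usable StorageClass, the chart's PVCs stay unbound and Redis, MongoDB and Elasticsearch never schedule. On bare metal, an NFS-backed provisioner is one common option; a minimal sketch using the nfs-subdir-external-provisioner chart, assuming an existing NFS export (the server address and path below are placeholders):

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=192.168.1.100 \
  --set nfs.path=/exported/clearml \
  --set storageClass.defaultClass=true

With the resulting StorageClass set as the default (or referenced from the chart's persistence values), the ClearML PVCs should bind and the Pending pods should schedule.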