ChrisHirsch / kuberhealthy-storage-check

A storage check for the kuberhealthy project
Apache License 2.0


Please see the parent project (needed to run this check) at Kuberhealthy

Storage Check

This check tests whether a persistent volume claim (PVC) can be created and used within your Kubernetes cluster. It attempts to create a PVC using either the default storage class (SC) or a user-specified one. Once the PVC is created, the check initializes the storage with a known value, discovers the nodes in the cluster, and attempts to use that PVC on each node. If the check can create the PVC, initialize it, and then mount it and verify its contents on each discovered (or explicitly allowed, and not ignored) node, the check succeeds and you can be confident that storage can be mounted on the nodes that are allowed to be scheduled.

Once the contents of the PVC have been validated, the check Job, the init Job and the PVC will be cleaned up and the check will be marked as successful.

Container resource requests are set to 15 millicores of CPU and 20Mi of memory. The Jobs use the Alpine image alpine:3.11, and the PVC defaults to a size of 1Gi. If the environment variable CHECK_STORAGE_PVC_SIZE is set, its value is used instead of the default size.

By default, the nodes of the cluster are discovered, and only nodes that are untainted (or whose taints are all specified in CHECK_TOLERATIONS), in a Ready state, and not in the master role are used. If one or more nodes need to be ignored, set the environment variable CHECK_STORAGE_IGNORED_CHECK_NODES to a space- or comma-separated list of node names. If auto-discovery is not desired, set the environment variable CHECK_STORAGE_ALLOWED_CHECK_NODES to a space- or comma-separated list of the nodes that should be checked. If a node appears in both the allowed (CHECK_STORAGE_ALLOWED_CHECK_NODES) list and the ignored (CHECK_STORAGE_IGNORED_CHECK_NODES) list, that node is ignored.
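The node selection rules above can be sketched as follows. This is a minimal illustration in Python, not the check's actual implementation: the Node class, the function names, and the representation of taints and tolerations as plain strings are all hypothetical simplifications.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified view of a cluster node."""
    name: str
    ready: bool = True
    is_master: bool = False
    taints: set = field(default_factory=set)  # taints reduced to plain strings


def parse_node_list(value):
    """Split a space- or comma-separated string into node names."""
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]


def select_nodes(nodes, allowed="", ignored="", tolerations=frozenset()):
    """Return the names of the nodes that should run the storage check."""
    ignored_names = set(parse_node_list(ignored))
    allowed_names = parse_node_list(allowed)
    if allowed_names:
        # An explicit allowed list replaces auto-discovery.
        candidates = allowed_names
    else:
        # Auto-discovery: Ready, not a master, and every taint is tolerated.
        candidates = [
            node.name
            for node in nodes
            if node.ready and not node.is_master and node.taints <= set(tolerations)
        ]
    # The ignored list always wins, even over the allowed list.
    return [name for name in candidates if name not in ignored_names]
```

Calling select_nodes(nodes, allowed="node1, node4", ignored="node4") would return only node1, because a node in both lists is ignored.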

By default, the storage check Job and initialize storage check Job will use Alpine's alpine:3.11 image. If a different image is desired, use the environment variable CHECK_STORAGE_IMAGE or CHECK_STORAGE_INIT_IMAGE depending on which image should be changed.
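As a rough illustration of how these defaults and overrides interact, the sketch below resolves the PVC size and the two images from the environment. The variable names and default values come from the text above; the function name, the dictionary layout, and the choice to treat an empty value as unset are assumptions made for this example.

```python
import os

# Defaults as described in the text; the dictionary layout is hypothetical.
DEFAULTS = {
    "CHECK_STORAGE_PVC_SIZE": "1Gi",
    "CHECK_STORAGE_IMAGE": "alpine:3.11",
    "CHECK_STORAGE_INIT_IMAGE": "alpine:3.11",
}


def resolve_settings(environ=None):
    """Return each setting, preferring a non-empty environment value.

    Assumption: an empty string is treated the same as an unset variable.
    """
    env = os.environ if environ is None else environ
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    # Override only the PVC size; both images keep their defaults.
    print(resolve_settings({"CHECK_STORAGE_PVC_SIZE": "5Gi"}))
```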

Initializing the storage is pretty simple: a file with the contents storage-check-ok is created at /data/index.html. There is no particular reason it is called index.html, except perhaps to allow for additional checks in the future. If the storage initialization should be done differently, or needs to be more complex, a completely different image can be used as described above (CHECK_STORAGE_INIT_IMAGE). To override the arguments used to create the known data, set the environment variable CHECK_STORAGE_INIT_COMMAND_ARGS, which defaults to echo storage-check-ok > /data/index.html && ls -la /data && cat /data/index.html.

Checking the storage is also pretty simple (there is a theme here). The check mounts the PVC at /data, cats the /data/index.html file, and pipes the output to grep, looking for the contents storage-check-ok. If grep finds a match, the exit code is 0 and the check passes for that particular node: the Pod on the node could mount the storage, see the previously created file, AND see the known contents of the file. If a more advanced check is desired, the entire image can be changed with the environment variable CHECK_STORAGE_IMAGE. To change only the command line arguments, set CHECK_STORAGE_COMMAND_ARGS, which defaults to ls -la /data && cat /data/index.html && cat /data/index.html | grep storage-check-ok.
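The init and check logic can be imitated locally without a cluster. The sketch below is only an approximation in Python of what the default shell commands do, using a temporary directory in place of the PVC mounted at /data; the function names are hypothetical.

```python
import tempfile
from pathlib import Path

KNOWN_VALUE = "storage-check-ok"


def init_storage(data_dir):
    """Imitate: echo storage-check-ok > /data/index.html"""
    (Path(data_dir) / "index.html").write_text(KNOWN_VALUE + "\n")


def check_storage(data_dir):
    """Imitate: cat /data/index.html | grep storage-check-ok

    Returns 0 when the known value is found and 1 otherwise,
    mirroring the exit code the check relies on.
    """
    index = Path(data_dir) / "index.html"
    if not index.is_file():
        return 1
    return 0 if KNOWN_VALUE in index.read_text() else 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        init_storage(data_dir)
        print("exit code:", check_storage(data_dir))
```

A directory that was never initialized, or whose index.html holds different contents, yields a non-zero result, which corresponds to a failed check on that node.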

Custom images can be used for this check and can be specified with the CHECK_STORAGE_IMAGE and CHECK_STORAGE_INIT_IMAGE environment variables as described above. If a custom image requires the use of environment variables, they can be passed down into the custom container by setting the environment variable ADDITIONAL_ENV_VARS to a string of comma-separated values ("X=foo,Y=bar").
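To make the expected format of ADDITIONAL_ENV_VARS concrete, here is a small parsing sketch in Python. It is not the check's own parser: the function name is hypothetical, and the handling of edge cases (splitting on the first "=", skipping empty entries, rejecting entries without a name) is an assumption.

```python
def parse_additional_env_vars(value):
    """Parse a string such as "X=foo,Y=bar" into a dictionary.

    Assumptions: each pair is split on its first "=", empty entries are
    skipped, and values containing commas are not supported.
    """
    env = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        name, sep, val = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=value, got {pair!r}")
        env[name] = val
    return env


if __name__ == "__main__":
    print(parse_additional_env_vars("X=foo,Y=bar"))
```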

A successful run implies that a PVC was successfully created and Bound, a Storage init Job was able to use the PVC and correctly initialize it with known data, and all schedulable nodes were able to run the check Job with the mounted PVC and validate the contents. A failure implies that an error or timeout occurred anywhere in the PVC request, Init Job creation, Check Job creation, validation of known data, or tear down process -- resulting in an error report to the Kuberhealthy status page.

Storage Check Diagram

Animated Gif generated by: Tall Tweets

Check Steps

During each run, the check performs the following actions in order:

  1. Looks for an old storage check job, storage init job, and PVC belonging to this check and cleans them up.
  2. Creates a PVC in the namespace and waits for the PVC to be ready.
  3. Creates a storage init configuration, applies it to the namespace, and waits for the storage init job to come up and initialize the PVC with known data.
  4. Determines which nodes in the cluster are going to run the storage check, either by auto-discovery or from the lists of nodes supplied via the CHECK_STORAGE_IGNORED_CHECK_NODES and CHECK_STORAGE_ALLOWED_CHECK_NODES environment variables. Nodes with taints are not included unless the toleration is configured in CHECK_TOLERATIONS.
  5. For each node that needs a check, creates a storage check configuration, applies it to the namespace, and waits for the storage check job to start and validate the contents of the storage on that node.
  6. Tears everything down once completed.
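The ordering of these steps can be sketched as a small driver. The example below is a Python illustration only: FakeCluster is a hypothetical stand-in that records actions instead of talking to Kubernetes, and running the teardown in a finally block (so that it also happens after a failure) is an assumption of this sketch rather than something the list above states.

```python
class FakeCluster:
    """Hypothetical stand-in that records actions instead of calling Kubernetes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # nodes already selected for the check (step 4)
        self.log = []

    def do(self, action):
        self.log.append(action)


def run_check(cluster):
    """Run the check steps in order and return the recorded actions."""
    cluster.do("clean up old jobs and PVC")          # step 1
    try:
        cluster.do("create PVC and wait for it")     # step 2
        cluster.do("run storage init job")           # step 3
        for node in cluster.nodes:                   # steps 4 and 5
            cluster.do(f"run storage check job on {node}")
    finally:
        cluster.do("tear everything down")           # step 6
    return cluster.log


if __name__ == "__main__":
    for action in run_check(FakeCluster(["node1", "node2"])):
        print(action)
```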

Check Details

Example KuberhealthyStorageCheck Spec

The following configuration will create a storage check for all non-master nodes except node4 using the VMware vsan-default storage class:

kind: KuberhealthyCheck
  name: storage-check
  namespace: kuberhealthy
  runInterval: 5m
  timeout: 10m
    - env: 
        - name: CHECK_STORAGE_NAME
          value: "mysuperfuntime-pv-claim"
          value: "vsan-default"
          value: "node4"
      image: chrishirsch/kuberhealthy-storage-check:v0.0.2
      imagePullPolicy: IfNotPresent
      name: main
          cpu: 10m
          memory: 50Mi
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
    restartPolicy: Never
    serviceAccountName: storage-sa
      runAsUser: 999
      fsGroup: 999
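---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: storage-sa
  namespace: kuberhealthy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: storage-role
  namespace: kuberhealthy
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - persistentvolumeclaims
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - "batch"
      - "extensions"
    resources:
      - jobs
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kuberhealthy-storage-cr
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kuberhealthy-storage-crb
roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: kuberhealthy-storage-cr
subjects:
  - kind: ServiceAccount
    name: storage-sa
    namespace: kuberhealthy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: storage-rb
  namespace: kuberhealthy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: storage-role
subjects:
  - kind: ServiceAccount
    name: storage-sa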


To use the Storage Check with Kuberhealthy, apply the configuration file storage-check.yaml to your Kubernetes Cluster. The following command will also apply the configuration file to your current context:

kubectl apply -f

Make sure you are running Kuberhealthy 2.0.0 or later.

The check configuration file contains:

The Role, RoleBinding, ClusterRole, ClusterRoleBinding, and ServiceAccount are all required so that the check can create and delete its PVCs and Jobs in the namespace where you install it. The default service account does not have enough permissions for this check to run.
