OT-CONTAINER-KIT / redis-operator

A Golang-based Redis operator that creates and manages Redis standalone, cluster, replication, and sentinel mode setups on top of Kubernetes.
https://ot-redis-operator.netlify.app/
Apache License 2.0

Reshard/validate sharding of clusters on node reboot/every so often. #773

Open deefdragon opened 5 months ago

deefdragon commented 5 months ago

Is your feature request related to a problem? Please describe. I've had a few instances of having to completely rebuild my cluster after rebooting my servers (I run k8s on local hardware). I believe I have finally determined that this is due to sharding issues in the cluster after a reboot (CLUSTERDOWN Hash slot not served). It appears that clusters can be resharded when scaled down, but their sharding is not validated when they come back up after a reboot.

Describe the solution you'd like It would be nice if the operator could check and validate the sharding periodically and/or when a node reboots.
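To illustrate the kind of check I mean, here is a minimal sketch written against the go-redis v9 client. This is not operator code; the function name and address are made up for the example, and auth is omitted. It counts the slot ranges returned by CLUSTER SLOTS and flags the cluster when fewer than all 16384 slots are assigned, which is exactly the condition behind the CLUSTERDOWN Hash slot not served error:

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

const totalSlots = 16384 // Redis Cluster always has exactly 16384 hash slots

// slotsCovered reports whether every hash slot is currently assigned to a node.
// Simplification for the sketch: assumes the reported ranges do not overlap.
func slotsCovered(ctx context.Context, rdb *redis.ClusterClient) (bool, error) {
	slots, err := rdb.ClusterSlots(ctx).Result()
	if err != nil {
		return false, err
	}
	covered := 0
	for _, s := range slots {
		covered += s.End - s.Start + 1
	}
	return covered == totalSlots, nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		// Illustrative address; a real check would resolve this from the CR.
		// Password/auth omitted for brevity.
		Addrs: []string{"redis-prod-cluster-leader:6379"},
	})
	defer rdb.Close()

	ok, err := slotsCovered(ctx, rdb)
	if err != nil {
		fmt.Println("cluster check failed:", err)
		return
	}
	fmt.Println("all slots served:", ok) // false would explain the CLUSTERDOWN
}

If a check like this failed after a reboot, the operator could trigger a reshard/fix instead of leaving the cluster in a broken state.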

Describe alternatives you've considered I don't currently have any long-term data stored in the cluster, so deleting the storage lets everything rebuild properly, but that's obviously less than ideal, as it requires extra downtime and manual intervention.

I would try to prevent the cluster from going down in the first place, but it's sometimes unavoidable during maintenance.

What version of redis-operator are you using?

redis-operator version: 0.15.1

Additional context Manifest of the Redis cluster (it's Terraform, but it maps essentially key-for-key to the YAML of the manifest):

resource "kubernetes_manifest" "redis_prod_cluster" {
  manifest = {
    apiVersion = "redis.redis.opstreelabs.in/v1beta2"
    kind       = "RedisCluster"
    metadata = {
      name      = "redis-prod-cluster"
      namespace = kubernetes_namespace.redis_namespace.metadata[0].name
    }
    spec = {
      kubernetesConfig = {
        image           = "quay.io/opstree/redis:v7.0.12"
        imagePullPolicy = "IfNotPresent"

        redisSecret = {
          name = kubernetes_secret.redis_prod_password.metadata[0].name
          key  = "password"
        }
        service = {
          serviceType = "NodePort"
        }

      }
      resources = {
        limits = {
          memory = "200Mi"
          cpu    = "100m"
        }
      }

      persistenceEnabled = false
      podSecurityContext = {
        fsGroup   = 0
        runAsUser = 0
      }
      storage = {
        volumeClaimTemplate = {
          spec = {
            accessModes = [
              "ReadWriteOnce",
            ]
            resources = {
              requests = {
                storage = "10Gi"
              }
            }
          }
        }
      }
      clusterSize = 3
      redisLeader = {
        replicas = 3
        securityContext = {
          # fsGroup    = 0
          runAsGroup = 0
          runAsUser  = 0
        }
      }
      redisFollower = {
        replicas = 6
        securityContext = {
          # fsGroup    = 0
          runAsGroup = 0
          runAsUser  = 0
        }
      }
      redisExporter = {
        enabled = true
        image   = "quay.io/opstree/redis-exporter:v1.44.0"
      }
    }
  }
}
drivebyer commented 4 months ago

Thank you for your feedback. You may want to try using the redis-cli --cluster fix command.
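For reference, running redis-cli --cluster check <host>:<port> against any reachable node first will report slot coverage problems (it prints an error when not all 16384 slots are covered), and redis-cli --cluster fix <host>:<port> will then attempt to reassign the missing slots. Both are standard redis-cli cluster subcommands; with a password-protected cluster like the one above you would also need to pass -a <password>.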