Orange-OpenSource / galera-operator

Galera Operator automates tasks for managing a Galera cluster in Kubernetes
Apache License 2.0

Is it possible to change readiness and liveness parameters? #20

Open bruno-lopes opened 3 years ago

bruno-lopes commented 3 years ago

I would like to change the readiness and liveness probe parameters of the Galera instances. Right now, the SST is trying to recover the database data, but the liveness health check kills the container because it is not ready yet.

bruno-lopes commented 3 years ago

I found this in the code:

func galeraProbe(user, password string) *corev1.Probe {
    cmd := []string{"mysqladmin"}
    cmd = append(cmd, fmt.Sprintf("-u%s", user))
    cmd = append(cmd, fmt.Sprintf("-p%s", password))
    cmd = append(cmd, "ping")

    return &corev1.Probe{
        Handler: corev1.Handler{
            Exec: &corev1.ExecAction{
                Command: cmd,
            },
        },
        InitialDelaySeconds: 30,
//      TimeoutSeconds:      1,
//      PeriodSeconds:       10,
//      SuccessThreshold:    1,
//      FailureThreshold:    3,
    }
}

You pass the user and password as arguments, and they come from environment variables. It would be very nice if the health check parameters could be customized through environment variables as well.
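The request above could be sketched roughly as follows. This is only an illustration, not the operator's code: the variable names (`GALERA_PROBE_INITIAL_DELAY_SECONDS`, `GALERA_PROBE_PERIOD_SECONDS`, `GALERA_PROBE_FAILURE_THRESHOLD`) and the `probeTuning` type are hypothetical, and `probeTuning` merely mirrors the integer timing fields of `corev1.Probe` so that the sketch compiles with the standard library alone.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// probeTuning is a hypothetical stand-in for the timing fields of corev1.Probe.
type probeTuning struct {
	InitialDelaySeconds int32
	PeriodSeconds       int32
	FailureThreshold    int32
}

// int32FromEnv parses the variable named key as an int32. It returns def when
// the variable is unset, is not a valid 32-bit integer, or is below min.
func int32FromEnv(lookup func(string) (string, bool), key string, def, min int32) int32 {
	raw, ok := lookup(key)
	if !ok {
		return def
	}
	n, err := strconv.ParseInt(raw, 10, 32)
	if err != nil || int32(n) < min {
		return def
	}
	return int32(n)
}

// tuningFromEnv reads the (hypothetical) variables. The fallback 30 is the
// InitialDelaySeconds in the quoted code; 10 and 3 are the values in its
// commented-out lines, which are also the Kubernetes defaults.
func tuningFromEnv(lookup func(string) (string, bool)) probeTuning {
	return probeTuning{
		InitialDelaySeconds: int32FromEnv(lookup, "GALERA_PROBE_INITIAL_DELAY_SECONDS", 30, 0),
		PeriodSeconds:       int32FromEnv(lookup, "GALERA_PROBE_PERIOD_SECONDS", 10, 1),
		FailureThreshold:    int32FromEnv(lookup, "GALERA_PROBE_FAILURE_THRESHOLD", 3, 1),
	}
}

func main() {
	// os.LookupEnv has the signature the helpers expect.
	fmt.Printf("%+v\n", tuningFromEnv(os.LookupEnv))
}
```

In the operator, the resulting values would be copied into the `corev1.Probe` returned by `galeraProbe`. Passing the lookup function as a parameter keeps the parsing testable without touching the real environment.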

sebs42 commented 3 years ago

> I would like to change the readiness and liveness probe parameters of the Galera instances. Right now, the SST is trying to recover the database data, but the liveness health check kills the container because it is not ready yet.

Can you explain what you are doing? The MariaDB image takes a long time to start, but that is not a problem for the readiness and liveness probes. Are you using a custom image? Are you doing something before MariaDB starts?

You need to start the database first; if you need to load data or create tables, do it after initialization, not before.

bruno-lopes commented 3 years ago

@sebs42, thanks for your quick response. I started the operator, and everything went well. I loaded the data into it (~90 GB). But after some time, I had a problem with the persistent volumes of the pod that contains the backup container, so I erased them and the operator recreated them.

The SST started correctly, but once the health check timed out, the pod was killed and restarted, causing a loop. So I had to manually rebuild the image with increased "PeriodSeconds" and "FailureThreshold" parameters. With this change, the SST process completed successfully.

I know that perhaps the operator was not designed for this, as you mentioned in your response, but if somebody runs into a similar situation, I think it would be nice to be able to set those values easily using environment variables.

Feel free to close this issue if it is not a relevant feature. Thanks!

sebs42 commented 3 years ago

@bruno: do you have any logs to share? Can you describe which parameters you changed (which values for PeriodSeconds and FailureThreshold)? I understand you started a pod manually with other values to join the Galera group: is that correct? It is strange, because the liveness and readiness probes are not connected with SST. The probes are only there to tell whether the SQL engine has started; if the node is not synchronized, no other actions are started until the new node is synchronized with the other members of the Galera cluster.

bruno-lopes commented 3 years ago

Sorry, I don't have the logs. No, I did not start a pod manually. I deployed the operator and it created all the pods. Then I loaded the database (~90 GB). After some time, we had to move some of our AWS instances to another region, and we could not use the volumes that had been created before.

So we killed two of the three pods and let the operator recreate them (and their volumes). During pod initialization, the SST started to download the data from the original pod (the one we had not killed), but after some time the health check restarted the pod before the download had finished.

I had to increase PeriodSeconds and FailureThreshold, altering the galeraProbe function this way:

func galeraProbe(user, password string) *corev1.Probe {
    cmd := []string{"mysqladmin"}
    cmd = append(cmd, fmt.Sprintf("-u%s", user))
    cmd = append(cmd, fmt.Sprintf("-p%s", password))
    cmd = append(cmd, "ping")

    return &corev1.Probe{
        Handler: corev1.Handler{
            Exec: &corev1.ExecAction{
                Command: cmd,
            },
        },
        InitialDelaySeconds: 30,
//      TimeoutSeconds:      1,
        PeriodSeconds:       1000,  // To increase time between healthcheck verifications
//      SuccessThreshold:    1,
        FailureThreshold:    30000, // To allow more failures before killing the pod
    }
}

I built the operator and deployed it, and after that the Galera pods could start and become ready.

But I agree that this is not a common situation.
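
For a sense of scale, here is a rough sketch of how long each version of `galeraProbe` lets a container keep failing before a restart. It assumes the budget is about `InitialDelaySeconds + PeriodSeconds * FailureThreshold`, ignores probe timeouts and scheduling jitter, and takes the Kubernetes defaults (period 10, threshold 3) for the commented-out fields of the original probe. `restartBudget` is a hypothetical helper, not part of the operator.

```go
package main

import (
	"fmt"
	"time"
)

// restartBudget approximates the time a container may fail its liveness probe
// before the kubelet restarts it: the initial delay plus one period per
// tolerated failure. This is an estimate, not the kubelet's exact timing.
func restartBudget(initialDelaySeconds, periodSeconds, failureThreshold int32) time.Duration {
	secs := int64(initialDelaySeconds) + int64(periodSeconds)*int64(failureThreshold)
	return time.Duration(secs) * time.Second
}

func main() {
	// Original probe: InitialDelaySeconds 30, defaults for the rest.
	fmt.Println(restartBudget(30, 10, 3)) // prints "1m0s"

	// Modified probe: PeriodSeconds 1000, FailureThreshold 30000.
	modified := restartBudget(30, 1000, 30000)
	fmt.Printf("%.0f days\n", modified.Hours()/24) // prints "347 days"
}
```

Under these assumptions the original probe leaves about a minute, which a ~90 GB SST will not fit into, while the modified values effectively disable restarts. Much smaller values would probably have been enough to cover the transfer.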