bitpoke / mysql-operator

Asynchronous MySQL Replication on Kubernetes using Percona Server and Openark's Orchestrator.
https://www.bitpoke.io/docs/mysql-operator/getting-started/
Apache License 2.0

Pod failing to initialize due to no prior node #250

Closed · pedep closed this issue 5 years ago

pedep commented 5 years ago

I have set up a 3-node MySQL cluster to play around with mysql-operator.

When the node running mysql-0 is drained, the pod seems unable to restore from a sibling/master in the cluster after being rescheduled onto another node. On inspection, the sidecar errors out with this message: https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L73

Since I am using emptyDir, the clone-mysql sidecar should clone from the current master or a sibling, but because pod 0's server ID is exactly 100, the util.GetServerID() > 100 check fails and it goes straight to the error message above. https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L65
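
For context, the server ID appears to be derived from the StatefulSet ordinal plus an offset of 100, which is why mysql-0 is the only pod that can land in this situation. Here is a rough illustration of that assumed mapping (serverIDFromHostname is a hypothetical helper written for this comment, not the actual util.GetServerID):

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// serverIDFromHostname illustrates the assumed ordinal-to-server-ID mapping;
// the real sidecar code may derive it differently.
func serverIDFromHostname(hostname string) (int, error) {
    // A pod hostname looks like "my-cluster-mysql-2"; the trailing number
    // is the StatefulSet ordinal.
    parts := strings.Split(hostname, "-")
    ordinal, err := strconv.Atoi(parts[len(parts)-1])
    if err != nil {
        return 0, fmt.Errorf("cannot parse ordinal from %q: %s", hostname, err)
    }
    // Server IDs are offset by 100, so pod 0 gets exactly 100.
    return 100 + ordinal, nil
}

func main() {
    id, _ := serverIDFromHostname("my-cluster-mysql-0")
    fmt.Println(id) // prints 100, which fails the GetServerID() > 100 check below
}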


It seems some kind of recovery option for pod 0 is needed. I would suggest something along these lines:

if util.GetServerID() > 100 {
    sourceHost := util.GetHostFor(util.GetServerID() - 1)
    err := cloneFromSource(sourceHost)
    if err != nil {
        return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
    }
+} else if util.GetServerID() == 100 {
+   sourceHost := util.GetMasterHost()
+   err := cloneFromSource(sourceHost)
+   if err != nil {
+       return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
+   }
} else {
    return fmt.Errorf(
        "failed to initialize because no of no prior node exists, check orchestrator maybe",
    )
}

I don't think this will result in the pod trying to connect to itself for recovery, due to this check above: https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L52


The easiest way to reproduce this behaviour is to create a new cluster with volumeSpec.emptyDir: {} and a few replicas, then delete the my-cluster-mysql-0 pod.
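
For reference, a minimal cluster spec along these lines should do; the secret name is a placeholder, and the apiVersion and secret key may need adjusting to the operator version you have installed:

apiVersion: mysql.presslabs.org/v1alpha1
kind: MysqlCluster
metadata:
  name: my-cluster
spec:
  replicas: 3
  # assumed: an existing secret in the same namespace containing a ROOT_PASSWORD key
  secretName: my-cluster-secret
  volumeSpec:
    emptyDir: {}

Once all pods are up, kubectl delete pod my-cluster-mysql-0 (or drain its node); the clone-mysql container on the rescheduled pod then fails with the error above.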

AMecea commented 5 years ago

Nice catch, @pedep! Indeed, this is a bug; I didn't test much with emptyDir.

I think your patch should fix this issue.

I will be happy to review and merge a PR with the fix.

pedep commented 5 years ago

@AMecea Thanks :smile:

I will try my hand at a PR in a moment.