bitpoke / mysql-operator

Asynchronous MySQL Replication on Kubernetes using Percona Server and Openark's Orchestrator.
https://www.bitpoke.io/docs/mysql-operator/getting-started/
Apache License 2.0

Pod failing to initialize due to no prior node #250

Closed · pedep closed this issue 5 years ago

pedep commented 5 years ago

I have set up a 3-node MySQL cluster to play around with mysql-operator.

When the node running mysql-0 is drained, the pod seems unable to restore from a sibling/master in the cluster after being rescheduled onto another node. On inspection, the sidecar errors out with this message: https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L73

Since I am using emptyDir, the clone-mysql sidecar should clone from the current master or a sibling, but because pod 0's server ID is exactly 100, the util.GetServerID() > 100 check fails and it goes straight to the error message above. https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L65
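
For context, the server ID appears to be derived from the StatefulSet ordinal plus an offset of 100, which is why mysql-0 is the only pod that can land in this situation. Here is a rough illustration of that assumed mapping (serverIDFromHostname is a hypothetical helper written for this comment, not the actual util.GetServerID):

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// serverIDFromHostname illustrates the assumed ordinal-to-server-ID mapping;
// the real sidecar code may derive it differently.
func serverIDFromHostname(hostname string) (int, error) {
    // A pod hostname looks like "my-cluster-mysql-2"; the trailing number
    // is the StatefulSet ordinal.
    parts := strings.Split(hostname, "-")
    ordinal, err := strconv.Atoi(parts[len(parts)-1])
    if err != nil {
        return 0, fmt.Errorf("cannot parse ordinal from %q: %s", hostname, err)
    }
    // Server IDs are offset by 100, so pod 0 gets exactly 100.
    return 100 + ordinal, nil
}

func main() {
    id, _ := serverIDFromHostname("my-cluster-mysql-0")
    fmt.Println(id) // prints 100, which fails the GetServerID() > 100 check below
}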


It seems some kind of recovery option for pod 0 is needed. I would suggest something along these lines:

if util.GetServerID() > 100 {
    sourceHost := util.GetHostFor(util.GetServerID() - 1)
    err := cloneFromSource(sourceHost)
    if err != nil {
        return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
    }
+} else if util.GetServerID() == 100 {
+   sourceHost := util.GetMasterHost()
+   err := cloneFromSource(sourceHost)
+   if err != nil {
+       return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
+   }
} else {
    return fmt.Errorf(
        "failed to initialize because no of no prior node exists, check orchestrator maybe",
    )
}

I don't think this will result in the pod trying to connect to itself for recovery, due to this check above: https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L52


The easiest way to reproduce this behaviour is to create a new cluster with volumeSpec.emptyDir: {} and a few replicas, then delete the my-cluster-mysql-0 pod.
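
For reference, a minimal cluster spec along these lines should do; the secret name is a placeholder, and the apiVersion and secret key may need adjusting to the operator version you have installed:

apiVersion: mysql.presslabs.org/v1alpha1
kind: MysqlCluster
metadata:
  name: my-cluster
spec:
  replicas: 3
  # assumed: an existing secret in the same namespace containing a ROOT_PASSWORD key
  secretName: my-cluster-secret
  volumeSpec:
    emptyDir: {}

Once all pods are up, kubectl delete pod my-cluster-mysql-0 (or drain its node); the clone-mysql container on the rescheduled pod then fails with the error above.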

AMecea commented 5 years ago

Nice catch, @pedep! Indeed, this is a bug; I didn't test much with emptyDir.

I think your patch should fix this issue.

I will be happy to review and merge a PR with the fix.

pedep commented 5 years ago

@AMecea Thanks :smile:

I will try my hand at a PR in a moment.