canonical / go-dqlite

Go bindings for libdqlite
https://dqlite.io
Apache License 2.0

Dynamic addresses #189

Open bluebrown opened 2 years ago

bluebrown commented 2 years ago

Hi, I noticed that the addresses of the nodes are saved in a file. I am experimenting with container deployments, and those usually don't come with fixed, repeatable IPs; for example, Docker Compose or a Kubernetes StatefulSet.

I was trying to use DNS as the address of the leader to connect to, but the nodes look up the IP of this hostname and remember it.

The problem is, if the IPs change after a restart or for other reasons, the nodes are not able to start anymore.

I am unsure how this could be solved. As a workaround, I am thinking of deleting the files which contain this information from the data volumes before starting the node, but I don't know which file the actual data is in and which files can be deleted.

Is there some recommended way to deal with this scenario? I would assume it is not uncommon these days to deploy apps in a Kubernetes cluster, for example.

bluebrown commented 2 years ago

Well, I just noticed I was too quick to judge. It is true that it throws a bunch of warnings at the beginning, because the old leader isn't around anymore, but after a moment the nodes have elected a new leader, and it's business as usual.

I am not sure whether it's problematic if some of the same IPs are reassigned, though. For example, node 1 was the leader with a specific IP before the restart, but afterwards that IP is assigned to node 2. This can be observed in these logs:

```
Attaching to app-0, app-1, app-2
app-0 | 2022/05/01 13:38:54 own ip: 172.19.0.2, cluster members []
app-0 | 2022/05/01 13:38:54 WARN: attempt 0: server 172.19.0.2:9000: no known leader
app-1 | 2022/05/01 13:38:55 own ip: 172.19.0.3, cluster members [172.19.0.2:9000]
app-1 | 2022/05/01 13:38:55 WARN: attempt 0: server 172.19.0.2:9000: no known leader
app-0 | 2022/05/01 13:38:55 WARN: attempt 0: server 172.19.0.3:9000: no known leader
app-1 | 2022/05/01 13:38:55 WARN: attempt 0: server 172.19.0.3:9000: no known leader
app-2 | 2022/05/01 13:38:56 own ip: 172.19.0.4, cluster members [172.19.0.3:9000 172.19.0.2:9000]
app-0 | 2022/05/01 13:38:56 WARN: attempt 0: server 172.19.0.4:9000: dial: dial tcp 172.19.0.4:9000: connect: connection refused
app-1 | 2022/05/01 13:38:56 WARN: attempt 0: server 172.19.0.4:9000: dial: dial tcp 172.19.0.4:9000: connect: connection refused
app-2 | 2022/05/01 13:38:56 WARN: attempt 0: server 172.19.0.2:9000: no known leader
app-2 | 2022/05/01 13:38:56 WARN: attempt 0: server 172.19.0.3:9000: no known leader
app-2 | 2022/05/01 13:38:56 WARN: attempt 0: server 172.19.0.4:9000: no known leader
app-0 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.2:9000: no known leader
app-0 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.3:9000: no known leader
app-0 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.4:9000: no known leader
app-1 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.2:9000: no known leader
app-1 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.3:9000: no known leader
app-1 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.4:9000: no known leader
app-2 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.2:9000: no known leader
app-2 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.3:9000: no known leader
app-2 | 2022/05/01 13:38:57 WARN: attempt 1: server 172.19.0.4:9000: no known leader
app-0 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.2:9000: no known leader
app-0 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.3:9000: no known leader
app-0 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.4:9000: no known leader
app-1 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.2:9000: no known leader
app-1 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.3:9000: no known leader
app-1 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.4:9000: no known leader
app-2 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.2:9000: no known leader
app-2 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.3:9000: no known leader
app-2 | 2022/05/01 13:38:57 WARN: attempt 2: server 172.19.0.4:9000: no known leader
app-0 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.2:9000: no known leader
app-0 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.3:9000: no known leader
app-0 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.4:9000: no known leader
app-1 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.2:9000: no known leader
app-1 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.3:9000: no known leader
app-1 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.4:9000: no known leader
app-2 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.2:9000: no known leader
app-2 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.3:9000: no known leader
app-2 | 2022/05/01 13:38:58 WARN: attempt 3: server 172.19.0.4:9000: no known leader
app-0 | 2022/05/01 13:38:59 DEBUG: attempt 4: server 172.19.0.2:9000: connected
app-1 | 2022/05/01 13:38:59 DEBUG: attempt 4: server 172.19.0.2:9000: connected
app-0 | 2022/05/01 13:38:59 DEBUG: attempt 0: server 172.19.0.2:9000: connected
app-0 | 2022/05/01 13:38:59 starting server
app-1 | 2022/05/01 13:38:59 DEBUG: attempt 0: server 172.19.0.2:9000: connected
app-1 | 2022/05/01 13:38:59 starting server
app-2 | 2022/05/01 13:38:59 DEBUG: attempt 4: server 172.19.0.2:9000: connected
app-2 | 2022/05/01 13:38:59 DEBUG: attempt 0: server 172.19.0.2:9000: connected
app-2 | 2022/05/01 13:38:59 starting server
```

I am using a function like the following to determine the IPs to connect to, providing the DNS name of the headless service (or, in this case, the shared Docker network alias).

// clusterAddresses resolves this node's own address and the addresses of the
// other cluster members, given a DNS name that resolves to all members.
func clusterAddresses(ctx context.Context, dns, port string) (string, []string, error) {
    host, err := os.Hostname()
    if err != nil {
        return "", nil, err
    }

    r := net.Resolver{}

    // Resolve our own hostname to find our own IP.
    ips, err := r.LookupHost(ctx, host)
    if err != nil {
        return "", nil, err
    }
    if len(ips) == 0 {
        return "", nil, fmt.Errorf("no IPs found for %s", host)
    }
    ownIp := ips[0]

    // Resolve the cluster DNS name to find all member IPs.
    ips, err = r.LookupHost(ctx, dns)
    if err != nil {
        return "", nil, err
    }

    // Keep only the other members, filtering out our own IP.
    clusterIps := make([]string, 0, len(ips))
    for _, ip := range ips {
        if ip == ownIp {
            continue
        }
        clusterIps = append(clusterIps, net.JoinHostPort(ip, port))
    }

    log.Printf("own ip: %s, cluster members %v", ownIp, clusterIps)

    return net.JoinHostPort(ownIp, port), clusterIps, nil
}

And then I start the app like this:

ownAddr, members, err := clusterAddresses(ctx, clusterDNS, sqlPort)
if err != nil {
    return err
}
dqlite, err := app.New(
    dataDir,
    app.WithAddress(ownAddr),
    app.WithCluster(members),
)
freeekanayaka commented 2 years ago

You can use either an IP or a resolvable DNS name as the argument of the app.WithAddress() option. However, once you set that IP or DNS name the first time you start the node, the same IP or DNS name must be used at every restart.

I believe both Kubernetes and Docker have options to do this; for example, etcd works the same way (and so does pretty much any service based on Raft). You mention stateful sets in Kubernetes; those should provide stable IPs, or am I wrong?

See the k8s docs: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id

bluebrown commented 2 years ago

WithAddress didn't work with DNS in my testing. You can use DNS in WithCluster, but not in WithAddress. But it seems to do a one-time lookup and store the IP it found, so there is not much gained by doing this. I am also not sure if it would confuse the node if it had its own address in the slice that is passed to WithCluster, because that's what would happen if you just stuck the headless-service DNS in there. Hence, I am filtering it out in my code.

Yes, I am referring to the provided link, but what is described there as a stable network ID is actually about DNS. It's about the fact that the same pod with the same state (volume etc.) will always be, e.g., myapp-0 or myapp-1. The names will never change, and they have no random string suffix like pods from a deployment. There is no guarantee that the same IP will be assigned to the pod, though, at least per my understanding.

It's true that you can assign subnets, and IPs within subnets, to containers in Docker. I am not sure if that works in Kubernetes too. I know you can do it for k8s services, but I haven't seen it for pods. Usually, clustered applications have a shared headless service that returns all the IPs of the pods of the statefulset when doing a DNS lookup, and they can find each other like that. My code above was created with this kind of thing in mind.

From my experimentation, it seems like go-dqlite nodes are able to recover from messed-up IPs. It takes a couple of seconds, and then they appear to run OK. It's just that at first there is no leader, and they have to hold an election, it seems.

I will research whether it's possible to assign static IPs to pods; however, this is error-prone, as there may be no way to guarantee that the IP won't be taken at some point by something else.

freeekanayaka commented 2 years ago

Oh, I think DNS was not working in your testing because it only works if you also enable TLS (e.g. with app.WithTLS(app.SimpleTLSConfig(cert, pool))). This is arguably a bug, but it's the state of things right now.

Do you have a chance to configure your dqlite app with TLS? With that in place, you should be able to use the stable network ID (DNS name) provided by Kubernetes.

freeekanayaka commented 2 years ago

See this test setup function for an example of how to configure your app object with TLS.

bluebrown commented 2 years ago

I will experiment with that, thanks for the hint. But I think it's not working because it wants to bind a network interface. You cannot bind a network interface with a hostname or DNS name, AFAIK; you need the actual IP. Nameservers point to the IP of the network interface via DNS records.

It works with WithCluster because there it's doing a DNS lookup, since it doesn't have to bind the addresses provided there. But even then it seems to be a one-time operation; afterwards the IPs from the DNS lookup are stored in the YAML file.

But again, I will check this out in depth and report back.

freeekanayaka commented 2 years ago

And there's some more info in the SimpleTLSConfig() docstring too.

freeekanayaka commented 2 years ago

> I will experiment with that, thanks for the hint. But I think it's not working because it wants to bind a network interface. You cannot bind a network interface with a hostname or DNS name, AFAIK.

Right, but if you use the WithTLS() option, then the code will pass the value of WithAddress() to the Go stdlib net.Listen() function, which should automatically resolve hostnames. At least this is what I recall; please let me know if it's not the case.
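
To illustrate (a minimal sketch; "app-0" is a placeholder hostname that must resolve to one of the machine's own addresses for the bind to succeed):

```
package main

import (
	"fmt"
	"net"
)

func main() {
	// net.Listen accepts "host:port" where the host may be a DNS name:
	// the stdlib resolves the name first and then binds the resulting
	// address, unlike a raw bind() on the C side.
	ln, err := net.Listen("tcp", net.JoinHostPort("app-0", "9000"))
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}
```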

Changing the value that you pass to WithAddress() might have worked for you, but I'm not entirely sure it's safe to do; it could lead to subtle issues, and I'd need to investigate further to confirm. Anyway, what we officially support is static IPs or DNS names (which, as I mentioned, is pretty much a requirement of the Raft algorithm, AFAICT).

bluebrown commented 2 years ago

With this, it stores DNS names in the config:

app.WithTLS(app.SimpleTLSConfig(cert, pool)),
app.WithAddress(net.JoinHostPort(nodeDNS, sqlPort)),
app.WithCluster(statefulZeroDNS),

But I have issues with the nodes accepting each other's certificate.

app-1  | 2022/05/01 22:46:45 WARN: attempt 2: server app-0:9000: write handshake: x509: certificate is valid for app-0, not app-1
app-0  | 2022/05/01 22:46:45 ERROR: proxy: first: remote -> local: remote error: tls: bad certificate

I used the openssl command from the comments there to generate a cert for each app's name, like app-0 and app-1.

Since they are self-signed, I would think they all need to share the same CA, or at least have access to the CA, so they can validate the certs of the other apps. Or should this be the same cert for all apps, with all the DNS names inside, like DNS.1, DNS.2 in the subject alternative names?

bluebrown commented 2 years ago

The code looks like this.

cert, err := tls.LoadX509KeyPair("cluster.crt", "cluster.key")
if err != nil {
    return err
}
data, err := ioutil.ReadFile("cluster.crt")
if err != nil {
    return err
}
pool := x509.NewCertPool()
pool.AppendCertsFromPEM(data)

nodeDNS, ok := os.LookupEnv("NODE_DNS")
if !ok {
    return fmt.Errorf("NODE_DNS not set")
}

// Every node except app-0 joins the cluster via app-0.
var statefulZeroDNS []string
if !strings.HasSuffix(nodeDNS, "-0") {
    dns := regexp.MustCompile(`-\d+$`).ReplaceAllString(nodeDNS, "-0")
    statefulZeroDNS = []string{net.JoinHostPort(dns, sqlPort)}
}

dqlite, err := app.New(
    dataDir,
    app.WithTLS(app.SimpleTLSConfig(cert, pool)),
    app.WithAddress(net.JoinHostPort(nodeDNS, sqlPort)),
    app.WithCluster(statefulZeroDNS),
    app.WithLogFunc(func(l client.LogLevel, format string, a ...interface{}) {
        if l < 2 {
            return
        }
        log.Printf(fmt.Sprintf("%s: %s\n", l.String(), format), a...)
    }),
)

And each container creates its own certificate in the entrypoint.

openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
    -nodes -keyout cluster.key -out cluster.crt -subj "/CN=$NODE_DNS" \
    -addext "subjectAltName=DNS:$NODE_DNS"

Will keep playing with it tomorrow. It's sleepy time now.

freeekanayaka commented 2 years ago

You should generate the certificate only once, then copy it to all nodes before starting them for the first time. If you want a separate certificate for each node, then a more complex app.WithTLS configuration would be needed (the app.SimpleTLSConfig helper is meant to support only a single certificate shared by all nodes in the cluster).

bluebrown commented 2 years ago

It works when they all have the same certificate. They all start up OK and find each other via hostnames.

There are still some issues though. When they start up, they still have this moment of warnings where they all report that they don't have a leader, but eventually it works.

But it becomes really problematic when using health checks, because in that case app-1 does not start until app-0 is healthy, while app-0 does not become healthy because it wants to connect to app-1 as its leader.

The output below is from restarting the containers with depends_on and healthcheck configured, to simulate a statefulset. app-0 keeps logging the following, mostly "no known leader" and occasionally "no such host":

app-0   | 2022/05/02 17:24:18 pod dns: app-0.app-headless.default.svc.cluster.local, cluster []
app-0   | 2022/05/02 17:24:18 WARN: attempt 0: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-0   | 2022/05/02 17:24:18 WARN: attempt 0: server app-2.app-headless.default.svc.cluster.local:9000: dial: dial tcp: lookup app-2.app-headless.default.svc.cluster.local: no such host

After a while, it's marked as unhealthy by the orchestrator and killed, halting the entire application rollout.

Now, this may be better in Kubernetes, because I think it will also shut down the statefulset in reverse order so that the nodes hand off their leadership, but I am not sure about this, and it's probably not very solid to rely on it. Even if it's the case, I would have to test it.

bluebrown commented 2 years ago

It works when not using func (a *App) Ready(ctx context.Context) error to wait before responding to health checks with an OK status, but it's not really nice, IMO. Perhaps it can be used in a different way.

Then you get logs like the following, but eventually it works.

Logs:

```
app-2 | 2022/05/02 17:43:56 pod dns: app-2.app-headless.default.svc.cluster.local, cluster [app-0.app-headless.default.svc.cluster.local:9000]
app-0 | 2022/05/02 17:43:56 WARN: attempt 3: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:56 WARN: attempt 3: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:56 WARN: attempt 0: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:56 WARN: attempt 0: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:56 WARN: attempt 0: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:56 WARN: attempt 0: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 pod dns: app-1.app-headless.default.svc.cluster.local, cluster [app-0.app-headless.default.svc.cluster.local:9000]
app-2 | 2022/05/02 17:43:57 WARN: attempt 1: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 1: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 0: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 0: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 1: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 1: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 0: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 0: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 0: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 0: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 1: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 1: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 1: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 1: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 1: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:57 WARN: attempt 1: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 2: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 2: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 2: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:57 WARN: attempt 2: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:57 WARN: attempt 4: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:57 WARN: attempt 4: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 2: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 2: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 2: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 2: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 2: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 2: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:58 WARN: attempt 3: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:58 WARN: attempt 3: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:58 WARN: attempt 3: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:58 WARN: attempt 3: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:58 WARN: attempt 5: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:58 WARN: attempt 5: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 3: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 3: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 3: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:58 WARN: attempt 3: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:59 WARN: attempt 3: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:43:59 WARN: attempt 3: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:59 WARN: attempt 4: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:59 WARN: attempt 4: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:59 WARN: attempt 4: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:43:59 WARN: attempt 4: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:59 WARN: attempt 6: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-0 | 2022/05/02 17:43:59 WARN: attempt 6: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:44:00 WARN: attempt 4: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:44:00 WARN: attempt 4: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:44:00 WARN: attempt 4: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:44:00 WARN: attempt 4: server app-2.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:44:00 WARN: attempt 4: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-1 | 2022/05/02 17:44:00 WARN: attempt 4: server app-1.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:44:00 WARN: attempt 5: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:44:00 WARN: attempt 5: server app-0.app-headless.default.svc.cluster.local:9000: no known leader
app-2 | 2022/05/02 17:44:00 starting server
app-0 | 2022/05/02 17:44:00 starting server
app-1 | 2022/05/02 17:44:01 starting server
```

freeekanayaka commented 2 years ago

I'm not entirely sure what your setup and code look like (e.g. what the health checks are, etc.). Would it be possible to see the code that you are using? I'd recommend starting with the same code as the dqlite-demo.go example, adding the TLS configuration to it.

Using func (a *App) Ready(ctx context.Context) error should have no adverse effect; on the contrary, it's recommended.

bluebrown commented 2 years ago

I started from that code. The code has the same issue, I would assume, because the net listener only starts to listen on the HTTP port after Ready unblocks. But Ready won't unblock until a connection to the leader has been made, which may be a container that is started after the current one in the statefulset's startup order.

My healthcheck here is not doing much apart from responding on an HTTP ping endpoint. The idea was to use it to know if the server has actually started, i.e. app.Ready has unblocked.
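
For what it's worth, here is a minimal sketch of that idea (purely illustrative; it assumes a *app.App value named dqlite and a context ctx): the HTTP listener can start immediately, while the ping endpoint reports OK only once Ready has unblocked.

```
// Wait for the cluster in the background instead of blocking the HTTP
// listener, so the orchestrator can probe the process from the start.
ready := make(chan struct{})
go func() {
	if err := dqlite.Ready(ctx); err == nil {
		close(ready)
	}
}()
mux := http.NewServeMux()
mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
	select {
	case <-ready:
		w.WriteHeader(http.StatusOK) // leader elected, node is usable
	default:
		http.Error(w, "not ready", http.StatusServiceUnavailable)
	}
})
```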

Also keep in mind, I am talking about the scenario where all the containers are stopped and then started again.

Here is the code I am currently using: https://gist.github.com/bluebrown/f5abd384da488a8f356042662a8b929d

bluebrown commented 2 years ago

The compose file to mimic a statefulset's behaviour looks like this, in a nutshell. I have removed some fields from the services for brevity, but the important parts, healthcheck and depends_on, are there.

services:
  app-0:
    image: testapp
    healthcheck:
      test: httpcheck http://localhost:8080/ping

  app-1:
    image: testapp
    healthcheck:
      test: httpcheck http://localhost:8080/ping
    depends_on:
      app-0: { condition: "service_healthy" }

  app-2:
    image: testapp
    healthcheck:
      test: httpcheck http://localhost:8080/ping
    depends_on:
      app-0: { condition: "service_healthy" }
      app-1: { condition: "service_healthy" }

volumes:
  app-0:
  app-1:
  app-2:
bluebrown commented 2 years ago

I have created a repo here; you should be able to spin the code up with make: https://github.com/bluebrown/dqlite-experiment

freeekanayaka commented 2 years ago

Ok, thank you very much! That should make debugging easier. I didn't yet try it out, but I'm making some notes below in case you want to experiment further. Perhaps @MathieuBordere could step in too and try to reproduce the problem using your repo?

  1. The openssl option "subjectAltName=DNS:$(SERVICE_NAME)" is slightly different from the one indicated in the app.SimpleTLSConfig() docstring, which reads "subjectAltName=DNS:$DNS,IP:$IP", so basically the IP parameter is not there. If you get nodes communicating properly, I guess the IP doesn't matter and perhaps we should remove it from the docstring too. Otherwise, if your nodes never communicate (there's never a leader), then it's something you might want to try.
  2. There are 2 separate network endpoints (ports) on each node: one endpoint for inter-node dqlite communication (which is entirely managed by app.App) and one endpoint for the application API (which is entirely managed by the application, in your case the runGracefully()/newRouter() functions). The dqlite endpoint is opened by the app.New() function (which calls net.Listen() with the provided TLS certificate and hostname/port). That happens before the call to app.Ready(). In order to elect a leader, the call to app.New() is enough; there's no need for the application endpoint to be started. And the fact that the application endpoint should be started after app.Ready() is on purpose, because when app.Ready() returns the leader should have been elected, at which point the distributed db should be fully functional and your application can open its HTTP endpoint and start serving requests (see the sketch after this list).
  3. The healthcheck in the docker configuration above hits the application HTTP endpoint, not the inter-node dqlite one. I'm not familiar with the semantics of this configuration file (e.g. the depends_on option), but if this configuration somehow prevents some nodes from starting until the healthcheck of previous nodes is successful (or something like that), it will be a problem. The idea is that all nodes should be started in parallel, with no particular ordering required. As long as a quorum of them has started, app.Ready() should return without error and the application endpoint will be started.
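
To make the ordering described in point 2 concrete, here is a minimal sketch in the spirit of the dqlite-demo code (names like dataDir, ownAddr, members, cert, pool and mux are placeholders; imports from context, crypto/tls, crypto/x509, net/http and go-dqlite's app package are assumed, and error handling is shortened):

```
func run(ctx context.Context, dataDir, ownAddr string, members []string,
	cert tls.Certificate, pool *x509.CertPool, mux *http.ServeMux) error {
	// app.New opens the inter-node dqlite endpoint right away (with TLS
	// configured it goes through net.Listen, so a hostname works here).
	dqlite, err := app.New(dataDir,
		app.WithTLS(app.SimpleTLSConfig(cert, pool)),
		app.WithAddress(ownAddr),
		app.WithCluster(members),
	)
	if err != nil {
		return err
	}
	// Ready blocks until a leader is known; the inter-node endpoint is
	// already up, so the other nodes can reach this one while we wait.
	if err := dqlite.Ready(ctx); err != nil {
		return err
	}
	// Only now open the application endpoint and start serving requests.
	return http.ListenAndServe(":8080", mux)
}
```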

As a side note regarding 3., I'll also add that even for the very first run you should be able to start all nodes in parallel (if app-1 or app-2 happen to start before app-0, then they should just wait for app-0). In that case the requirement is that app-0 actually eventually starts, because it will initially be a single-node cluster, and app-1 and app-2 should join this initial cluster as soon as they notice that app-0 is up. After this initial run, if you perform a full cluster restart, restarting all nodes in parallel, app-0 is no longer required for a successful start: since app-1 and app-2 have successfully joined the cluster in the initial run, they can now form a proper quorum once they both start.

Hope that helps. Please let me know if always starting nodes in parallel regardless of the healthcheck does the trick.

freeekanayaka commented 2 years ago

To sum up a bit, the orchestrator (k8s or docker) should:

  1. Always assign the same hostname to the same node
  2. Always try to (re)start a node when it's down, with no particular ordering or dependency with respect to other nodes
  3. If it's the very first run, then app-0 MUST eventually show up in order for the application to be functional
  4. If it's not the first run, if 2 nodes out of 3 show up, then your application will be functional
bluebrown commented 2 years ago

Hi, thanks for the feedback. I'll add some thoughts on the mentioned points.

1. Always assign the same hostname to the same node: this can be guaranteed by using a statefulset's stable network ID.

2. Always try to (re)start a node when it's down, with no particular ordering or dependency with respect to other nodes: you could use a parallel podManagementPolicy.

3. If it's the very first run, then app-0 MUST eventually show up in order for the application to be functional: this will eventually happen, assuming no faulty code is deployed.

4. If it's not the first run, if 2 nodes out of 3 show up, then your application will be functional: this will eventually happen, thanks to the parallel management policy.


In that sense, I can remove the depends_on clause from my compose simulation, so that they all start at the same time, just like a parallel statefulset. That way, I can also block with app.Ready before responding to health checks.

So it should be all sorted out this way. Thanks again for your help :)

I still have 2 open questions, apart from the solved main problem:

* I think it would be good if there was a way to drop the certificate, though. It is just a performance hit and doesn't serve a real purpose in this scenario. I will try to read the source code and understand where this limitation comes from. I already looked around a little bit but couldn't spot it. It looks like it's using a net.Listener regardless of TLS.

* The other thing is: what would happen if the DNS changed one day? I mean, you want to persist your data; perhaps one day you need to deploy it somewhere else, even outside Kubernetes. How can we ensure we can always access the data, or get the app back into a running state, in case something happens with the addresses or DNS?

bluebrown commented 2 years ago

PS. Regarding the certificate: I removed the IP because the IP is not static, so I can't really provide it. It works without it, it seems.

The provided OpenSSL command is also using a single IP, but you will usually have more than one node, so even then the IP doesn't seem to make a lot of sense, since the cert is shared. That was the reason why I tried to give each app its own certificate at some point.

freeekanayaka commented 2 years ago

> PS. Regarding the certificate: I removed the IP because the IP is not static, so I can't really provide it. It works without it, it seems.
>
> The provided OpenSSL command is also using a single IP, but you will usually have more than one node, so even then the IP doesn't seem to make a lot of sense, since the cert is shared. That was the reason why I tried to give each app its own certificate at some point.

Good to know. I'll probably try to get rid of it as well and possibly change the docs. Thanks.

freeekanayaka commented 2 years ago

> Hi, thanks for the feedback. I'll add some thoughts on the mentioned points.
>
> 1. Always assign the same hostname to the same node: this can be guaranteed by using a statefulset's stable network ID.
>
> 2. Always try to (re)start a node when it's down, with no particular ordering or dependency with respect to other nodes: you could use a parallel podManagementPolicy.
>
> 3. If it's the very first run, then app-0 MUST eventually show up in order for the application to be functional: this will eventually happen, assuming no faulty code is deployed.
>
> 4. If it's not the first run, if 2 nodes out of 3 show up, then your application will be functional: this will eventually happen, thanks to the parallel management policy.
>
> In that sense, I can remove the depends_on clause from my compose simulation, so that they all start at the same time, just like a parallel statefulset. That way, I can also block with app.Ready before responding to health checks.
>
> So it should be all sorted out this way. Thanks again for your help :)
>
> I still have 2 open questions, apart from the solved main problem:

> * I think it would be good if there was a way to drop the certificate, though. It is just a performance hit and doesn't serve a real purpose in this scenario. I will try to read the source code and understand where this limitation comes from. I already looked around a little bit but couldn't spot it. It looks like it's using a net.Listener regardless of TLS.

First, I believe the performance hit is very likely negligible, so even if you go straight TCP your app probably won't run any faster. I don't have hard data, but network and disk latency should largely dominate any overhead due to TLS.

Having said that, the problem is that net.Listen() is actually not used if you don't set TLS. Rather, a straight bind() call is used in the C code. However, while net.Listen() also works with hostnames, bind() does not.

The reason net.Listen is used only when TLS is involved is that I didn't have time to introduce TLS support in the C code, so that was a cheap solution. The side effect is that hostnames are then supported in that case. The ideal solution would probably be to add TLS support at the dqlite C level, and support for hostnames too. But that requires some work.

> * The other thing is: what would happen if the DNS changed one day? I mean, you want to persist your data; perhaps one day you need to deploy it somewhere else, even outside Kubernetes. How can we ensure we can always access the data, or get the app back into a running state, in case something happens with the addresses or DNS?

In that case all you need is a copy of the data directory from any node; then call node.ReconfigureMembership(), or use the reconfigure command of the dqlite shell program, which internally just calls node.ReconfigureMembership for you.
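
For illustration, a rough sketch of that recovery path (the node IDs and addresses below are placeholders, and the exact signature should be double-checked against the go-dqlite version in use):

```
import (
	dqlite "github.com/canonical/go-dqlite"
	"github.com/canonical/go-dqlite/client"
)

// recoverCluster rewrites the membership stored in a copy of one node's
// data directory, so the cluster can be restarted under new addresses.
func recoverCluster(dataDir string) error {
	cluster := []client.NodeInfo{
		{ID: 1, Address: "new-app-0:9000", Role: client.Voter},
		{ID: 2, Address: "new-app-1:9000", Role: client.Voter},
		{ID: 3, Address: "new-app-2:9000", Role: client.Voter},
	}
	return dqlite.ReconfigureMembership(dataDir, cluster)
}
```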

bluebrown commented 2 years ago

OK, makes sense. Thank you. I am not good with C, otherwise I would try to contribute here.

The Reconfigure option is handy to have. That's good to know.

I will create a working Kubernetes example, considering the discussed points, and report back. Maybe someone can benefit from it in the future.

Thank you for taking the time to explain all these things.

freeekanayaka commented 2 years ago

A working Kubernetes example is definitely a valuable contribution; let us know, thanks!

bluebrown commented 2 years ago

I started working on a Kubernetes setup, but for some reason 1 pod out of 3 always fails. It works locally without any issues. The project is here: https://github.com/bluebrown/dqlite-kubernetes-demo

The failing pod shows different types of warnings.

The app-0 pod, which is the entrypoint for the cluster, also shows some warnings. In this case app-2 was eventually able to connect, while app-1 was the one failing, because it was still not ready after 5 minutes:

2022/07/30 06:56:45 starting server
2022/07/30 06:57:15 WARN: change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)
2022/07/30 06:57:15 WARN: adjust roles: could not assign role voter to any node
bluebrown commented 2 years ago

Hi, I have refactored the project, but I am still not able to run it in Kubernetes. It is currently in this branch: https://github.com/bluebrown/dqlite-kubernetes-demo/tree/refactor.

Below are the logs of the 3 pods. Any idea what is wrong?

$ k logs dqlite-app-0
I0904 20:50:42.017238       1 main.go:128] "starting server" httpPort="8080" sqlPort="9000" level="INFO"
I0904 20:51:12.066036       1 kube-dqlite.go:72] "change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)" level="WARN"
I0904 20:51:12.066069       1 kube-dqlite.go:72] "adjust roles: could not assign role voter to any node" level="WARN"
I0904 20:51:42.164817       1 kube-dqlite.go:72] "change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)" level="WARN"
I0904 20:51:42.164845       1 kube-dqlite.go:72] "adjust roles: could not assign role voter to any node" level="WARN"
I0904 20:52:12.268568       1 kube-dqlite.go:72] "change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)" level="WARN"
I0904 20:52:12.268597       1 kube-dqlite.go:72] "adjust roles: could not assign role voter to any node" level="WARN"
I0904 20:52:42.387826       1 kube-dqlite.go:72] "change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)" level="WARN"
I0904 20:52:42.387857       1 kube-dqlite.go:72] "adjust roles: could not assign role voter to any node" level="WARN"
I0904 20:53:12.488257       1 kube-dqlite.go:72] "change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)" level="WARN"
I0904 20:53:12.488285       1 kube-dqlite.go:72] "adjust roles: could not assign role voter to any node" level="WARN"
I0904 20:53:42.597293       1 kube-dqlite.go:72] "change dqlite-app-2.dqlite-app-headless.sandbox.svc.cluster.local:9000 from spare to voter: a configuration change is already in progress (5)" level="WARN"
I0904 20:53:42.597318       1 kube-dqlite.go:72] "adjust roles: could not assign role voter to any node" level="WARN"

$ k logs dqlite-app-1
I0904 20:50:37.877369       1 kube-dqlite.go:72] "attempt 1: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:38.094022       1 kube-dqlite.go:72] "attempt 2: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:38.512602       1 kube-dqlite.go:72] "attempt 3: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:39.322182       1 kube-dqlite.go:72] "attempt 4: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:40.332311       1 kube-dqlite.go:72] "attempt 5: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:41.343892       1 kube-dqlite.go:72] "attempt 6: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:42.358067       1 kube-dqlite.go:72] "attempt 7: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:43.372936       1 kube-dqlite.go:72] "attempt 8: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:44.388223       1 kube-dqlite.go:72] "attempt 9: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:45.398918       1 kube-dqlite.go:72] "attempt 10: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:46.409388       1 kube-dqlite.go:72] "attempt 11: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:47.421141       1 kube-dqlite.go:72] "attempt 12: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:48.435436       1 kube-dqlite.go:72] "attempt 13: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:49.444319       1 kube-dqlite.go:72] "attempt 14: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:50.457052       1 kube-dqlite.go:72] "attempt 15: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:51.475682       1 kube-dqlite.go:72] "attempt 16: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:52.486987       1 kube-dqlite.go:72] "attempt 17: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:51:43.911126       1 kube-dqlite.go:72] "assign voter role to ourselves: no connection to remote server available (1)" level="WARN"
I0904 20:52:35.453527       1 kube-dqlite.go:72] "assign voter role to ourselves: no connection to remote server available (1)" level="WARN"
I0904 20:53:26.991293       1 kube-dqlite.go:72] "assign voter role to ourselves: no connection to remote server available (1)" level="WARN"
I0904 20:54:18.527386       1 kube-dqlite.go:72] "assign voter role to ourselves: no connection to remote server available (1)" level="WARN"

k logs dqlite-app-2
I0904 20:50:37.731536       1 kube-dqlite.go:72] "attempt 1: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:37.940357       1 kube-dqlite.go:72] "attempt 2: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:38.349788       1 kube-dqlite.go:72] "attempt 3: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:39.159941       1 kube-dqlite.go:72] "attempt 4: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:40.168905       1 kube-dqlite.go:72] "attempt 5: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:41.178705       1 kube-dqlite.go:72] "attempt 6: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:42.189082       1 kube-dqlite.go:72] "attempt 7: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:43.198867       1 kube-dqlite.go:72] "attempt 8: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:44.207721       1 kube-dqlite.go:72] "attempt 9: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:45.216602       1 kube-dqlite.go:72] "attempt 10: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:46.227215       1 kube-dqlite.go:72] "attempt 11: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:47.238356       1 kube-dqlite.go:72] "attempt 12: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:48.248034       1 kube-dqlite.go:72] "attempt 13: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:49.275251       1 kube-dqlite.go:72] "attempt 14: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:50.297809       1 kube-dqlite.go:72] "attempt 15: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:51.306230       1 kube-dqlite.go:72] "attempt 16: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:52.317964       1 kube-dqlite.go:72] "attempt 17: server dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local:9000: dial: dial tcp: lookup dqlite-app-0.dqlite-app-headless.sandbox.svc.cluster.local: no such host" level="WARN"
I0904 20:50:53.488098       1 main.go:128] "starting server" httpPort="8080" sqlPort="9000" level="INFO"
bluebrown commented 2 years ago

I feel like app-2 is giving up before app-0 is ready. It's strange because app-1 succeeds.

After 5 minutes, app-2 was restarted. Now I have the below logs for app-2. It's also strange that app-0, which is supposed to be the cluster node to connect to, does not show any logs indicating that app-2 is communicating with it.

k logs dqlite-app-2
I0904 21:28:46.970090       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:48.556606       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:50.205212       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:51.611694       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:53.080958       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:54.349198       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:55.706263       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:57.185216       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:58.390537       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:28:59.517224       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:29:01.178567       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:29:02.516589       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:29:03.628866       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:29:04.725551       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:29:05.930173       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
I0904 21:29:07.079364       1 kube-dqlite.go:72] "assign voter role to ourselves: a configuration change is already in progress (5)" level="WARN"
bluebrown commented 2 years ago

Ok, I found the issue.

The reason it's failing is that the nodes need to communicate with each other, but the readiness probe, paired with the cluster DNS via the headless service, is preventing that. A service will not route traffic to a pod that is not ready, and in this case the pod won't become ready unless it receives that traffic.

A quick and dirty solution is to disable the health checks, but I think it's also possible to connect the pods directly without going over the service, that is, using <pod>-<n>.<namespace>.svc.cluster.local instead of <pod>-<n>.<headless-service>.<namespace>.svc.cluster.local as the hostname for the cluster members.

It would be even better if dqlite would resolve hosts based on the search option in /etc/resolv.conf, but AFAIK it's not doing that. This would be ideal because it would allow promoting the application across namespaces without manual reconfiguration. Currently, if the namespace changes, the DNS changes, and dqlite is left in a somewhat broken state.

MathieuBordere commented 2 years ago

go-dqlite just depends on the DNS resolving capabilities of net.Dial. I haven't yet investigated this, but are you sure your container has the correct permissions to access /etc/resolv.conf?

bluebrown commented 2 years ago

OK, using the search option in /etc/resolv.conf does actually work. The issue was that a statefulset requires using the governing headless service for DNS resolution. It's not possible to resolve the pods by name.
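
For reference, this relies on the stdlib resolver behind net.Dial, which consults /etc/resolv.conf; a small sketch (hostnames taken from this thread, purely illustrative):

```
package main

import (
	"fmt"
	"net"
)

func main() {
	// The resolver appends the search domains from /etc/resolv.conf
	// (inside a pod this includes "<namespace>.svc.cluster.local"), so
	// the namespace suffix does not have to be hardcoded in the address.
	conn, err := net.Dial("tcp", "dqlite-app-0.dqlite-app-headless:9000")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```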

I found a well-hidden option which I could use on the headless service. Setting .spec.publishNotReadyAddresses: true allows the pods to communicate with each other before they are ready, so that the cluster can form. This way, the health checks can be enabled and used by the regular service through which clients usually connect, while the headless service is only for dqlite-internal communication.

I think with that, all the problems are solved.

I am planning to explain the setup in detail in the readme of the project. I have already merged the branch: https://github.com/bluebrown/dqlite-kubernetes-demo. If you are interested, you can have a look and perhaps provide feedback. If it's all good, maybe we can link it in your documentation so that someone else does not have to go through the same hassle.