cloudfoundry-incubator / quarks-operator

BOSH releases deployed on Kubernetes
https://www.cloudfoundry.org/project-quarks/
Apache License 2.0

Randomly missing one or several route connections between nodes in the NATS cluster, due to an incorrect nats entry in DNS #1221

Closed. chenxpcn closed this issue 4 years ago.

chenxpcn commented 4 years ago

Describe the bug: Deployed 2 NATS pods, nats-z0-0 and nats-z1-0, in a multi-zone environment with TLS enabled on both. Each NATS server node is configured to connect to the other 3 NATS nodes, so there should be 6 route connections in total, but one or several route connections are always missing.
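For reference, the four server processes (nats on port 4223 and nats-tls on port 4225 in each pod, per the route table below) form a full mesh, so the 6 expected route connections are:

    nats-z0-0:nats     <--> nats-z0-0:nats-tls
    nats-z0-0:nats     <--> nats-z1-0:nats
    nats-z0-0:nats     <--> nats-z1-0:nats-tls
    nats-z0-0:nats-tls <--> nats-z1-0:nats
    nats-z0-0:nats-tls <--> nats-z1-0:nats-tls
    nats-z1-0:nats     <--> nats-z1-0:nats-tls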

To Reproduce: Two NATS pods:

nats-z0-0                                7/7     Running     0          133m    198.18.21.32    10.240.1.15     <none>           <none>
nats-z1-0                                7/7     Running     0          133m    198.18.58.136   10.240.64.62    <none>           <none>

Route table in nats.conf:

routes = [

    nats-route://nats:some-passwords@nats-z0-0.nats.service.cf.internal:4223

    nats-route://nats:some-passwords@nats-z1-0.nats.service.cf.internal:4223

    nats-route://nats:some-passwords@nats-z0-0.nats.service.cf.internal:4225

    nats-route://nats:some-passwords@nats-z1-0.nats.service.cf.internal:4225

  ]
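If the NATS monitoring endpoint is enabled on these servers (conventionally port 8222; an assumption for this deployment), the established routes can also be inspected directly instead of reading socket tables:

    # hypothetical check, assuming curl and jq are available in the container
    curl -s http://localhost:8222/routez | jq '.routes[] | {remote_id, ip, port}'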

Route connection nats-z1-0:nats <--> nats-z1-0:nats-tls is missing. Established sockets per process (captured with ss, e.g. ss -tnp):

nats-z0-0: nats
tcp    ESTAB      0      0      198.18.21.32:4223               198.18.58.136:50170               users:(("gnatsd",pid=14,fd=23))
tcp    ESTAB      0      0      198.18.21.32:4223               198.18.21.32:47778               users:(("gnatsd",pid=14,fd=10))
tcp    ESTAB      0      0      198.18.21.32:4223               198.18.58.136:50160               users:(("gnatsd",pid=14,fd=19))
nats-z0-0: nats-tls
tcp    ESTAB      0      0      198.18.21.32:47778              198.18.21.32:4223                users:(("gnatsd",pid=18,fd=15))
tcp    ESTAB      0      0      198.18.21.32:4225               198.18.58.136:49396               users:(("gnatsd",pid=18,fd=12))
tcp    ESTAB      0      0      198.18.21.32:4225               198.18.58.136:49406               users:(("gnatsd",pid=18,fd=16))
nats-z1-0: nats
tcp    ESTAB      0      0      198.18.58.136:50160              198.18.21.32:4223                users:(("gnatsd",pid=16,fd=10))
tcp    ESTAB      0      0      198.18.58.136:49396              198.18.21.32:4225                users:(("gnatsd",pid=16,fd=7))
nats-z1-0: nats-tls
tcp    ESTAB      0      0      198.18.58.136:50170              198.18.21.32:4223                users:(("gnatsd",pid=15,fd=11))
tcp    ESTAB      0      0      198.18.58.136:49406              198.18.21.32:4225                users:(("gnatsd",pid=15,fd=8))

After restarting the NATS pods, nats-z1-0:nats <--> nats-z1-0:nats-tls might get connected, but other route connections will then be missing instead.
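(A hypothetical way to trigger the restart, assuming the kubecf namespace implied by the cluster-local names below:)

    # delete one NATS pod; its owning StatefulSet should recreate it
    kubectl delete pod nats-z1-0 --namespace kubecf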

It looks like the issue is caused by incorrect DNS entries.

Looking up nats-z1-0 from nats-z1-0:nats:

/:/var/vcap/jobs/nats# dig nats-z1-0.nats.service.cf.internal

; <<>> DiG 9.11.2 <<>> nats-z1-0.nats.service.cf.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41767
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 43ac7631b610e9e5 (echoed)
;; QUESTION SECTION:
;nats-z1-0.nats.service.cf.internal. IN A

;; ANSWER SECTION:
nats-z1-0.nats.service.cf.internal. 30 IN CNAME nats-z1-0.kubecf.svc.cluster.local.
nats-z1-0.kubecf.svc.cluster.local. 30 IN A 198.18.58.136
nats-z1-0.kubecf.svc.cluster.local. 30 IN A 198.18.21.32

;; Query time: 6 msec
;; SERVER: 198.19.171.76#53(198.19.171.76)
;; WHEN: Fri Oct 30 09:46:48 UTC 2020
;; MSG SIZE  rcvd: 207

2 IPs are returned, even though nats-z1-0 is a single pod.

It seems that when nats-z1-0:nats tries to connect to nats-z1-0:nats-tls, the connection is established normally if the correct IP is returned. But if the incorrect IP is returned, nats-z1-0:nats finds that the resulting connection (nats-z1-0:nats <--> nats-z0-0:nats-tls) is already established, so it ignores or drops the new connection.
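Since both A records come back on every query, which address a server actually dials depends on the order the resolver hands them back, which would explain why the failure is intermittent. An illustrative way to watch the record order vary (assuming dig is available in the job container, as above):

    # repeated lookups; the order of the two A records can change between runs
    for i in 1 2 3 4 5; do
      dig +short nats-z1-0.nats.service.cf.internal | tail -2
    done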

After manually swapping the general nats entry and the nats instance-group entries in bosh-dns, lookups return the correct service IP (see the note on the match-pattern overlap after the records below). The current DNS records are:

template IN A nats.service.cf.internal {
    match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local"
    upstream
}
template IN AAAA nats.service.cf.internal {
    match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local"
    upstream
}
template IN CNAME nats.service.cf.internal {
    match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local"
    upstream
}

template IN A nats-z0-0.nats.service.cf.internal {
    match ^nats-z0-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local"
    upstream
}
template IN AAAA nats-z0-0.nats.service.cf.internal {
    match ^nats-z0-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local"
    upstream
}
template IN CNAME nats-z0-0.nats.service.cf.internal {
    match ^nats-z0-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local"
    upstream
}

template IN A nats-z1-0.nats.service.cf.internal {
    match ^nats-z1-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local"
    upstream
}
template IN AAAA nats-z1-0.nats.service.cf.internal {
    match ^nats-z1-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local"
    upstream
}
template IN CNAME nats-z1-0.nats.service.cf.internal {
    match ^nats-z1-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local"
    upstream
}
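A plausible reason the ordering matters: the match pattern on the general nats templates is a catch-all that also covers the per-instance names, so if templates are answered first-match-wins, the general entry can shadow the per-instance entries. The overlap is easy to confirm (pattern copied from the template above, with the string escaping adjusted for grep; illustrative only):

    # the general template's catch-all pattern also matches the per-instance name
    echo 'nats-z1-0.nats.service.cf.internal.' | \
      grep -E '^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$'
    # the name is printed, i.e. it matches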
The modified DNS records are:

template IN A nats-z0-0.nats.service.cf.internal {
    match ^nats-z0-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local"
    upstream
}
template IN AAAA nats-z0-0.nats.service.cf.internal {
    match ^nats-z0-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local"
    upstream
}
template IN CNAME nats-z0-0.nats.service.cf.internal {
    match ^nats-z0-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local"
    upstream
}

template IN A nats-z1-0.nats.service.cf.internal {
    match ^nats-z1-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local"
    upstream
}
template IN AAAA nats-z1-0.nats.service.cf.internal {
    match ^nats-z1-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local"
    upstream
}
template IN CNAME nats-z1-0.nats.service.cf.internal {
    match ^nats-z1-0\.nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local"
    upstream
}

template IN A nats.service.cf.internal {
    match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local"
    upstream
}
template IN AAAA nats.service.cf.internal {
    match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local"
    upstream
}
template IN CNAME nats.service.cf.internal {
    match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*nats\.service\.cf\.internal\.$
    answer "{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local"
    upstream
}

So this should be an operator issue or a deployment configuration issue.

Now the lookup result is:

/:/var/vcap/jobs/nats# dig nats-z1-0.nats.service.cf.internal

; <<>> DiG 9.11.2 <<>> nats-z1-0.nats.service.cf.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57411
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 26052335b64df3fc (echoed)
;; QUESTION SECTION:
;nats-z1-0.nats.service.cf.internal. IN A

;; ANSWER SECTION:
nats-z1-0.nats.service.cf.internal. 30 IN CNAME nats-z1-0.kubecf.svc.cluster.local.
nats-z1-0.kubecf.svc.cluster.local. 30 IN A 198.19.46.148

;; Query time: 6 msec
;; SERVER: 198.19.171.76#53(198.19.171.76)
;; WHEN: Mon Nov 02 08:33:41 UTC 2020
;; MSG SIZE  rcvd: 207

Expected behavior: Every NATS server node should have 3 route connections to the other NATS server nodes.

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175551311

The labels on this github issue will be updated when the story is started.

chenxpcn commented 4 years ago

Moved from https://github.com/cloudfoundry-incubator/kubecf/issues/1535

chenxpcn commented 4 years ago

Duplicate of https://github.com/cloudfoundry-incubator/quarks-operator/issues/1118