Describe the bug
Deployed 2 NATS pods in a multi-zone environment, nats-z0-0 and nats-z1-0, with TLS enabled on both. Each NATS server node (nats and nats-tls on each pod) is configured to connect to the other 3 NATS nodes, so there should be 6 route connections in total, but one or more route connections are always missing.
After restarting a NATS pod, nats-z1-0:nats <--> nats-z1-0:nats-tls may connect, but other connections are then missing.
The issue appears to be caused by incorrect DNS entries.
Looking up nats-z1-0 from nats-z1-0:nats returns:
/:/var/vcap/jobs/nats# dig nats-z1-0.nats.service.cf.internal
; <<>> DiG 9.11.2 <<>> nats-z1-0.nats.service.cf.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41767
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 43ac7631b610e9e5 (echoed)
;; QUESTION SECTION:
;nats-z1-0.nats.service.cf.internal. IN A
;; ANSWER SECTION:
nats-z1-0.nats.service.cf.internal. 30 IN CNAME nats-z1-0.kubecf.svc.cluster.local.
nats-z1-0.kubecf.svc.cluster.local. 30 IN A 198.18.58.136
nats-z1-0.kubecf.svc.cluster.local. 30 IN A 198.18.21.32
;; Query time: 6 msec
;; SERVER: 198.19.171.76#53(198.19.171.76)
;; WHEN: Fri Oct 30 09:46:48 UTC 2020
;; MSG SIZE rcvd: 207
Two IPs are returned.
It seems that when nats-z1-0:nats tries to connect to nats-z1-0:nats-tls and the correct IP is returned, the connection is established normally. If the incorrect IP is returned, nats-z1-0:nats finds that the connection (nats-z1-0:nats <--> nats-z0-0:nats-tls) has already been established, so it ignores or drops the new connection.
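A quick way to check which route hostnames return more than one address is sketched below. This is only an illustration: the helper names `resolve_ipv4` and `ambiguous_hosts` are made up for this example, and it assumes the pod's resolver is the same one the NATS server uses.

```python
import socket


def resolve_ipv4(host):
    """Return the sorted IPv4 addresses the local resolver gives for host."""
    try:
        infos = socket.getaddrinfo(
            host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Name does not resolve from where this runs.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def ambiguous_hosts(hosts, resolve=resolve_ipv4):
    """Map every host that resolves to more than one address to its address list."""
    result = {}
    for host in hosts:
        addrs = resolve(host)
        if len(addrs) > 1:
            result[host] = addrs
    return result
```

Running `ambiguous_hosts(["nats-z0-0.nats.service.cf.internal", "nats-z1-0.nats.service.cf.internal"])` inside the pod should return an empty dict when every instance name maps to a single address.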
After manually swapping the order of the general nats entry and the nats instance group entries in bosh-dns, the lookup returns the correct nats service IP.
The current DNS records:
template
IN A nats.service.cf.internal {
match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])\\.)*nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local\"
upstream
}
template
IN AAAA nats.service.cf.internal {
match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])\\.)*nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local\"
upstream
}
template
IN CNAME nats.service.cf.internal {
match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])\\.)*nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local\"
upstream
}
template
IN A nats-z0-0.nats.service.cf.internal {
match ^nats-z0-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local\"
upstream
}
template
IN AAAA nats-z0-0.nats.service.cf.internal {
match ^nats-z0-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local\"
upstream
}
template
IN CNAME nats-z0-0.nats.service.cf.internal {
match ^nats-z0-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local\"
upstream
}
template
IN A nats-z1-0.nats.service.cf.internal {
match ^nats-z1-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local\"
upstream
}
template
IN AAAA nats-z1-0.nats.service.cf.internal {
match ^nats-z1-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local\"
upstream
}
template
IN CNAME nats-z1-0.nats.service.cf.internal {
match ^nats-z1-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local\"
upstream
}
The modified DNS records:
template
IN A nats-z0-0.nats.service.cf.internal {
match ^nats-z0-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local\"
upstream
}
template
IN AAAA nats-z0-0.nats.service.cf.internal {
match ^nats-z0-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local\"
upstream
}
template
IN CNAME nats-z0-0.nats.service.cf.internal {
match ^nats-z0-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z0-0.kubecf.svc.cluster.local\"
upstream
}
template
IN A nats-z1-0.nats.service.cf.internal {
match ^nats-z1-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local\"
upstream
}
template
IN AAAA nats-z1-0.nats.service.cf.internal {
match ^nats-z1-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local\"
upstream
}
template
IN CNAME nats-z1-0.nats.service.cf.internal {
match ^nats-z1-0\\.nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats-z1-0.kubecf.svc.cluster.local\"
upstream
}
template
IN A nats.service.cf.internal {
match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])\\.)*nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local\"
upstream
}
template
IN AAAA nats.service.cf.internal {
match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])\\.)*nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local\"
upstream
}
template
IN CNAME nats.service.cf.internal {
match ^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])\\.)*nats\\.service\\.cf\\.internal\\.$
answer
\"{{ .Name }} 60 IN CNAME nats.kubecf.svc.cluster.local\"
upstream
}
So this looks like an operator issue or a deployment configuration issue.
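The overlap between the entries can be seen with a small sketch. It only checks which `match` patterns accept a given name, in listed order; whether the resolver actually prefers the earlier template is an assumption here, not something confirmed in this report. The function name `matching_targets` is made up for this example, and the patterns are the ones above with the extra escaping removed.

```python
import re

# Patterns copied from the records above, with the string escaping removed.
GENERAL = (
    r"^(([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\.)*"
    r"nats\.service\.cf\.internal\.$"
)
Z0 = r"^nats-z0-0\.nats\.service\.cf\.internal\.$"
Z1 = r"^nats-z1-0\.nats\.service\.cf\.internal\.$"

GENERAL_RULE = (GENERAL, "nats.kubecf.svc.cluster.local")
Z0_RULE = (Z0, "nats-z0-0.kubecf.svc.cluster.local")
Z1_RULE = (Z1, "nats-z1-0.kubecf.svc.cluster.local")

current_order = [GENERAL_RULE, Z0_RULE, Z1_RULE]
modified_order = [Z0_RULE, Z1_RULE, GENERAL_RULE]


def matching_targets(rules, qname):
    """Return the CNAME targets of all rules whose pattern matches qname, in order."""
    return [target for pattern, target in rules if re.match(pattern, qname)]


qname = "nats-z1-0.nats.service.cf.internal."
print(matching_targets(current_order, qname))
print(matching_targets(modified_order, qname))
```

Both orders match the same two rules for the instance name; only the position of the general entry changes, which is consistent with the swap affecting the lookup result.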
With the modified records, the lookup result is:
/:/var/vcap/jobs/nats# dig nats-z1-0.nats.service.cf.internal
; <<>> DiG 9.11.2 <<>> nats-z1-0.nats.service.cf.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57411
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 26052335b64df3fc (echoed)
;; QUESTION SECTION:
;nats-z1-0.nats.service.cf.internal. IN A
;; ANSWER SECTION:
nats-z1-0.nats.service.cf.internal. 30 IN CNAME nats-z1-0.kubecf.svc.cluster.local.
nats-z1-0.kubecf.svc.cluster.local. 30 IN A 198.19.46.148
;; Query time: 6 msec
;; SERVER: 198.19.171.76#53(198.19.171.76)
;; WHEN: Mon Nov 02 08:33:41 UTC 2020
;; MSG SIZE rcvd: 207
Expected behavior
Every NATS server node should have 3 route connections, one to each of the other NATS server nodes.
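For reference, the expected full mesh can be enumerated with a short sketch. The four node labels are assumed from the pod:job notation used in this report (nats and nats-tls on each of the two pods); `expected_routes` is a name made up for this example.

```python
from itertools import combinations

# Assumed node labels: one nats and one nats-tls server per pod.
nodes = [
    "nats-z0-0:nats",
    "nats-z0-0:nats-tls",
    "nats-z1-0:nats",
    "nats-z1-0:nats-tls",
]


def expected_routes(nodes):
    """Return every unordered pair of nodes, i.e. the routes of a full mesh."""
    return list(combinations(nodes, 2))


routes = expected_routes(nodes)
for a, b in routes:
    print(f"{a} <--> {b}")
```

With 4 nodes this yields 4 * 3 / 2 = 6 routes, and each node appears in exactly 3 of them.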
To Reproduce
2 nats pods
Route table in nats.conf
Missing route connection: nats-z1-0:nats <--> nats-z1-0:nats-tls