Smithx10 opened this issue 5 years ago
All gossip for these nodes is running over 8301:
ipfstat -t -P udp -G 9b11635d-453e-6ad4-f525-cf504ae5a541
Src: 0.0.0.0, Dest: 0.0.0.0, Proto: udp, Sorted by: # bytes
Source IP Destination IP ST PR #pkts #bytes ttl
10.1.1.107,8301 10.1.1.99,8301 0/0 udp 74 11032 1:59
10.1.1.107,8301 10.1.1.108,8301 0/0 udp 75 10888 1:57
10.1.1.107,8301 10.1.1.100,8301 0/0 udp 73 10487 0:10
10.1.1.107,8301 10.1.1.101,8301 0/0 udp 72 10332 0:03
10.1.1.107,8301 10.1.1.106,8301 0/0 udp 71 9591 0:06
10.1.1.107,8301 10.1.1.104,8301 0/0 udp 58 7565 1:54
10.1.1.107,8301 10.1.1.103,8301 0/0 udp 30 3938 1:59
10.1.1.107,8301 10.1.1.105,8301 0/0 udp 14 1841 1:54
10.1.1.107,8301 10.1.1.102,8301 0/0 udp 1 83 0:12
At this point I had to redeploy the same configuration as before, since I had implemented the fix by disabling a rule.
Here is an updated list of addresses / instances:
ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/rethinkdb ❯❯❯ itt ls -l | grep 'rethinkdb\|consul' ✘ 130 master ✖ ✱ ◼
2c70e9f0-fcdb-6634-aa51-e8d8ce367a5d consul-554bbdbfb5-pfd5s img-consul-k8s@1547829591 lx sample-512M running F 10.45.137.29 2019-01-18T22:37:13.876Z
01cfcf81-b2ac-6251-9c48-c2821449234e consul-554bbdbfb5-876r2 img-consul-k8s@1547829591 lx sample-512M running F 10.45.137.11 2019-01-18T22:37:16.550Z
841fc84f-1a39-4148-b923-8138b290b84d consul-554bbdbfb5-ztcq9 img-consul-k8s@1547829591 lx sample-512M running F 10.45.137.27 2019-01-18T22:37:19.062Z
181ad4f2-4bef-6536-e821-9da8923f164b consul-554bbdbfb5-r8h2j img-consul-k8s@1547829591 lx sample-512M running F 10.45.137.25 2019-01-18T22:37:23.129Z
1171054c-f076-c128-9415-f1595822509f consul-554bbdbfb5-gd95b img-consul-k8s@1547829591 lx sample-512M running F 10.45.137.28 2019-01-18T22:37:24.393Z
e5833a95-4c29-cf4d-cfa4-887d2ddc6514 rethinkdb-66744f7d-f9cl7 img-rethinkdb-master@1547844250 lx sample-512M running F 10.45.137.22 2019-01-18T22:38:18.924Z
930436bf-5dac-648b-c406-9fa784bdec5f rethinkdb-66744f7d-zjgpg img-rethinkdb-master@1547844250 lx sample-512M running F 10.45.137.21 2019-01-18T22:38:22.228Z
d5b48f22-db56-4e22-a744-c6df3255d3ab rethinkdb-66744f7d-mjnjg img-rethinkdb-master@1547844250 lx sample-512M running F 10.45.137.20 2019-01-18T22:38:25.932Z
63c85685-9eb0-eb14-a707-84febcda7769 rethinkdb-66744f7d-sm6hl img-rethinkdb-master@1547844250 lx sample-512M running F 10.45.137.24 2019-01-18T22:38:30.034Z
419bc275-4306-42ca-fc40-c06a21ee86be rethinkdb-66744f7d-z5qvr img-rethinkdb-master@1547844250 lx sample-512M running F 10.45.137.26 2019-01-18T22:38:31.871Z
ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ consul members master ✖ ✱ ◼
Node Address Status Type Build Protocol DC Segment
01cfcf81-b2ac-6251-9c48-c2821449234e 10.1.1.109:8301 alive server 1.3.0 2 dc1 <all>
1171054c-f076-c128-9415-f1595822509f 10.1.1.113:8301 alive server 1.3.0 2 dc1 <all>
181ad4f2-4bef-6536-e821-9da8923f164b 10.1.1.111:8301 alive server 1.3.0 2 dc1 <all>
2c70e9f0-fcdb-6634-aa51-e8d8ce367a5d 10.1.1.110:8301 alive server 1.3.0 2 dc1 <all>
841fc84f-1a39-4148-b923-8138b290b84d 10.1.1.112:8301 alive server 1.3.0 2 dc1 <all>
419bc275-4306-42ca-fc40-c06a21ee86be 10.1.1.117:8301 alive client 1.3.0 2 dc1 <default>
Working Client Rules:
[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# ipfstat -nio -G 419bc275-4306-42ca-fc40-c06a21ee86be
@1 pass out quick proto tcp from any to any flags S/SA keep state
@2 pass out proto tcp from any to any
@3 pass out proto udp from any to any keep state
@4 pass out quick proto icmp from any to any keep state
@5 pass out proto icmp from any to any
@1 pass in quick proto icmp from any to any keep frags
@2 pass in quick proto tcp from 10.1.1.109/32 to any keep frags
@3 pass in quick proto tcp from 10.45.137.11/32 to any keep frags
@4 pass in quick proto tcp from 10.1.1.110/32 to any keep frags
@5 pass in quick proto tcp from 10.45.137.29/32 to any keep frags
@6 pass in quick proto tcp from 10.1.1.111/32 to any keep frags
@7 pass in quick proto tcp from 10.45.137.25/32 to any keep frags
@8 pass in quick proto tcp from 10.1.1.112/32 to any keep frags
@9 pass in quick proto tcp from 10.45.137.27/32 to any keep frags
@10 pass in quick proto tcp from 10.1.1.113/32 to any keep frags
@11 pass in quick proto tcp from 10.45.137.28/32 to any keep frags
@12 pass in quick proto tcp from 10.1.1.114/32 to any keep frags
@13 pass in quick proto tcp from 10.45.137.21/32 to any keep frags
@14 pass in quick proto tcp from 10.1.1.115/32 to any keep frags
@15 pass in quick proto tcp from 10.45.137.20/32 to any keep frags
@16 pass in quick proto tcp from 10.1.1.116/32 to any keep frags
@17 pass in quick proto tcp from 10.45.137.24/32 to any keep frags
@18 pass in quick proto tcp from 10.1.1.117/32 to any keep frags
@19 pass in quick proto tcp from 10.45.137.26/32 to any keep frags
@20 pass in quick proto tcp from 10.1.1.118/32 to any keep frags
@21 pass in quick proto tcp from 10.45.137.22/32 to any keep frags
@22 pass in quick proto tcp from any to any port = ssh keep frags
@23 pass in quick proto tcp from any to any port = http-alt keep frags
@24 pass in quick proto tcp from any to any port = 28015 keep frags
@25 pass in quick proto udp from 10.1.1.109/32 to any keep frags
@26 pass in quick proto udp from 10.45.137.11/32 to any keep frags
@27 pass in quick proto udp from 10.1.1.110/32 to any keep frags
@28 pass in quick proto udp from 10.45.137.29/32 to any keep frags
@29 pass in quick proto udp from 10.1.1.111/32 to any keep frags
@30 pass in quick proto udp from 10.45.137.25/32 to any keep frags
@31 pass in quick proto udp from 10.1.1.112/32 to any keep frags
@32 pass in quick proto udp from 10.45.137.27/32 to any keep frags
@33 pass in quick proto udp from 10.1.1.113/32 to any keep frags
@34 pass in quick proto udp from 10.45.137.28/32 to any keep frags
@35 pass in quick proto udp from 10.1.1.114/32 to any keep frags
@36 pass in quick proto udp from 10.45.137.21/32 to any keep frags
@37 pass in quick proto udp from 10.1.1.115/32 to any keep frags
@38 pass in quick proto udp from 10.45.137.20/32 to any keep frags
@39 pass in quick proto udp from 10.1.1.116/32 to any keep frags
@40 pass in quick proto udp from 10.45.137.24/32 to any keep frags
@41 pass in quick proto udp from 10.1.1.117/32 to any keep frags
@42 pass in quick proto udp from 10.45.137.26/32 to any keep frags
@43 pass in quick proto udp from 10.1.1.118/32 to any keep frags
@44 pass in quick proto udp from 10.45.137.22/32 to any keep frags
@45 block in all
Non-Working Client Rules:
[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# ipfstat -nio -G 63c85685-9eb0-eb14-a707-84febcda7769
@1 pass out quick proto tcp from any to any flags S/SA keep state
@2 pass out proto tcp from any to any
@3 pass out proto udp from any to any keep state
@4 pass out quick proto icmp from any to any keep state
@5 pass out proto icmp from any to any
@1 pass in quick proto icmp from any to any keep frags
@2 pass in quick proto tcp from 10.1.1.109/32 to any keep frags
@3 pass in quick proto tcp from 10.45.137.11/32 to any keep frags
@4 pass in quick proto tcp from 10.1.1.110/32 to any keep frags
@5 pass in quick proto tcp from 10.45.137.29/32 to any keep frags
@6 pass in quick proto tcp from 10.1.1.111/32 to any keep frags
@7 pass in quick proto tcp from 10.45.137.25/32 to any keep frags
@8 pass in quick proto tcp from 10.1.1.112/32 to any keep frags
@9 pass in quick proto tcp from 10.45.137.27/32 to any keep frags
@10 pass in quick proto tcp from 10.1.1.113/32 to any keep frags
@11 pass in quick proto tcp from 10.45.137.28/32 to any keep frags
@12 pass in quick proto tcp from 10.1.1.114/32 to any keep frags
@13 pass in quick proto tcp from 10.45.137.21/32 to any keep frags
@14 pass in quick proto tcp from 10.1.1.115/32 to any keep frags
@15 pass in quick proto tcp from 10.45.137.20/32 to any keep frags
@16 pass in quick proto tcp from 10.1.1.116/32 to any keep frags
@17 pass in quick proto tcp from 10.45.137.24/32 to any keep frags
@18 pass in quick proto tcp from 10.1.1.117/32 to any keep frags
@19 pass in quick proto tcp from 10.45.137.26/32 to any keep frags
@20 pass in quick proto tcp from 10.1.1.118/32 to any keep frags
@21 pass in quick proto tcp from 10.45.137.22/32 to any keep frags
@22 pass in quick proto tcp from any to any port = ssh keep frags
@23 pass in quick proto tcp from any to any port = http-alt keep frags
@24 pass in quick proto tcp from any to any port = 28015 keep frags
@25 pass in quick proto udp from 10.1.1.109/32 to any keep frags
@26 pass in quick proto udp from 10.45.137.11/32 to any keep frags
@27 pass in quick proto udp from 10.1.1.110/32 to any keep frags
@28 pass in quick proto udp from 10.45.137.29/32 to any keep frags
@29 pass in quick proto udp from 10.1.1.111/32 to any keep frags
@30 pass in quick proto udp from 10.45.137.25/32 to any keep frags
@31 pass in quick proto udp from 10.1.1.112/32 to any keep frags
@32 pass in quick proto udp from 10.45.137.27/32 to any keep frags
@33 pass in quick proto udp from 10.1.1.113/32 to any keep frags
@34 pass in quick proto udp from 10.45.137.28/32 to any keep frags
@35 pass in quick proto udp from 10.1.1.114/32 to any keep frags
@36 pass in quick proto udp from 10.45.137.21/32 to any keep frags
@37 pass in quick proto udp from 10.1.1.115/32 to any keep frags
@38 pass in quick proto udp from 10.45.137.20/32 to any keep frags
@39 pass in quick proto udp from 10.1.1.116/32 to any keep frags
@40 pass in quick proto udp from 10.45.137.24/32 to any keep frags
@41 pass in quick proto udp from 10.1.1.117/32 to any keep frags
@42 pass in quick proto udp from 10.45.137.26/32 to any keep frags
@43 pass in quick proto udp from 10.1.1.118/32 to any keep frags
@44 pass in quick proto udp from 10.45.137.22/32 to any keep frags
@45 block in all
Snoop from the working and non-working zones:
[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# snoop -z 419bc275-4306-42ca-fc40-c06a21ee86be -d eth1
Using device eth1 (promiscuous mode)
10.1.1.117 -> 10.1.1.109 UDP D=8301 S=8301 LEN=63
10.1.1.109 -> 10.1.1.117 UDP D=8301 S=8301 LEN=160
10.1.1.117 -> * ARP C Who is 10.1.1.112, 10.1.1.112 ?
10.1.1.112 -> 10.1.1.117 ARP R 10.1.1.112, 10.1.1.112 is 90:b8:d0:39:6a:59
10.1.1.117 -> 10.1.1.111 UDP D=8301 S=8301 LEN=63
10.1.1.112 -> * ARP C Who is 10.1.1.117, 10.1.1.117 ?
10.1.1.117 -> 10.1.1.112 ARP R 10.1.1.117, 10.1.1.117 is 90:b8:d0:49:76:5
10.1.1.111 -> 10.1.1.117 UDP D=8301 S=8301 LEN=160
10.1.1.117 -> 10.1.1.110 TCP D=8300 S=56064 Push Ack=379882189 Seq=2261364876 Len=12 Win=33000 Options=<nop,nop,tstamp 445712767 445707768>
10.1.1.117 -> 10.1.1.110 TCP D=8300 S=56064 Push Ack=379882189 Seq=2261364888 Len=338 Win=33000 Options=<nop,nop,tstamp 445712767 445707768>
10.1.1.110 -> 10.1.1.117 TCP D=56064 S=8300 Ack=2261365226 Seq=379882189 Len=0 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
10.1.1.110 -> 10.1.1.117 TCP D=56064 S=8300 Push Ack=2261365226 Seq=379882189 Len=12 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
10.1.1.110 -> 10.1.1.117 TCP D=56064 S=8300 Push Ack=2261365226 Seq=379882201 Len=1491 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
10.1.1.117 -> 10.1.1.110 TCP D=8300 S=56064 Ack=379883692 Seq=2261365226 Len=0 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
10.1.1.109 -> 10.1.1.117 UDP D=8301 S=8301 LEN=63
10.1.1.117 -> 10.1.1.109 UDP D=8301 S=8301 LEN=160
^C[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# snoop -z 63c85685-9eb0-eb14-a707-84febcda7769 -d eth1
Using device eth1 (promiscuous mode)
10.1.1.113 -> 10.1.1.117 ARP R 10.1.1.113, 10.1.1.113 is 90:b8:d0:cd:5d:f6
10.1.1.116 -> 10.1.1.111 TCP D=8301 S=43198 Syn Seq=3786205279 Len=0 Win=32782 Options=<mss 8460,sackOK,tstamp 445737587 0,nop,wscale 5>
10.1.1.117 -> 10.1.1.109 ARP R 10.1.1.117, 10.1.1.117 is 90:b8:d0:49:76:5
10.1.1.109 -> 10.1.1.117 ARP R 10.1.1.109, 10.1.1.109 is 90:b8:d0:c2:f:48
10.1.1.113 -> 10.1.1.115 ARP R 10.1.1.113, 10.1.1.113 is 90:b8:d0:cd:5d:f6
10.1.1.116 -> 10.1.1.111 TCP D=8301 S=43198 Syn Seq=3786205279 Len=0 Win=32782 Options=<mss 8460,sackOK,tstamp 445738717 0,nop,wscale 5>
10.1.1.112 -> 10.1.1.114 ARP R 10.1.1.112, 10.1.1.112 is 90:b8:d0:39:6a:59
10.1.1.116 -> 10.1.1.111 TCP D=8301 S=43198 Syn Seq=3786205279 Len=0 Win=32782 Options=<mss 8460,sackOK,tstamp 445740977 0,nop,wscale 5>
10.1.1.116 -> * ARP C Who is 10.1.1.111, 10.1.1.111 ?
10.1.1.111 -> 10.1.1.116 ARP R 10.1.1.111, 10.1.1.111 is 90:b8:d0:bf:7f:eb
dig for the CNAME used to join:
[root@63c85685-9eb0-eb14-a707-84febcda7769 ~]# dig consul.consul.svc.cloudops-dev.us-east-1.cns.cloud.iqvia.net +short
10.1.1.110
10.1.1.112
10.1.1.113
10.1.1.111
10.1.1.109
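Since the CNAME resolves to all five servers from inside this zone, DNS is not the problem. A minimal sketch (assuming the consul binary is on the PATH inside the zone) to retry the LAN join against the resolved addresses directly, isolating serf/gossip reachability from name resolution:

```
# Hypothetical manual join from the non-joined zone, bypassing the CNAME.
consul join 10.1.1.110 10.1.1.112 10.1.1.113 10.1.1.111 10.1.1.109
consul members   # check whether the servers are now visible
```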
Restart of a non-joined node:
[root@63c85685-9eb0-eb14-a707-84febcda7769 ~]# systemctl restart containerpilot && journalctl -u containerpilot -fl
-- Logs begin at Fri 2019-01-18 22:38:42 UTC. --
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.514035849Z consul-agent 854829 2019/01/18 23:41:46 [INFO] agent: Waiting for endpoints to shut down
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.514063947Z consul-agent 854829 2019/01/18 23:41:46 [INFO] agent: Endpoints down
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.514085208Z consul-agent 854829 2019/01/18 23:41:46 [INFO] agent: Exit code: 0
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.51426164Z preStop 224222 Graceful leave complete
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540190595Z killing processes for job "preStart"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.54025145Z killing processes for job "rethinkdb"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540271279Z killing processes for job "rethinkdb-ui"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.54028618Z killing processes for job "consul-agent"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540302616Z killing processes for job "preStop"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540319625Z killing processes for job "rethinkdb-onchange"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:51.623788957Z control: serving at /var/run/containerpilot.socket
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.017980378Z consul-agent 226093 ==> Starting Consul agent...
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06214637Z consul-agent 226093 ==> Consul agent running!
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062175436Z consul-agent 226093 Version: 'v1.3.0'
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062209776Z consul-agent 226093 Node ID: 'c14b814a-faae-be15-f0d7-e56fb44d6156'
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062256946Z consul-agent 226093 Node name: '63c85685-9eb0-eb14-a707-84febcda7769'
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06227124Z consul-agent 226093 Datacenter: 'dc1' (Segment: '')
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062284342Z consul-agent 226093 Server: false (Bootstrap: false)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06232694Z consul-agent 226093 Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 53)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062340521Z consul-agent 226093 Cluster Addr: 10.1.1.116 (LAN: 8301, WAN: 8302)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06235274Z consul-agent 226093 Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062364678Z consul-agent 226093
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06237689Z consul-agent 226093 ==> Log data will now stream in as it occurs:
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062399689Z consul-agent 226093
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062411497Z consul-agent 226093 2019/01/18 23:41:52 [INFO] serf: EventMemberJoin: 63c85685-9eb0-eb14-a707-84febcda7769 10.1.1.116
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062425173Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: Started DNS server 0.0.0.0:53 (udp)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062437048Z consul-agent 226093 2019/01/18 23:41:52 [WARN] agent/proxy: running as root, will not start managed proxies
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062455592Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: Started DNS server 0.0.0.0:53 (tcp)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062472319Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062496475Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: started state syncer
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062567451Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s os packet scaleway softlayer triton vsphere
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062641034Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: Joining LAN cluster...
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062664965Z consul-agent 226093 2019/01/18 23:41:52 [INFO] agent: (LAN) joining: [consul.consul.svc.cloudops-dev.us-east-1.cns.cloud.iqvia.net]
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.069657015Z consul-agent 226093 2019/01/18 23:41:52 [WARN] manager: No servers available
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.069674607Z consul-agent 226093 2019/01/18 23:41:52 [ERR] agent: failed to sync remote state: No known Consul servers
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.642890839Z consul-agent 226093 2019/01/18 23:41:56 [WARN] manager: No servers available
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.642956256Z consul-agent 226093 2019/01/18 23:41:56 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1, error: No known Consul servers from=[::1]:49748
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.644287337Z failed to query rethinkdb: Unexpected response code: 500 (No known Consul servers) [<nil>]
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.693320449Z preStart 227178 2019-01-18 23:41:56 preStart: /var/lib/rethinkdb/data has data, skipping database init
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.700411971Z preStart 227178 2019-01-18 23:41:56 preStart: Rendering consul template
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.770806307Z consul-agent 226093 2019/01/18 23:41:56 [WARN] manager: No servers available
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.821432005Z consul-agent 226093 2019/01/18 23:41:56 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1&stale=&wait=60000ms, error: No known Consul servers from=127.0.0.1:63644
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.821547693Z preStart 227178 2019/01/18 23:41:56.817312 [WARN] (view) health.service(rethinkdb|passing): Unexpected response code: 500 (No known Consul servers) (retry attempt 1 after "250ms")
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.068086202Z consul-agent 226093 2019/01/18 23:41:57 [WARN] manager: No servers available
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.068121457Z consul-agent 226093 2019/01/18 23:41:57 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1&stale=&wait=60000ms, error: No known Consul servers from=127.0.0.1:63644
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.070205767Z preStart 227178 2019/01/18 23:41:57.070156 [WARN] (view) health.service(rethinkdb|passing): Unexpected response code: 500 (No known Consul servers) (retry attempt 2 after "500ms")
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.572064034Z consul-agent 226093 2019/01/18 23:41:57 [WARN] manager: No servers available
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.572104351Z consul-agent 226093 2019/01/18 23:41:57 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1&stale=&wait=60000ms, error: No known Consul servers from=127.0.0.1:63644
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.57300455Z preStart 227178 2019/01/18 23:41:57.572951 [WARN] (view) health.service(rethinkdb|passing): Unexpected response code: 500 (No known Consul servers) (retry attempt 3 after "1s")
Just hit this issue again while using Triton, this time with TCP and SSH.
Platform
[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# uname -a
SunOS f8-f2-1e-3b-09-c4 5.11 joyent_20190314T022529Z i86pc i386 i86pc
Example
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt ls -l master ◼
ID NAME IMG BRAND PACKAGE STATE FLAGS PRIMARYIP CREATED
bdf777bf-f50a-e420-af1c-d3df05cf4317 job-group-task-c4f58d86 centos-7@20180323 lx sample-256M running F 10.45.136.227 2019-04-01T02:14:27.731Z
a2771ed9-5d7f-e142-ec32-c17e1cf9fee7 job-group-task-b2bba527 centos-7@20180323 lx sample-256M running F 10.45.136.239 2019-04-01T02:14:30.376Z
d524c063-4be8-eb16-e3e0-9a8c6b904e90 job-group-task-affc51c3 centos-7@20180323 lx sample-256M running F 10.45.136.96 2019-04-01T02:14:34.216Z
56e54750-b241-6568-cc5a-f02d5e914b26 job-group-task-5dfb4c9f centos-7@20180323 lx sample-256M running F 10.45.136.177 2019-04-01T02:15:06.994Z
8dc58f90-afbb-48bf-e4c4-80c2350efcc3 job-group-task-7459d4f8 centos-7@20180323 lx sample-256M running F 10.45.136.237 2019-04-01T02:15:10.770Z
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt fwrules master ◼
SHORTID ENABLED GLOBAL RULE
d8ddb2fb true - FROM any TO tag "fwtag" ALLOW tcp (PORT 22 AND PORT 8080)
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt inst get job-group-task-b2bba527 | jq .tags master ◼
{
"fwtag": "true",
"triton.cns.services": "rawrsauce"
}
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ nmap -p 22 10.45.136.239 -Pn master ◼
Starting Nmap 7.70 ( https://nmap.org ) at 2019-03-31 22:22 EDT
Nmap scan report for job-group-task-b2bba527.inst.bruce-dev.us-east-1.bdf-cloud.iqvia.net (10.45.136.239)
Host is up.
PORT STATE SERVICE
22/tcp filtered ssh
Nmap done: 1 IP address (1 host up) scanned in 2.04 seconds
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt inst get job-group-task-c4f58d86 | jq .tags master ◼
{
"fwtag": "true",
"triton.cns.services": "rawrsauce"
}
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ nmap -p 22 10.45.136.227 -Pn master ◼
Starting Nmap 7.70 ( https://nmap.org ) at 2019-03-31 22:23 EDT
Nmap scan report for job-group-task-c4f58d86.inst.bruce-dev.us-east-1.bdf-cloud.iqvia.net (10.45.136.227)
Host is up (0.00019s latency).
PORT STATE SERVICE
22/tcp open ssh
Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
Enabling and disabling the firewall through Triton / the CN clears up the non-working fwrule.
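For reference, a sketch of the toggles I mean, assuming the node-triton CLI from the client and vmadm in the global zone (the instance name and UUID below are taken from the listing above):

```
# Via CloudAPI:
triton instance disable-firewall job-group-task-b2bba527
triton instance enable-firewall job-group-task-b2bba527

# Or directly on the CN:
vmadm update a2771ed9-5d7f-e142-ec32-c17e1cf9fee7 firewall_enabled=false
vmadm update a2771ed9-5d7f-e142-ec32-c17e1cf9fee7 firewall_enabled=true
```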
Dumb question: are these zones native or LX? If LX, there are some additional tests we MIGHT be able to run.
@danmcd I was using lx with the following images:
7b5981c4-1889-11e7-b4c5-3f3bdfc9b88b
3dbbdcca-2eab-11e8-b925-23bf77789921
To add some context, I am deploying all of these asynchronously, so the requests are coming into CloudAPI all at the same time. Not sure if there is a race or something.
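Roughly what that looks like, as a sketch (the image, package, and tag are stand-ins for the real ones; --firewall and --tag are node-triton flags as I understand them):

```
# Fire off all provisions concurrently so CloudAPI sees them at the same time.
for i in 1 2 3 4 5; do
  triton instance create --firewall -t fwtag=true centos-7 sample-256M &
done
wait
```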
First off, thanks for the information that it's LX; that's useful. After doing a bit of diving (now knowing it's LX), however, it's not clear to me whether there's a way to find the race easily.
@danmcd I was able to reproduce this 3 times in a row today; let me know if you want me to grab some state or more information.
The volume of detail up top makes it hard for me to understand the exact problem. I saw the snoops above, and the non-working one is sending packets that appear never to reach the peer (assuming your snoops are correct). Make sure you do single pings in both directions, and a single TCP connection in both directions, while snooping. That'll help narrow things down a lot.
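Concretely, something like the following, with addresses taken from the listing above (a sketch; the snoop filter expression and the bash /dev/tcp trick are assumptions about what's available on the CN and in the LX zones):

```
# On the CN: watch the suspect zone's NIC while testing.
snoop -z 63c85685-9eb0-eb14-a707-84febcda7769 -d eth1 host 10.1.1.111

# In the zone: one ping, then one TCP connection to the gossip port.
ping -c 1 10.1.1.111
timeout 3 bash -c 'echo > /dev/tcp/10.1.1.111/8301' && echo open || echo blocked

# Then repeat both from 10.1.1.111 back toward this zone.
```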
One thing I noticed was a CFW rule containing "(PORT 22 AND PORT 8080)". AIUI, this means the TCP traffic must contain both port 22 AND port 8080. Am I wrong? (I don't know CFW that well, it's higher-level than where I normally hang out.)
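If the AND really were a per-packet conjunction, splitting the rule would be a quick way to rule that out; a sketch, assuming the node-triton fwrule subcommands:

```
# One rule per port instead of the combined form.
triton fwrule create 'FROM any TO tag "fwtag" ALLOW tcp PORT 22'
triton fwrule create 'FROM any TO tag "fwtag" ALLOW tcp PORT 8080'
triton fwrule delete d8ddb2fb   # the combined rule from the fwrules listing above
```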
I'll try to clear up the scenario a bit.
Currently, in the environment where the behavior is occurring, I only have 2 CNs. Most of the time the provisioned instances land on the same CN, from what I've gathered, but not always.
I am provisioning all the instances at the same time with CloudAPI, using instance tags for the firewall rules.
When the instances came up, some of them honored the FW rule and some didn't. What is strange is that just by disabling the firewall on 1 of the instances, all of the other instances started to honor the rule and began passing traffic.
If I disable and re-enable the fw rule, they all start working fine as well, so I don't think it's the way the rule is written.
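The toggle I mean is just the following, using the rule's short ID from the fwrules listing above:

```
triton fwrule disable d8ddb2fb
triton fwrule enable d8ddb2fb
```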
I'll attempt this in the morning, 1 provision at a time, and see if it happens.
"What is strange, is that by just disabling the firewall on 1 of the instances, all of the other instances started to honor the rule and started passing traffic."
I'm guessing fwadm may do a brute-force reset of some kind. I'm very curious if there's a way to follow the bouncing packet in a zone whose fw rules appear to be in place, but aren't, per your description earlier. (I'm happy to help with this, but it'll require global-zone dtrace access on the CN with the faulty VM.)
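One comparison that might show whether such a reset is happening, from the global zone on the CN (a sketch; it assumes fwadm's status and stats subcommands behave as documented, using the non-working VM's UUID):

```
# Compare fwadm's view of the zone firewall with the in-kernel ipf view.
fwadm status 63c85685-9eb0-eb14-a707-84febcda7769    # is the firewall running for this VM?
fwadm stats 63c85685-9eb0-eb14-a707-84febcda7769     # per-rule hit counters
ipfstat -nio -G 63c85685-9eb0-eb14-a707-84febcda7769 # rules actually loaded in the kernel
```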
Note: `itt` is aliased to `triton -i`.
While deploying Consul with the firewall enabled, I noticed some very strange behavior.
Deploying Consul Masters with firewall group "k8s_rethinkdb":
Masters deployed and formed a healthy cluster:
Deploying the 5 RethinkDB nodes, which run the consul agent that attempts to gossip on boot:
Firewall tags applied on creation:
fwadm list on the CN that all of these instances are on:
Only 2 of the rethinkdb clients joined:
Disabling the firewall rule for 1 of the clients that successfully joined the cluster:
All the other members are able to join the cluster ¯\_(ツ)_/¯