TritonDataCenter/illumos-joyent

Community developed and maintained version of the OS/Net consolidation
http://www.illumos.org/projects/illumos-gate

Strange UDP firewall behaviour #187

Open Smithx10 opened 5 years ago

Smithx10 commented 5 years ago

note: itt is aliased to triton -i
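For reference, a minimal sketch of that shorthand (assuming a bash/zsh shell profile; the plain "tt" shorthand used in a later comment is presumably the same tool without -i):

alias itt='triton -i'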

While deploying Consul with the firewall enabled I noticed a very strange behavior.

Deploying Consul Masters with firewall group "k8s_rethinkdb":

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ kubectl apply -f triton-k8s/deployment.yml                                         master ✖ ✱ ◼
deployment.apps/consul created

Masters Deployed and formed a healthy cluster:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ kubectl get pods                                                                   master ✖ ✱ ◼
NAME                      READY   STATUS    RESTARTS   AGE
consul-554bbdbfb5-2f4dc   1/1     Running   0          1m
consul-554bbdbfb5-k8rpq   1/1     Running   0          1m
consul-554bbdbfb5-lp685   1/1     Running   0          1m
consul-554bbdbfb5-tk5mv   1/1     Running   0          1m
consul-554bbdbfb5-zgrwf   1/1     Running   0          1m

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ triton -i ls | grep consul                                                   ✘ 130 master ✖ ✱ ◼
ab9c81e1  consul-554bbdbfb5-tk5mv  img-consul-k8s@1547829591        running  F      2m
58445389  consul-554bbdbfb5-k8rpq  img-consul-k8s@1547829591        running  F      2m
a50377a6  consul-554bbdbfb5-zgrwf  img-consul-k8s@1547829591        running  F      2m
aa5ada96  consul-554bbdbfb5-lp685  img-consul-k8s@1547829591        running  F      2m
c8d4dddc  consul-554bbdbfb5-2f4dc  img-consul-k8s@1547829591        running  F      2m

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ consul members                                                                     master ✖ ✱ ◼
Node                                  Address          Status  Type    Build  Protocol  DC   Segment
58445389-5efd-4993-81fe-f2fe7c2d67de  10.1.1.101:8301  alive   server  1.3.0  2         dc1  <all>
a50377a6-cb4a-e131-848c-d87f014226e4  10.1.1.102:8301  alive   server  1.3.0  2         dc1  <all>
aa5ada96-e11d-e64a-ed75-a99964e574c8  10.1.1.99:8301   alive   server  1.3.0  2         dc1  <all>
ab9c81e1-2c00-c66e-9577-d38f2026a0a0  10.1.1.103:8301  alive   server  1.3.0  2         dc1  <all>
c8d4dddc-98d6-ccfb-ec43-a8ad294d6903  10.1.1.100:8301  alive   server  1.3.0  2         dc1  <all>

Deploying the 5 RethinkDB nodes, which run the Consul agent and attempt to gossip on boot:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/rethinkdb ❯❯❯ kubectl apply -f triton-k8s/deployment.yml                                      master ✖ ✱ ◼
deployment.apps/rethinkdb created

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/rethinkdb ❯❯❯ kubectl get pods                                                                master ✖ ✱ ◼
NAME                       READY   STATUS    RESTARTS   AGE
consul-554bbdbfb5-2f4dc    1/1     Running   0          9m
consul-554bbdbfb5-k8rpq    1/1     Running   0          9m
consul-554bbdbfb5-lp685    1/1     Running   0          9m
consul-554bbdbfb5-tk5mv    1/1     Running   0          9m
consul-554bbdbfb5-zgrwf    1/1     Running   0          9m
rethinkdb-66744f7d-22s97   1/1     Running   0          1m
rethinkdb-66744f7d-8mpdn   1/1     Running   0          1m
rethinkdb-66744f7d-kl722   1/1     Running   0          1m
rethinkdb-66744f7d-l52sv   1/1     Running   0          1m
rethinkdb-66744f7d-zv6n9   1/1     Running   0          1m

Firewall tags applied at creation:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ itt inst ls  | grep consul | awk '{print $1}' | xargs -i triton -i inst get {} | grep k8s_rethinkdb
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/rethinkdb ❯❯❯ itt inst ls  | grep rethinkdb | awk '{print $1}' | xargs -i triton -i inst get {} | grep k8s_rethinkdb
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",
        "k8s_rethinkdb": "true",

fwadm list on the CN that all of these instances are on:

[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# fwadm list | grep k8s_rethinkdb
02a0f947-675c-4ef3-ac6f-c2ecee9b7d16 true    FROM tag "k8s_rethinkdb" TO tag "k8s_rethinkdb" ALLOW udp PORT all      
5e303eb7-d628-410d-95fa-512d20d49040 true    FROM tag "k8s_rethinkdb" TO tag "k8s_rethinkdb" ALLOW tcp PORT all     

[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# fwadm vms 02a0f947-675c-4ef3-ac6f-c2ecee9b7d16
ab9c81e1-2c00-c66e-9577-d38f2026a0a0
58445389-5efd-4993-81fe-f2fe7c2d67de
a50377a6-cb4a-e131-848c-d87f014226e4
aa5ada96-e11d-e64a-ed75-a99964e574c8
c8d4dddc-98d6-ccfb-ec43-a8ad294d6903
f5838f85-a14f-cff1-c0c0-b55acec30eed
803cc5d6-dc37-cb54-ed0d-a05a795684fa
9b11635d-453e-6ad4-f525-cf504ae5a541
06240bbb-7e53-47ce-b0d7-b892806a3f4f
038e8b85-9759-6f09-cefd-98943f2b33d6
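The two tag-to-tag rules above are CloudAPI-level cloud firewall rules; the exact creation commands aren't shown in this issue, but a hedged sketch of creating them with node-triton (an assumption, reusing the same k8s_rethinkdb tag) would be:

# hypothetical creation of the rules listed by fwadm above
triton fwrule create 'FROM tag "k8s_rethinkdb" TO tag "k8s_rethinkdb" ALLOW udp PORT all'
triton fwrule create 'FROM tag "k8s_rethinkdb" TO tag "k8s_rethinkdb" ALLOW tcp PORT all'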

Only 2 of the rethinkdb clients joined:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ consul members                                                                     master ✖ ✱ ◼
Node                                  Address          Status  Type    Build  Protocol  DC   Segment
58445389-5efd-4993-81fe-f2fe7c2d67de  10.1.1.101:8301  alive   server  1.3.0  2         dc1  <all>
a50377a6-cb4a-e131-848c-d87f014226e4  10.1.1.102:8301  alive   server  1.3.0  2         dc1  <all>
aa5ada96-e11d-e64a-ed75-a99964e574c8  10.1.1.99:8301   alive   server  1.3.0  2         dc1  <all>
ab9c81e1-2c00-c66e-9577-d38f2026a0a0  10.1.1.103:8301  alive   server  1.3.0  2         dc1  <all>
c8d4dddc-98d6-ccfb-ec43-a8ad294d6903  10.1.1.100:8301  alive   server  1.3.0  2         dc1  <all>
038e8b85-9759-6f09-cefd-98943f2b33d6  10.1.1.105:8301  alive   client  1.3.0  2         dc1  <default>
06240bbb-7e53-47ce-b0d7-b892806a3f4f  10.1.1.108:8301  alive   client  1.3.0  2         dc1  <default>

Disabling the firewall for 1 of the clients that had successfully joined the cluster:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ itt inst disable-firewall 06240bbb-7e53-47ce-b0d7-b892806a3f4f                     master ✖ ✱ ◼
Disabling firewall for instance "06240bbb-7e53-47ce-b0d7-b892806a3f4f"

After that, all the other members are able to join the cluster:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ consul members                                                                     master ✖ ✱ ◼
Node                                  Address          Status  Type    Build  Protocol  DC   Segment
58445389-5efd-4993-81fe-f2fe7c2d67de  10.1.1.101:8301  alive   server  1.3.0  2         dc1  <all>
a50377a6-cb4a-e131-848c-d87f014226e4  10.1.1.102:8301  alive   server  1.3.0  2         dc1  <all>
aa5ada96-e11d-e64a-ed75-a99964e574c8  10.1.1.99:8301   alive   server  1.3.0  2         dc1  <all>
ab9c81e1-2c00-c66e-9577-d38f2026a0a0  10.1.1.103:8301  alive   server  1.3.0  2         dc1  <all>
c8d4dddc-98d6-ccfb-ec43-a8ad294d6903  10.1.1.100:8301  alive   server  1.3.0  2         dc1  <all>
038e8b85-9759-6f09-cefd-98943f2b33d6  10.1.1.105:8301  alive   client  1.3.0  2         dc1  <default>
06240bbb-7e53-47ce-b0d7-b892806a3f4f  10.1.1.108:8301  alive   client  1.3.0  2         dc1  <default>
803cc5d6-dc37-cb54-ed0d-a05a795684fa  10.1.1.106:8301  alive   client  1.3.0  2         dc1  <default>
9b11635d-453e-6ad4-f525-cf504ae5a541  10.1.1.107:8301  alive   client  1.3.0  2         dc1  <default>
f5838f85-a14f-cff1-c0c0-b55acec30eed  10.1.1.104:8301  alive   client  1.3.0  2         dc1  <default>
Smithx10 commented 5 years ago

All gossip for these nodes runs over port 8301.

ipfstat -t -P udp -G 9b11635d-453e-6ad4-f525-cf504ae5a541

Src: 0.0.0.0, Dest: 0.0.0.0, Proto: udp, Sorted by: # bytes
Source IP             Destination IP         ST   PR   #pkts    #bytes       ttl
10.1.1.107,8301       10.1.1.99,8301        0/0  udp      74     11032      1:59
10.1.1.107,8301       10.1.1.108,8301       0/0  udp      75     10888      1:57
10.1.1.107,8301       10.1.1.100,8301       0/0  udp      73     10487      0:10
10.1.1.107,8301       10.1.1.101,8301       0/0  udp      72     10332      0:03
10.1.1.107,8301       10.1.1.106,8301       0/0  udp      71      9591      0:06
10.1.1.107,8301       10.1.1.104,8301       0/0  udp      58      7565      1:54
10.1.1.107,8301       10.1.1.103,8301       0/0  udp      30      3938      1:59
10.1.1.107,8301       10.1.1.105,8301       0/0  udp      14      1841      1:54
10.1.1.107,8301       10.1.1.102,8301       0/0  udp       1        83      0:12
Smithx10 commented 5 years ago

At this point I had to redeploy the same configuration as before, since I had worked around the problem by disabling the firewall.

Here is an updated list of addresses / instances:

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/rethinkdb ❯❯❯ itt ls -l  | grep 'rethinkdb\|consul'                                                                                  ✘ 130 master ✖ ✱ ◼
2c70e9f0-fcdb-6634-aa51-e8d8ce367a5d  consul-554bbdbfb5-pfd5s   img-consul-k8s@1547829591        lx      sample-512M  running  F      10.45.137.29   2019-01-18T22:37:13.876Z
01cfcf81-b2ac-6251-9c48-c2821449234e  consul-554bbdbfb5-876r2   img-consul-k8s@1547829591        lx      sample-512M  running  F      10.45.137.11   2019-01-18T22:37:16.550Z
841fc84f-1a39-4148-b923-8138b290b84d  consul-554bbdbfb5-ztcq9   img-consul-k8s@1547829591        lx      sample-512M  running  F      10.45.137.27   2019-01-18T22:37:19.062Z
181ad4f2-4bef-6536-e821-9da8923f164b  consul-554bbdbfb5-r8h2j   img-consul-k8s@1547829591        lx      sample-512M  running  F      10.45.137.25   2019-01-18T22:37:23.129Z
1171054c-f076-c128-9415-f1595822509f  consul-554bbdbfb5-gd95b   img-consul-k8s@1547829591        lx      sample-512M  running  F      10.45.137.28   2019-01-18T22:37:24.393Z
e5833a95-4c29-cf4d-cfa4-887d2ddc6514  rethinkdb-66744f7d-f9cl7  img-rethinkdb-master@1547844250  lx      sample-512M  running  F      10.45.137.22   2019-01-18T22:38:18.924Z
930436bf-5dac-648b-c406-9fa784bdec5f  rethinkdb-66744f7d-zjgpg  img-rethinkdb-master@1547844250  lx      sample-512M  running  F      10.45.137.21   2019-01-18T22:38:22.228Z
d5b48f22-db56-4e22-a744-c6df3255d3ab  rethinkdb-66744f7d-mjnjg  img-rethinkdb-master@1547844250  lx      sample-512M  running  F      10.45.137.20   2019-01-18T22:38:25.932Z
63c85685-9eb0-eb14-a707-84febcda7769  rethinkdb-66744f7d-sm6hl  img-rethinkdb-master@1547844250  lx      sample-512M  running  F      10.45.137.24   2019-01-18T22:38:30.034Z
419bc275-4306-42ca-fc40-c06a21ee86be  rethinkdb-66744f7d-z5qvr  img-rethinkdb-master@1547844250  lx      sample-512M  running  F      10.45.137.26   2019-01-18T22:38:31.871Z

ubuntu@f1573f74-23e2-682f-8f96-c1e5f8bc3a35 /g/t/s/consul ❯❯❯ consul members                                                                                                                  master ✖ ✱ ◼
Node                                  Address          Status  Type    Build  Protocol  DC   Segment
01cfcf81-b2ac-6251-9c48-c2821449234e  10.1.1.109:8301  alive   server  1.3.0  2         dc1  <all>
1171054c-f076-c128-9415-f1595822509f  10.1.1.113:8301  alive   server  1.3.0  2         dc1  <all>
181ad4f2-4bef-6536-e821-9da8923f164b  10.1.1.111:8301  alive   server  1.3.0  2         dc1  <all>
2c70e9f0-fcdb-6634-aa51-e8d8ce367a5d  10.1.1.110:8301  alive   server  1.3.0  2         dc1  <all>
841fc84f-1a39-4148-b923-8138b290b84d  10.1.1.112:8301  alive   server  1.3.0  2         dc1  <all>
419bc275-4306-42ca-fc40-c06a21ee86be  10.1.1.117:8301  alive   client  1.3.0  2         dc1  <default>

Working Client Rules:

[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# ipfstat -nio -G 419bc275-4306-42ca-fc40-c06a21ee86be
@1 pass out quick proto tcp from any to any flags S/SA keep state
@2 pass out proto tcp from any to any
@3 pass out proto udp from any to any keep state
@4 pass out quick proto icmp from any to any keep state
@5 pass out proto icmp from any to any
@1 pass in quick proto icmp from any to any keep frags
@2 pass in quick proto tcp from 10.1.1.109/32 to any keep frags
@3 pass in quick proto tcp from 10.45.137.11/32 to any keep frags
@4 pass in quick proto tcp from 10.1.1.110/32 to any keep frags
@5 pass in quick proto tcp from 10.45.137.29/32 to any keep frags
@6 pass in quick proto tcp from 10.1.1.111/32 to any keep frags
@7 pass in quick proto tcp from 10.45.137.25/32 to any keep frags
@8 pass in quick proto tcp from 10.1.1.112/32 to any keep frags
@9 pass in quick proto tcp from 10.45.137.27/32 to any keep frags
@10 pass in quick proto tcp from 10.1.1.113/32 to any keep frags
@11 pass in quick proto tcp from 10.45.137.28/32 to any keep frags
@12 pass in quick proto tcp from 10.1.1.114/32 to any keep frags
@13 pass in quick proto tcp from 10.45.137.21/32 to any keep frags
@14 pass in quick proto tcp from 10.1.1.115/32 to any keep frags
@15 pass in quick proto tcp from 10.45.137.20/32 to any keep frags
@16 pass in quick proto tcp from 10.1.1.116/32 to any keep frags
@17 pass in quick proto tcp from 10.45.137.24/32 to any keep frags
@18 pass in quick proto tcp from 10.1.1.117/32 to any keep frags
@19 pass in quick proto tcp from 10.45.137.26/32 to any keep frags
@20 pass in quick proto tcp from 10.1.1.118/32 to any keep frags
@21 pass in quick proto tcp from 10.45.137.22/32 to any keep frags
@22 pass in quick proto tcp from any to any port = ssh keep frags
@23 pass in quick proto tcp from any to any port = http-alt keep frags
@24 pass in quick proto tcp from any to any port = 28015 keep frags
@25 pass in quick proto udp from 10.1.1.109/32 to any keep frags
@26 pass in quick proto udp from 10.45.137.11/32 to any keep frags
@27 pass in quick proto udp from 10.1.1.110/32 to any keep frags
@28 pass in quick proto udp from 10.45.137.29/32 to any keep frags
@29 pass in quick proto udp from 10.1.1.111/32 to any keep frags
@30 pass in quick proto udp from 10.45.137.25/32 to any keep frags
@31 pass in quick proto udp from 10.1.1.112/32 to any keep frags
@32 pass in quick proto udp from 10.45.137.27/32 to any keep frags
@33 pass in quick proto udp from 10.1.1.113/32 to any keep frags
@34 pass in quick proto udp from 10.45.137.28/32 to any keep frags
@35 pass in quick proto udp from 10.1.1.114/32 to any keep frags
@36 pass in quick proto udp from 10.45.137.21/32 to any keep frags
@37 pass in quick proto udp from 10.1.1.115/32 to any keep frags
@38 pass in quick proto udp from 10.45.137.20/32 to any keep frags
@39 pass in quick proto udp from 10.1.1.116/32 to any keep frags
@40 pass in quick proto udp from 10.45.137.24/32 to any keep frags
@41 pass in quick proto udp from 10.1.1.117/32 to any keep frags
@42 pass in quick proto udp from 10.45.137.26/32 to any keep frags
@43 pass in quick proto udp from 10.1.1.118/32 to any keep frags
@44 pass in quick proto udp from 10.45.137.22/32 to any keep frags
@45 block in all

Non-Working Client Rules:

[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# ipfstat -nio -G 63c85685-9eb0-eb14-a707-84febcda7769
@1 pass out quick proto tcp from any to any flags S/SA keep state
@2 pass out proto tcp from any to any
@3 pass out proto udp from any to any keep state
@4 pass out quick proto icmp from any to any keep state
@5 pass out proto icmp from any to any
@1 pass in quick proto icmp from any to any keep frags
@2 pass in quick proto tcp from 10.1.1.109/32 to any keep frags
@3 pass in quick proto tcp from 10.45.137.11/32 to any keep frags
@4 pass in quick proto tcp from 10.1.1.110/32 to any keep frags
@5 pass in quick proto tcp from 10.45.137.29/32 to any keep frags
@6 pass in quick proto tcp from 10.1.1.111/32 to any keep frags
@7 pass in quick proto tcp from 10.45.137.25/32 to any keep frags
@8 pass in quick proto tcp from 10.1.1.112/32 to any keep frags
@9 pass in quick proto tcp from 10.45.137.27/32 to any keep frags
@10 pass in quick proto tcp from 10.1.1.113/32 to any keep frags
@11 pass in quick proto tcp from 10.45.137.28/32 to any keep frags
@12 pass in quick proto tcp from 10.1.1.114/32 to any keep frags
@13 pass in quick proto tcp from 10.45.137.21/32 to any keep frags
@14 pass in quick proto tcp from 10.1.1.115/32 to any keep frags
@15 pass in quick proto tcp from 10.45.137.20/32 to any keep frags
@16 pass in quick proto tcp from 10.1.1.116/32 to any keep frags
@17 pass in quick proto tcp from 10.45.137.24/32 to any keep frags
@18 pass in quick proto tcp from 10.1.1.117/32 to any keep frags
@19 pass in quick proto tcp from 10.45.137.26/32 to any keep frags
@20 pass in quick proto tcp from 10.1.1.118/32 to any keep frags
@21 pass in quick proto tcp from 10.45.137.22/32 to any keep frags
@22 pass in quick proto tcp from any to any port = ssh keep frags
@23 pass in quick proto tcp from any to any port = http-alt keep frags
@24 pass in quick proto tcp from any to any port = 28015 keep frags
@25 pass in quick proto udp from 10.1.1.109/32 to any keep frags
@26 pass in quick proto udp from 10.45.137.11/32 to any keep frags
@27 pass in quick proto udp from 10.1.1.110/32 to any keep frags
@28 pass in quick proto udp from 10.45.137.29/32 to any keep frags
@29 pass in quick proto udp from 10.1.1.111/32 to any keep frags
@30 pass in quick proto udp from 10.45.137.25/32 to any keep frags
@31 pass in quick proto udp from 10.1.1.112/32 to any keep frags
@32 pass in quick proto udp from 10.45.137.27/32 to any keep frags
@33 pass in quick proto udp from 10.1.1.113/32 to any keep frags
@34 pass in quick proto udp from 10.45.137.28/32 to any keep frags
@35 pass in quick proto udp from 10.1.1.114/32 to any keep frags
@36 pass in quick proto udp from 10.45.137.21/32 to any keep frags
@37 pass in quick proto udp from 10.1.1.115/32 to any keep frags
@38 pass in quick proto udp from 10.45.137.20/32 to any keep frags
@39 pass in quick proto udp from 10.1.1.116/32 to any keep frags
@40 pass in quick proto udp from 10.45.137.24/32 to any keep frags
@41 pass in quick proto udp from 10.1.1.117/32 to any keep frags
@42 pass in quick proto udp from 10.45.137.26/32 to any keep frags
@43 pass in quick proto udp from 10.1.1.118/32 to any keep frags
@44 pass in quick proto udp from 10.45.137.22/32 to any keep frags
@45 block in all
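The working and non-working rule sets above look identical; a quick sanity check from the CN (a sketch using bash process substitution and the same zone UUIDs as above) would be:

# empty diff output would confirm the loaded ipf rules are the same,
# pointing at state handling rather than rule content
diff <(ipfstat -nio -G 419bc275-4306-42ca-fc40-c06a21ee86be) \
     <(ipfstat -nio -G 63c85685-9eb0-eb14-a707-84febcda7769)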
Smithx10 commented 5 years ago

Snoop from the working and non-working zones:

[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# snoop -z 419bc275-4306-42ca-fc40-c06a21ee86be -d eth1
Using device eth1 (promiscuous mode)
  10.1.1.117 -> 10.1.1.109   UDP D=8301 S=8301 LEN=63
  10.1.1.109 -> 10.1.1.117   UDP D=8301 S=8301 LEN=160
  10.1.1.117 -> *            ARP C Who is 10.1.1.112, 10.1.1.112 ?
  10.1.1.112 -> 10.1.1.117   ARP R 10.1.1.112, 10.1.1.112 is 90:b8:d0:39:6a:59
  10.1.1.117 -> 10.1.1.111   UDP D=8301 S=8301 LEN=63
  10.1.1.112 -> *            ARP C Who is 10.1.1.117, 10.1.1.117 ?
  10.1.1.117 -> 10.1.1.112   ARP R 10.1.1.117, 10.1.1.117 is 90:b8:d0:49:76:5
  10.1.1.111 -> 10.1.1.117   UDP D=8301 S=8301 LEN=160
  10.1.1.117 -> 10.1.1.110   TCP D=8300 S=56064 Push Ack=379882189 Seq=2261364876 Len=12 Win=33000 Options=<nop,nop,tstamp 445712767 445707768>
  10.1.1.117 -> 10.1.1.110   TCP D=8300 S=56064 Push Ack=379882189 Seq=2261364888 Len=338 Win=33000 Options=<nop,nop,tstamp 445712767 445707768>
  10.1.1.110 -> 10.1.1.117   TCP D=56064 S=8300 Ack=2261365226 Seq=379882189 Len=0 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
  10.1.1.110 -> 10.1.1.117   TCP D=56064 S=8300 Push Ack=2261365226 Seq=379882189 Len=12 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
  10.1.1.110 -> 10.1.1.117   TCP D=56064 S=8300 Push Ack=2261365226 Seq=379882201 Len=1491 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
  10.1.1.117 -> 10.1.1.110   TCP D=8300 S=56064 Ack=379883692 Seq=2261365226 Len=0 Win=33000 Options=<nop,nop,tstamp 445712767 445712767>
  10.1.1.109 -> 10.1.1.117   UDP D=8301 S=8301 LEN=63
  10.1.1.117 -> 10.1.1.109   UDP D=8301 S=8301 LEN=160
^C[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# snoop -z 63c85685-9eb0-eb14-a707-84febcda7769 -d eth1
Using device eth1 (promiscuous mode)
  10.1.1.113 -> 10.1.1.117   ARP R 10.1.1.113, 10.1.1.113 is 90:b8:d0:cd:5d:f6

  10.1.1.116 -> 10.1.1.111   TCP D=8301 S=43198 Syn Seq=3786205279 Len=0 Win=32782 Options=<mss 8460,sackOK,tstamp 445737587 0,nop,wscale 5>
  10.1.1.117 -> 10.1.1.109   ARP R 10.1.1.117, 10.1.1.117 is 90:b8:d0:49:76:5
  10.1.1.109 -> 10.1.1.117   ARP R 10.1.1.109, 10.1.1.109 is 90:b8:d0:c2:f:48
  10.1.1.113 -> 10.1.1.115   ARP R 10.1.1.113, 10.1.1.113 is 90:b8:d0:cd:5d:f6
  10.1.1.116 -> 10.1.1.111   TCP D=8301 S=43198 Syn Seq=3786205279 Len=0 Win=32782 Options=<mss 8460,sackOK,tstamp 445738717 0,nop,wscale 5>
  10.1.1.112 -> 10.1.1.114   ARP R 10.1.1.112, 10.1.1.112 is 90:b8:d0:39:6a:59
  10.1.1.116 -> 10.1.1.111   TCP D=8301 S=43198 Syn Seq=3786205279 Len=0 Win=32782 Options=<mss 8460,sackOK,tstamp 445740977 0,nop,wscale 5>
  10.1.1.116 -> *            ARP C Who is 10.1.1.111, 10.1.1.111 ?
  10.1.1.111 -> 10.1.1.116   ARP R 10.1.1.111, 10.1.1.111 is 90:b8:d0:bf:7f:eb
Smithx10 commented 5 years ago

dig for the CNAME used to join:

[root@63c85685-9eb0-eb14-a707-84febcda7769 ~]# dig consul.consul.svc.cloudops-dev.us-east-1.cns.cloud.iqvia.net +short 
10.1.1.110
10.1.1.112
10.1.1.113
10.1.1.111
10.1.1.109
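For reference, the agents join through that CNS name; the consul-agent job started by containerpilot is presumably equivalent to something like the following (a sketch; the real invocation isn't shown in this issue):

# hypothetical consul client start, joining via the CNS name resolved above
consul agent -data-dir=/var/lib/consul \
    -retry-join 'consul.consul.svc.cloudops-dev.us-east-1.cns.cloud.iqvia.net'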

Restart of a node that has not joined:

[root@63c85685-9eb0-eb14-a707-84febcda7769 ~]# systemctl restart containerpilot && journalctl -u containerpilot -fl
-- Logs begin at Fri 2019-01-18 22:38:42 UTC. --
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.514035849Z consul-agent 854829     2019/01/18 23:41:46 [INFO] agent: Waiting for endpoints to shut down
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.514063947Z consul-agent 854829     2019/01/18 23:41:46 [INFO] agent: Endpoints down
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.514085208Z consul-agent 854829     2019/01/18 23:41:46 [INFO] agent: Exit code: 0
Jan 18 23:41:46 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:46.51426164Z preStop 224222 Graceful leave complete
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540190595Z killing processes for job "preStart"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.54025145Z killing processes for job "rethinkdb"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540271279Z killing processes for job "rethinkdb-ui"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.54028618Z killing processes for job "consul-agent"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540302616Z killing processes for job "preStop"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[854799]: 2019-01-18T23:41:51.540319625Z killing processes for job "rethinkdb-onchange"
Jan 18 23:41:51 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:51.623788957Z control: serving at /var/run/containerpilot.socket
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.017980378Z consul-agent 226093 ==> Starting Consul agent...
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06214637Z consul-agent 226093 ==> Consul agent running!
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062175436Z consul-agent 226093            Version: 'v1.3.0'
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062209776Z consul-agent 226093            Node ID: 'c14b814a-faae-be15-f0d7-e56fb44d6156'
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062256946Z consul-agent 226093          Node name: '63c85685-9eb0-eb14-a707-84febcda7769'
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06227124Z consul-agent 226093         Datacenter: 'dc1' (Segment: '')
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062284342Z consul-agent 226093             Server: false (Bootstrap: false)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06232694Z consul-agent 226093        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 53)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062340521Z consul-agent 226093       Cluster Addr: 10.1.1.116 (LAN: 8301, WAN: 8302)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06235274Z consul-agent 226093            Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062364678Z consul-agent 226093
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.06237689Z consul-agent 226093 ==> Log data will now stream in as it occurs:
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062399689Z consul-agent 226093
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062411497Z consul-agent 226093     2019/01/18 23:41:52 [INFO] serf: EventMemberJoin: 63c85685-9eb0-eb14-a707-84febcda7769 10.1.1.116
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062425173Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: Started DNS server 0.0.0.0:53 (udp)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062437048Z consul-agent 226093     2019/01/18 23:41:52 [WARN] agent/proxy: running as root, will not start managed proxies
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062455592Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: Started DNS server 0.0.0.0:53 (tcp)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062472319Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062496475Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: started state syncer
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062567451Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s os packet scaleway softlayer triton vsphere
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062641034Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: Joining LAN cluster...
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.062664965Z consul-agent 226093     2019/01/18 23:41:52 [INFO] agent: (LAN) joining: [consul.consul.svc.cloudops-dev.us-east-1.cns.cloud.iqvia.net]
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.069657015Z consul-agent 226093     2019/01/18 23:41:52 [WARN] manager: No servers available
Jan 18 23:41:52 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:52.069674607Z consul-agent 226093     2019/01/18 23:41:52 [ERR] agent: failed to sync remote state: No known Consul servers
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.642890839Z consul-agent 226093     2019/01/18 23:41:56 [WARN] manager: No servers available
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.642956256Z consul-agent 226093     2019/01/18 23:41:56 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1, error: No known Consul servers from=[::1]:49748
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.644287337Z failed to query rethinkdb: Unexpected response code: 500 (No known Consul servers) [<nil>]
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.693320449Z preStart 227178     2019-01-18 23:41:56 preStart: /var/lib/rethinkdb/data has data, skipping database init
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.700411971Z preStart 227178     2019-01-18 23:41:56 preStart: Rendering consul template
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.770806307Z consul-agent 226093     2019/01/18 23:41:56 [WARN] manager: No servers available
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.821432005Z consul-agent 226093     2019/01/18 23:41:56 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1&stale=&wait=60000ms, error: No known Consul servers from=127.0.0.1:63644
Jan 18 23:41:56 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:56.821547693Z preStart 227178 2019/01/18 23:41:56.817312 [WARN] (view) health.service(rethinkdb|passing): Unexpected response code: 500 (No known Consul servers) (retry attempt 1 after "250ms")
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.068086202Z consul-agent 226093     2019/01/18 23:41:57 [WARN] manager: No servers available
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.068121457Z consul-agent 226093     2019/01/18 23:41:57 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1&stale=&wait=60000ms, error: No known Consul servers from=127.0.0.1:63644
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.070205767Z preStart 227178 2019/01/18 23:41:57.070156 [WARN] (view) health.service(rethinkdb|passing): Unexpected response code: 500 (No known Consul servers) (retry attempt 2 after "500ms")
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.572064034Z consul-agent 226093     2019/01/18 23:41:57 [WARN] manager: No servers available
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.572104351Z consul-agent 226093     2019/01/18 23:41:57 [ERR] http: Request GET /v1/health/service/rethinkdb?passing=1&stale=&wait=60000ms, error: No known Consul servers from=127.0.0.1:63644
Jan 18 23:41:57 63c85685-9eb0-eb14-a707-84febcda7769 containerpilot[226067]: 2019-01-18T23:41:57.57300455Z preStart 227178 2019/01/18 23:41:57.572951 [WARN] (view) health.service(rethinkdb|passing): Unexpected response code: 500 (No known Consul servers) (retry attempt 3 after "1s")
Smithx10 commented 5 years ago

Just hit this issue again while using Triton, this time with TCP and SSH.

Platform

[root@f8-f2-1e-3b-09-c4 (us-east-1) ~]# uname -a
SunOS f8-f2-1e-3b-09-c4 5.11 joyent_20190314T022529Z i86pc i386 i86pc

Example

arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt ls -l                                                                                                                                                                                                                              master ◼
ID                                    NAME                     IMG                BRAND  PACKAGE      STATE    FLAGS  PRIMARYIP      CREATED
bdf777bf-f50a-e420-af1c-d3df05cf4317  job-group-task-c4f58d86  centos-7@20180323  lx     sample-256M  running  F      10.45.136.227  2019-04-01T02:14:27.731Z
a2771ed9-5d7f-e142-ec32-c17e1cf9fee7  job-group-task-b2bba527  centos-7@20180323  lx     sample-256M  running  F      10.45.136.239  2019-04-01T02:14:30.376Z
d524c063-4be8-eb16-e3e0-9a8c6b904e90  job-group-task-affc51c3  centos-7@20180323  lx     sample-256M  running  F      10.45.136.96   2019-04-01T02:14:34.216Z
56e54750-b241-6568-cc5a-f02d5e914b26  job-group-task-5dfb4c9f  centos-7@20180323  lx     sample-256M  running  F      10.45.136.177  2019-04-01T02:15:06.994Z
8dc58f90-afbb-48bf-e4c4-80c2350efcc3  job-group-task-7459d4f8  centos-7@20180323  lx     sample-256M  running  F      10.45.136.237  2019-04-01T02:15:10.770Z
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt fwrules                                                                                                                                                                                                                            master ◼
SHORTID   ENABLED  GLOBAL  RULE
d8ddb2fb  true     -       FROM any TO tag "fwtag" ALLOW tcp (PORT 22 AND PORT 8080)
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt inst get job-group-task-b2bba527 | jq .tags                                                                                                                                                                                        master ◼
{
  "fwtag": "true",
  "triton.cns.services": "rawrsauce"
}
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ nmap -p 22 10.45.136.239 -Pn                                                                                                                                                                                                          master ◼
Starting Nmap 7.70 ( https://nmap.org ) at 2019-03-31 22:22 EDT
Nmap scan report for job-group-task-b2bba527.inst.bruce-dev.us-east-1.bdf-cloud.iqvia.net (10.45.136.239)
Host is up.

PORT   STATE    SERVICE
22/tcp filtered ssh

Nmap done: 1 IP address (1 host up) scanned in 2.04 seconds
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ tt inst get job-group-task-c4f58d86 | jq .tags                                                                                                                                                                                        master ◼
{
  "fwtag": "true",
  "triton.cns.services": "rawrsauce"
}
arch@archlinux ~/g/s/g/nomad-experiment ❯❯❯ nmap -p 22 10.45.136.227 -Pn                                                                                                                                                                                                          master ◼
Starting Nmap 7.70 ( https://nmap.org ) at 2019-03-31 22:23 EDT
Nmap scan report for job-group-task-c4f58d86.inst.bruce-dev.us-east-1.bdf-cloud.iqvia.net (10.45.136.227)
Host is up (0.00019s latency).

PORT   STATE SERVICE
22/tcp open  ssh

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
Smithx10 commented 5 years ago

Enabling and disabling the firewall through Triton / the CN clears up the non-working firewall rule.
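A sketch of that workaround, using the node-triton command already used earlier in this issue and what I assume is the equivalent vmadm property on the CN (instance UUID taken from the "filtered" example above):

# from the client: toggle the instance firewall off and back on
triton instance disable-firewall a2771ed9-5d7f-e142-ec32-c17e1cf9fee7
triton instance enable-firewall a2771ed9-5d7f-e142-ec32-c17e1cf9fee7

# or, assumed equivalent, directly on the CN
vmadm update a2771ed9-5d7f-e142-ec32-c17e1cf9fee7 firewall_enabled=false
vmadm update a2771ed9-5d7f-e142-ec32-c17e1cf9fee7 firewall_enabled=true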

danmcd commented 5 years ago

Dumb question: are these zones native or LX? If they are LX, there are some additional tests that MIGHT be able to be run.

Smithx10 commented 5 years ago

@danmcd I was using lx with the following images:

7b5981c4-1889-11e7-b4c5-3f3bdfc9b88b
3dbbdcca-2eab-11e8-b925-23bf77789921

To add some context, I am deploying all of these asynchronously, so the requests are coming into CloudAPI at the same time. Not sure if there is a race or something.

danmcd commented 5 years ago

First off, thanks for the information that it's LX; that's useful. Doing a bit of diving (after knowing it's LX), however, it's not clear to me whether or not there's a way to find the race easily.

Smithx10 commented 5 years ago

@danmcd I am able to reproduce this today 3 times in a row, let me know if you want me to grab some state or more information.

danmcd commented 5 years ago

The large amount of detail up top makes it hard for me to understand the exact problem. I saw the snoops above, and the non-working one is sending packets that appear never to reach the peer (assuming your snoops are correct). Make sure you do single pings in both directions, and that you do a single TCP connection in both directions while snooping. That'll help narrow things down a lot.
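Something along these lines would do (a sketch, reusing the zone UUIDs and 10.1.1.x addresses from your earlier snoops; run the snoop in the global zone while you generate traffic from inside each zone):

# global zone on the CN: watch the suspect zone's traffic on port 8301 plus ICMP
snoop -z 63c85685-9eb0-eb14-a707-84febcda7769 -d eth1 port 8301 or icmp

# from the working zone (10.1.1.117) toward the non-working one (10.1.1.116),
# then repeat in the other direction
ping -c 1 10.1.1.116
nc -vz 10.1.1.116 8301    # single TCP connection attempt, if nc is available in the LX zone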

One thing I noticed was a CFW rule containing "(PORT 22 AND PORT 8080)". AIUI, this means the TCP traffic must contain both port 22 AND port 8080. Am I wrong? (I don't know CFW that well, it's higher-level than where I normally hang out.)

Smithx10 commented 5 years ago

I'll try to clear up the scenario a bit.

Currently, in the environment where this behavior is occurring, I only have 2 CNs. From what I've gathered, most of the time the provisioned instances land on the same CN, but not always.

I am provisioning all the instances at the same time with CloudAPI, using instance tags for the firewall rules.

When the instances came up, some of them honored the FW rule and some didn't. What is strange is that by just disabling the firewall on 1 of the instances, all of the other instances started to honor the rule and started passing traffic.

If I disable and re-enable the FW rule they all start working fine as well, so I don't think it's the way the rule is written.

I'll attempt doing this in the morning, 1 provision at a time, and see if it happens.

danmcd commented 5 years ago

"What is strange, is that by just disabling the firewall on 1 of the instances, all of the other instances started to honor the rule and started passing traffic."

I'm guessing fwadm may do a brute-force reset of some kind. I'm very curious if there's a way to follow the bouncing packet in a zone whose fw rules appear to be in place, but aren't, per your description earlier. (I'm happy to help with this, but it'll require global-zone dtrace access on the CN with the faulty VM.)
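As a very rough starting point (a sketch, assuming the ipfilter kernel module shows up to the fbt provider as "ipf"), something like this from the global zone would at least show which ipf functions fire while you reproduce the problem:

# count entries into ipfilter kernel functions while the failure is reproduced
dtrace -n 'fbt:ipf::entry { @[probefunc] = count(); }'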