CiscoDevNet / Hyperflex-Hypercheck

Perform pro-active self checks on your Hyperflex cluster to ensure stability and resiliency
MIT License

vmkping check with -i doesn't work in rare situations #18

Closed ssurpurcisco closed 4 years ago

ssurpurcisco commented 4 years ago

[root@hci01p02rv:~] vmkping -I vmk2 -c 3 -d -s 8972 -i 0.05 10.100.208.3
PING 10.100.208.3 (10.100.208.3): 8972 data bytes

--- 10.100.208.3 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

The same command succeeds when retried right after the first failure.

This looks ARP-related (see https://communities.vmware.com/thread/473679). Could the "0.05" interval be removed, or changed to something like 1 second?

[root@hci01p02rv:~] vmkping -I vmk2 -c 3 -d -s 8972 -i 0.05 10.100.208.3
PING 10.100.208.3 (10.100.208.3): 8972 data bytes
8980 bytes from 10.100.208.3: icmp_seq=0 ttl=64 time=0.303 ms
8980 bytes from 10.100.208.3: icmp_seq=1 ttl=64 time=0.190 ms
8980 bytes from 10.100.208.3: icmp_seq=2 ttl=64 time=0.228 ms

--- 10.100.208.3 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.190/0.240/0.303 ms
[root@hci01p02rv:~]

jgc234 commented 4 years ago

I'm seeing a similar problem on multiple hosts (6.5U1 and 6.7U3). The vmkping commands that test vMotion network connectivity fail intermittently, yet when run again they consistently pass. The problem can be reproduced outside of the test script (as above).

The ARP theory sounds plausible: the interval timer is too short, so the ARP reply doesn't get back before the 3 packets at 0.05 seconds finish, and the test fails. However, I looked at the ARP tables on the ESXi host (esxcli network ip neighbor list) and waited until the destination entry expired (I also tried deleting the entries in another test), and the ping still worked fine fairly quickly. So I'm not sure how to reproduce the problem yet, other than waiting another day. If I wait a day and re-test, it seems to fail the first time only. I'll try to capture more data tomorrow with higher interval timer options and see whether that works around the problem.

From test output a few days back:

+--------------------------------------------------------+--------------------------+----------+
| vMotion Enabled                                        | PASS                     |          |
+--------------------------------------------------------+--------------------------+----------+
| vmkping -I vmk2 -c 3 -d -s 8972 -i 0.05 192.168.1.23   | FAIL                     |          |
+--------------------------------------------------------+--------------------------+----------+
| vmkping -I vmk2 -c 3 -d -s 8972 -i 0.05 192.168.1.22   | FAIL                     |          |
+--------------------------------------------------------+--------------------------+----------+

Today the first attempt fails, and the second and subsequent tests work fine:

user% ssh root@192.168.100.100 'vmkping -I vmk2 -c 3 -d -s 8972 -i 0.05 192.168.1.23'
PING 192.168.1.23 (192.168.1.23): 8972 data bytes

--- 192.168.1.23 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

user% ssh root@192.168.100.100 'vmkping -I vmk2 -c 3 -d -s 8972 -i 0.05 192.168.1.23'
PING 192.168.1.23 (192.168.1.23): 8972 data bytes
8980 bytes from 192.168.1.23: icmp_seq=0 ttl=64 time=0.307 ms
8980 bytes from 192.168.1.23: icmp_seq=1 ttl=64 time=0.146 ms
8980 bytes from 192.168.1.23: icmp_seq=2 ttl=64 time=0.135 ms

--- 192.168.1.23 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.135/0.196/0.307 ms

Also tried with 10,000 packets and not a single packet lost.
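The "first attempt fails, retry passes" pattern above could be tolerated by a check that reads the statistics line and accepts any attempt that received a reply. The following Python is only a sketch of that idea, not the Hypercheck implementation: the function names are hypothetical, the regular expression assumes the statistics line looks exactly like the vmkping output quoted in this thread, and the pass policy (at least one reply on any attempt) is an assumption.

```python
import re

# Matches the summary line as it appears in the vmkping output quoted in this
# thread, e.g. "3 packets transmitted, 0 packets received, 100% packet loss".
STATS_RE = re.compile(r"(\d+) packets transmitted, (\d+) packets received")


def parse_ping_stats(output):
    """Return (transmitted, received) from ping output, or None if not found."""
    match = STATS_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def check_passes(outputs):
    """Hypothetical policy: pass if any attempt received at least one reply.

    `outputs` is a list of captured outputs, one per attempt, in order.
    """
    for output in outputs:
        stats = parse_ping_stats(output)
        if stats is not None and stats[1] > 0:
            return True
    return False
```

Fed the two outputs above in order, `check_passes` would report a pass because the second attempt received replies. Whether a single reply is enough to call the vMotion path healthy is a judgment call for the check's maintainers.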

avshukla commented 4 years ago

Hi Jim,

Yes, we have noticed it too. We are looking at options to handle this condition and/or modify the check so that it does not raise a false alarm. This will be taken care of in the next update of the script on GitHub.

For now, if you hit this issue, please perform manual checks like you have done to isolate it further.

Regards, Avinash


jgc234 commented 4 years ago

I left it overnight and re-tested directly on the host with various options. The first few seconds fail, presumably while waiting for ARP to populate on both ends.

The current total time is far too short (3 x 0.05 = 0.15 seconds). The whole test is finished and fails before the first packet is sent.

If you raise either the count or the interval timer it should fix the problem.

[root@host-hx-1:~] vmkping -I vmk2 -c 10 -d -s 8972 -i 1 192.168.1.23
PING 192.168.1.23 (192.168.1.23): 8972 data bytes
8980 bytes from 192.168.1.23: icmp_seq=2 ttl=64 time=0.371 ms      <-- seq=2, first 2 responses not received in time
8980 bytes from 192.168.1.23: icmp_seq=3 ttl=64 time=0.265 ms
8980 bytes from 192.168.1.23: icmp_seq=4 ttl=64 time=0.268 ms
[snipped]

--- 192.168.1.23 ping statistics ---
10 packets transmitted, 8 packets received, 20% packet loss
round-trip min/avg/max = 0.248/0.275/0.371 ms

Testing on another address with an interval of 0.1 and count of 5, it only just worked with 1 packet being sent within that time.

[root@host-hx-1:~] vmkping -I vmk2 -c 5 -d -s 8972 -i 0.1 192.168.1.22
PING 192.168.1.22 (192.168.1.22): 8972 data bytes
8980 bytes from 192.168.1.22: icmp_seq=4 ttl=64 time=0.372 ms    <- seq=4, first 4 packets missing at 0.1s each

--- 192.168.1.22 ping statistics ---
5 packets transmitted, 1 packets received, 80% packet loss
round-trip min/avg/max = 0.372/0.372/0.372 ms

An immediate re-test on the same address works fine.

[root@host-hx-1:~] vmkping -I vmk2 -c 5 -d -s 8972 -i 0.1 192.168.1.22
PING 192.168.1.22 (192.168.1.22): 8972 data bytes
8980 bytes from 192.168.1.22: icmp_seq=0 ttl=64 time=0.230 ms
8980 bytes from 192.168.1.22: icmp_seq=1 ttl=64 time=0.187 ms
8980 bytes from 192.168.1.22: icmp_seq=2 ttl=64 time=0.139 ms
8980 bytes from 192.168.1.22: icmp_seq=3 ttl=64 time=0.135 ms
8980 bytes from 192.168.1.22: icmp_seq=4 ttl=64 time=0.207 ms

--- 192.168.1.22 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.135/0.180/0.230 ms
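The timing argument behind these runs can be written down as a small Python sketch. It uses the same rough estimate as above (count x interval) and assumes probe k is sent at k x interval, ignoring the time ping waits for the last reply. The function names are hypothetical, and the bounds are only an estimate of when the path became usable, not a measurement.

```python
def probe_window(count, interval):
    """Approximate time spent sending probes (the count x interval estimate)."""
    return count * interval


def first_reply_bounds(first_seq, interval):
    """Estimate when the path became usable, in seconds after the first probe.

    Assumes probe k is sent at k * interval. If probe first_seq - 1 was lost
    and probe first_seq was answered, the path became usable somewhere
    between those two send times.
    """
    lower = max(first_seq - 1, 0) * interval
    upper = first_seq * interval
    return lower, upper


if __name__ == "__main__":
    # Current check: 3 probes at 0.05 s apart.
    print(probe_window(3, 0.05))
    # Run with -i 1, first answered probe was icmp_seq=2.
    print(first_reply_bounds(2, 1))
    # Run with -i 0.1, first answered probe was icmp_seq=4.
    print(first_reply_bounds(4, 0.1))
```

Under these assumptions the first run became usable between 1 and 2 seconds in, and the second between roughly 0.3 and 0.4 seconds in. Both are longer than the roughly 0.15-second window of the current check, which is consistent with the first attempt failing.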

Removing the interval (-i option) should fix it (that gives 3 seconds). Looking at the code, someone previously commented that version out and added the -i 0.05 option, but I'm unsure whether that was to speed up the test or for some other reason.

Specifically, this only affects the vMotion test (I assume all other addresses are chatty), which is on line 1109 and was modified in commit d79ea32647e192214aebc7558efcc191218bbaff.

hsardana09 commented 4 years ago

The vMotion ping check has been removed because of multiple false positives. We now only verify that vMotion is enabled.