This is a simple proposal to fix the vMotion ping failure problem. The first few vMotion pings often fail on a cluster node that has been idle, but if you run the HXTool.py script again, the test passes. My assumption is that the ARP cache takes a moment to populate, and that the aggressive timeout and short ping count leave no room for it, so a packet or two gets dropped. The parser that checks the ping response expects 0% packet loss, so the test fails. Rather than rewrite the pingstatus() parser, it's easier to run the ping twice: once with a normal timeout and normal MTU to populate the ARP cache, then again as the existing test. That said, I haven't conclusively proven this is an ARP problem: esxcli network ip neighbor list doesn't show the vMotion interfaces or addresses for me (even in the default stack). The other potential solution is to remove the excessively low timeout, but I assume there was a reason for it (maybe to prune out long WAN links?).
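
For reviewers, here is a minimal sketch of the two-pass idea, not the actual change. The names packet_loss_percent, ping_with_warmup, make_fake_runner and the run callable are hypothetical and are not taken from HXTool.py. The ping commands are passed in as plain strings rather than guessed here, and the "N% packet loss" summary format is an assumption about the ping output. The fake runner only simulates the suspected cold-cache behaviour (first ping lossy, later pings clean) so the sketch can run anywhere.

```python
import re
from typing import Callable, Optional


def packet_loss_percent(ping_output: str) -> Optional[float]:
    """Return the packet-loss percentage from a ping summary, or None if absent.

    Assumes a summary line containing text such as "0% packet loss".
    """
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def ping_with_warmup(run: Callable[[str], str],
                     warmup_cmd: str,
                     strict_cmd: str) -> bool:
    """Run a relaxed warm-up ping, then the existing strict ping.

    `run` is a hypothetical callable that executes a command and returns
    its text output. Only the strict pass decides pass/fail.
    """
    # Warm-up pass: the result is deliberately ignored. Its only job is to
    # give address resolution time to complete before the strict test.
    run(warmup_cmd)
    # Strict pass: same criterion as before, 0% packet loss or fail.
    return packet_loss_percent(run(strict_cmd)) == 0.0


def make_fake_runner() -> Callable[[str], str]:
    """Simulate an idle node: the first ping drops a packet, later ones do not."""
    state = {"calls": 0}

    def run(cmd: str) -> str:
        state["calls"] += 1
        if state["calls"] == 1:
            return "3 packets transmitted, 2 packets received, 33% packet loss"
        return "3 packets transmitted, 3 packets received, 0% packet loss"

    return run


if __name__ == "__main__":
    # Placeholder command strings; the real ones live in HXTool.py.
    print(ping_with_warmup(make_fake_runner(), "warm-up ping", "strict ping"))
```

The point of this design is that the 0% loss check stays untouched: the warm-up pass absorbs any first-packet loss, and the strict pass still catches a link that genuinely drops packets.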