Closed: blenessy closed this 3 weeks ago
Thanks for the PR, @blenessy! I reproduced the issue using your script, and increasing the timeout sounds like a good idea, at least as a mitigation. The root cause seems to be in Azure's LB rule propagation: when I changed the script to not delete the service, the coordinator could connect within a few seconds.
I ran a number of tests, and it often takes more than 30s before the coordinator is routable after a fresh deployment (with a valid public IP already assigned).
It is best to set a conservative timeout here; otherwise, non-interactive use cases are bound to run into this.
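For illustration only, here is a minimal sketch of the kind of wait loop a non-interactive script could use while the coordinator is not yet routable. It is not the code changed in this PR: the function name `wait_until_reachable`, the plain TCP connect used as the readiness check, and the default values (180s overall, 2s between attempts) are all assumptions chosen for the example.

```python
import socket
import time


def wait_until_reachable(host, port, timeout=180.0, interval=2.0, connect_timeout=3.0):
    """Poll a TCP endpoint until it accepts a connection or the deadline passes.

    Hypothetical helper: the defaults are illustrative, not values from this PR.
    Returns True once a connection succeeds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful TCP connect is treated as "routable" in this sketch.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Connection refused, timed out, or not yet routable: retry below.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep between attempts, but never past the overall deadline.
        time.sleep(min(interval, remaining))
```

The overall deadline is tracked with a monotonic clock, so a slow individual attempt still counts against the total budget. Since readiness times varied widely in testing, a generous overall timeout with short retry intervals seems safer than a single long connect attempt.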
This fixes #517.
Testing
Tested in westeurope. Stress-tested with this script (based on burgerdev's instructions). Attaching part (15 iterations) of the test log for more details. You can clearly see that the time until the coordinator becomes ready varies widely. test.log