edgelesssys / contrast

Deploy and manage confidential containers on Kubernetes
https://docs.edgeless.systems/contrast
GNU Affero General Public License v3.0

cli: wait 180s for the coordinator on `contrast set` #544

Closed blenessy closed 3 weeks ago

blenessy commented 3 weeks ago

I ran a number of tests, and it often takes more than 30s before the coordinator is routable after a fresh deployment, even once the service has a valid public IP.

It is best to set a conservative timeout here; otherwise non-interactive use cases will certainly run into this.
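For non-interactive callers that cannot rely on the CLI's built-in wait, the same idea can be approximated from the outside. The following is only a minimal sketch, not how `contrast` implements its timeout: `retry_until` is a hypothetical helper name, and the 180s deadline and 5s interval in the usage comment are assumptions chosen to match the value proposed in this PR.

```sh
#!/bin/sh
# retry_until DEADLINE_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of contrast): re-runs COMMAND until it
# succeeds or DEADLINE_SECONDS have elapsed since the first attempt.
retry_until() {
    deadline=$(( $(date +%s) + $1 ))
    interval=$2
    shift 2
    until "$@"; do
        # Give up once the deadline has passed; otherwise wait and retry.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "gave up waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Example (assumed values): retry `contrast set` for up to 180s, every 5s.
# retry_until 180 5 contrast set -c "${coordinator:?}:1313" deployment/
```

The deadline is checked only after a failed attempt, so the command always runs at least once, and a single slow attempt can overrun the deadline by its own duration.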

This fixes #517.

Testing

Tested in westeurope. Stress-tested with this script (based on burgerdev's instructions).

#!/bin/sh

set -ex

mkdir -p deployment

for _ in $(seq 20); do
    # clean up previous iteration
    kubectl delete -f deployment/coordinator.yml || :
    # remove state created by contrast
    rm -rf -- verify *.sha256 *.json *.pem *.rego deployment/*
    cp coordinator.yml deployment/
    contrast --log-level debug generate deployment/
    kubectl apply -f deployment/coordinator.yml
    while sleep 1; do # wait for IP
        coordinator=$(kubectl get svc coordinator -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
        [ "$coordinator" = "" ] || break
    done
    contrast --log-level debug set \
        --coordinator-policy-hash=443e4c9cd765fb8535c485ca9392c47b0135b47d3ab633062760e25499231108 \
        -c "${coordinator:?}:1313" deployment/
done

Attaching part of the test log (15 iterations) for more details. It clearly shows that the time until the coordinator becomes ready varies widely. test.log
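To put numbers on that spread without reading the debug log by hand, one option is to time a probe in each iteration. This is a rough sketch under assumptions: `seconds_until` is a hypothetical helper, and the `nc -z` probe in the usage comment assumes a netcat build that supports `-z`; it was not part of the test above.

```sh
#!/bin/sh
# seconds_until COMMAND [ARGS...]
# Hypothetical helper: polls COMMAND once per second until it succeeds,
# then prints the elapsed time in whole seconds.
seconds_until() {
    start=$(date +%s)
    until "$@" >/dev/null 2>&1; do
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}

# Example (assumes nc with -z support): time until port 1313 accepts
# TCP connections on the coordinator's public IP.
# seconds_until nc -z "${coordinator:?}" 1313
```

Note that this helper has no upper bound and loops until the probe succeeds, so it is meant for measurement runs rather than for production scripts.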

burgerdev commented 3 weeks ago

Thanks for the PR, @blenessy! I reproduced the issue using your script, and increasing the timeout sounds like a good idea, at least as a mitigation. The root cause seems to be in Azure's LB rule propagation: changing the script to not delete the service allowed the coordinator to connect in a few seconds.