blenessy closed this 3 weeks ago
Hi @blenessy,
Thanks for reporting this issue!
Before we jump into the exceeded deadline, I'd like to point out that the TLS handshake is expected to take longer than usual because of the attestation work necessary for our aTLS. The Coordinator gathers attestation evidence and presents it during the server authentication part of the TLS handshake. I'll take a look at whether it actually has to take that long, but it is most likely not the source of the problem here.
The symptom of a deadline exceeded error that resolves itself after some time sounds like a load balancer not being ready. I just measured the time it took from LB IP assignment to traffic being routed and arrived at around 15s, which is on the same order of magnitude as our TLS handshake timeout (30s). On the other hand, I would only expect this to be a problem when scripted, not when pasting from the tutorial.
There might have been an issue with the evidence gathering, though. It would be interesting to observe this issue with debug logging on - you'd need to skip this step
kubectl apply -f https://github.com/edgelesssys/contrast/releases/latest/download/coordinator.yml
and instead:
1. Download the coordinator.yml file into the deployment/ folder.
2. Add the debug environment variables (CONTRAST_LOG_LEVEL=debug and CONTRAST_LOG_SUBSYSTEMS=*) to the coordinator container.
3. Run contrast generate deployment/
4. Run kubectl apply -f deployment/coordinator.yml
5. Continue with contrast set ...
Cheers, Markus
Sorry, I forgot to mention one detail: if you modify the coordinator resource like I suggested, its runtime policy is going to change, and the CLI will write something like this:
✔️ Generated workload policy annotations
✔️ Updated manifest manifest.json
time=2024-06-05T08:54:55.679+02:00 level=WARN msg="Found unexpected coordinator policy" name=coordinator hash=d3b11ada658908741a6afd0b2f05145749766d7802f4d782531f6f5c62a555a1
The unexpected hash needs to be passed to contrast set and contrast verify, otherwise they will reject the coordinator.
contrast set --coordinator-policy-hash=d3b11ada658908741a6afd0b2f05145749766d7802f4d782531f6f5c62a555a1 deployment/
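When scripting this, the hash could be pulled out of the warning line rather than copied by hand. The following is only a sketch: extract_policy_hash is a hypothetical helper, and it assumes the warning keeps the key=value format shown above with a 64-digit lowercase hex hash. Whether the CLI prints the warning on stdout or stderr is not something this thread confirms, so the example feeds the quoted line in directly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the contrast CLI): read log lines on stdin
# and print the hash from the first "Found unexpected coordinator policy" line.
# Assumption: the hash is 64 lowercase hex digits, as in the warning above.
extract_policy_hash() {
  sed -n 's/.*msg="Found unexpected coordinator policy".* hash=\([0-9a-f]\{64\}\).*/\1/p' | head -n 1
}

# Demo input: the warning line quoted in this thread.
log_line='time=2024-06-05T08:54:55.679+02:00 level=WARN msg="Found unexpected coordinator policy" name=coordinator hash=d3b11ada658908741a6afd0b2f05145749766d7802f4d782531f6f5c62a555a1'

policy_hash="$(printf '%s\n' "$log_line" | extract_policy_hash)"
echo "$policy_hash"

# The value would then be passed on as in the command above (not executed here):
# contrast set --coordinator-policy-hash="$policy_hash" deployment/
```

If no matching line is found, the helper prints nothing, so a script should check that policy_hash is non-empty before calling contrast set.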
Thanks for the clear comments/instructions, @burgerdev. They allowed me to reproduce the problem cleanly and come up with a fix too (#544).
tl;dr: After the coordinator is deployed and its IP is known, it can take ~70s until the coordinator is reachable.
To build intuition, here are the progress dots (each dot is 0.5 seconds) from 15 tests (with #544):
................................................................
..........................................
..................................................................
................................................................
................................................................
................................................................
................
..........................................................................................................................................
....................................................................
................................................................
................................................................
......
......................................................................................................
....................
..................................................................................
The median is around 30 seconds, but the deviation is big.
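For anyone who wants to turn dot lines like these into numbers, here is a small sketch. The function names dots_to_seconds and median_seconds are made up for illustration; the only input taken from the thread is the 0.5 seconds per dot. The demo values are toy inputs, not the measurements above.

```shell
#!/usr/bin/env bash
export LC_ALL=C  # keep "." as the decimal separator in awk output

# Convert one line of progress dots to seconds, assuming 0.5 s per dot.
dots_to_seconds() {
  local dots="$1"
  awk -v n="${#dots}" 'BEGIN { printf "%.1f\n", n * 0.5 }'
}

# Read dot lines on stdin (one test run per line) and print the median wait
# in seconds. Exits with status 1 if there is no input.
median_seconds() {
  awk '{ print length($0) * 0.5 }' | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) printf "%.1f\n", v[(NR + 1) / 2]
      else        printf "%.1f\n", (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

dots_to_seconds "......"                                  # 6 dots, prints 3.0
printf '%s\n' "...." ".." "........" | median_seconds     # toy input, prints 2.0
```

Pasting the 15 lines above into a file and running median_seconds < file would give the median for this test series.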
FWIW, the 180s delay in #544 was needed today when I tested a deployment to CentralIndia:
contrast set -c "${coordinator}:1313" deployment/
Waiting for coordinator .....................................................................................................................................................................................................................................
✔️ Connected to coordinator
✔️ Manifest set successfully
(229 dots == 114.5 seconds)
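Scripts that call the CLI right after the IP is assigned could wrap a readiness probe in a similar retry loop. This is a generic sketch, not the code from #544: wait_with_dots is a hypothetical helper, the probe command is supplied by the caller, and fractional sleep is assumed to be supported (true for GNU and BSD sleep).

```shell
#!/usr/bin/env bash
# wait_with_dots TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND every 0.5 s and prints one dot per
# failed attempt until COMMAND succeeds or roughly TIMEOUT_SECONDS have passed
# (the runtime of COMMAND itself is not counted).
# Returns 0 on success and 1 on timeout.
wait_with_dots() {
  local timeout="$1"; shift
  local max_attempts=$(( timeout * 2 ))   # two attempts per second
  local attempt=0
  until "$@" >/dev/null 2>&1; do
    attempt=$(( attempt + 1 ))
    if (( attempt >= max_attempts )); then
      printf '\n' >&2
      return 1
    fi
    printf '.' >&2
    sleep 0.5
  done
  if (( attempt > 0 )); then printf '\n' >&2; fi
  return 0
}

# Possible usage (not executed here; assumes nc is installed and that
# $coordinator holds the load balancer IP):
# wait_with_dots 180 nc -z "$coordinator" 1313 && contrast set -c "${coordinator}:1313" deployment/
```

Note that an open TCP port only shows that the load balancer routes traffic; it does not prove that the aTLS handshake will succeed.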
I'm walking through the Confidential emoji voting example.
I got the following error a couple of times:
Approx. 5 minutes later it worked:
Coordinator logs:
What is weird is that the HTTPS requests take >2 seconds.
Please note that I followed the Getting Started guide to the letter. The only difference is that I am using a different location: