edgelesssys / contrast

Deploy and manage confidential containers on Kubernetes
https://docs.edgeless.systems/contrast
GNU Affero General Public License v3.0
161 stars 6 forks source link

set: `rpc error: code = DeadlineExceeded desc = context deadline exceeded` #517

Closed blenessy closed 3 weeks ago

blenessy commented 1 month ago

I'm walking through the Confidential emoji voting example.

I got the following error a couple of times:

root@485f81794475:/work/work# contrast set -c "${coordinator}:1313" deployment/
❌Waiting for coordinator ............................................................
Error: setting manifest: rpc error: code = DeadlineExceeded desc = context deadline exceeded
root@485f81794475:/work/work# contrast set -c "${coordinator}:1313" deployment/
❌Waiting for coordinator ............................................................
Error: setting manifest: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Approx 5 minute later it worked:

contrast set -c "${coordinator}:1313" deployment/ --log-level debug
time=2024-06-01T09:21:45.106Z level=DEBUG msg="Workload owner keys in manifest" keys=[33d3982689bbbe32cde612d863ec799abc35ff469e4dd7a97dfe8848762a7fd7]
time=2024-06-01T09:21:45.384Z level=DEBUG msg="Using KDS cache dir" dir=/root/.cache/contrast/kds
  Waiting for coordinator ....time=2024-06-01T09:21:47.629Z level=INFO msg="Validate called" snp-validator.nonce=4bdba85d3ee1a6ade3dd91039b30dca86e5e0c8a21eb05679bc5b4a84e957fe1
time=2024-06-01T09:21:47.636Z level=INFO msg="Report decoded" snp-validator.reportRaw=02000000020000001f000300000000000100000000000000000000000000000002000000000000000000000000000000000000000100000003000000000008d20100000000000000000000000000000063985e1bda3d86ce3444d98e52dc4a54bd2f778da9111b478d0933e0bdfa69bb8df671b272fd347ef07da1c0a5a8050d66ab9d0ccf798f0ca5d6c21f0672b4ac05cccd97d1eab6bf36497a6685753a043d0a91524c2825f4fb059da787823a7f9ac6263302dd299a9f9c4d88b6527bd650480a5051fd44420f86810b0adac015e34b8448e8f33d912a2b048477ef3ef9230a4b43ce484727614d671295d4c73cd39016429e9b43e53f86ddc8607512f0a5a9ca13153602b0c52b94f32761a8bc000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000f77cec163b5eb6c6368d26fc422cb382eb8c0c45dfde9dcbd716cda504894b08ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff0300000000000873000000000000000000000000000000000000000000000000bbe5ee14456a83902b1e29527b160ced1d486db2be35f9cdb25293fd6bdeed2a4dd8d26310b610fba27fcf62c856b859c5639cb66fcc9dee0ba945e97363f7ae03000000000008730434010004340100030000000000087300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000021ed2af592bc6e723b682899df60fe1ac9acf76c5578ec9c6a886c5a7fad382086cbe69ab72f17b524a33207949165300000000000000000000000000000000000000000000000003833ab328263daa979b3ac9708c47d56e8d50f6379737de63085a8e746971ee17b9c740b8af8c2199001949d17d8961c0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
time=2024-06-01T09:21:47.661Z level=DEBUG msg="file does not exist" kds-cache.file=sha256-3abf69d00aa05c2e5b73a3eb2835c66bfcf56ee0ddb0e2dd746d252e165791f1
time=2024-06-01T09:21:47.661Z level=DEBUG msg="Get not cached" kds-getter.url=https://kdsintf.amd.com/vcek/v1/Milan/crl
..time=2024-06-01T09:21:48.454Z level=INFO msg="Successfully verified report signature"
time=2024-06-01T09:21:48.456Z level=INFO msg="Successfully validated report data"
time=2024-06-01T09:21:48.457Z level=INFO msg="Validate finished successfully"
✔️ Connected to coordinator
✔️ Manifest set successfully

Coordinator logs:

root@485f81794475:/work/work# kubectl logs coordinator-5b8748d946-hlhb6
time=2024-06-01T08:57:26.336Z level=INFO msg="Logger initialized" level=INFO
time=2024-06-01T08:57:26.336Z level=INFO msg="Coordinator started"
time=2024-06-01T08:57:26.347Z level=INFO msg="Coordinator mesh API listening"
time=2024-06-01T08:57:26.355Z level=INFO msg="Coordinator user API listening"
time=2024-06-01T09:21:48.508Z level=INFO msg="SetManifest called"
time=2024-06-01T09:21:48.513Z level=INFO msg="SetManifest succeeded"

What is weird is that the HTTPS requests take >2 seconds.

$ time curl -kv https://108.141.184.236:1313
*   Trying 108.141.184.236:1313...
* Connected to 108.141.184.236 (108.141.184.236) port 1313
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Request CERT (13):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Certificate (11):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256 / [blank] / UNDEF
* ALPN: server did not agree on a protocol. Uses default.
* Server certificate:
*  subject: CN=Constellation
*  start date: Jun  1 07:30:37 2024 GMT
*  expire date: Jun  1 11:30:37 2024 GMT
*  issuer: CN=Constellation
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
* using HTTP/1.x
> GET / HTTP/1.1
> Host: 108.141.184.236:1313
> User-Agent: curl/8.6.0
> Accept: */*
> 
* Received HTTP/0.9 when not allowed
* Closing connection
curl: (1) Received HTTP/0.9 when not allowed
curl -kv https://108.141.184.236:1313  0.01s user 0.01s system 0% cpu 2.219 total

Please note that I followed the Getting Started guide to the letter. The only difference is that I am using a different location:

azLocation="westeurope"
burgerdev commented 4 weeks ago

Hi @blenessy,

Thanks for reporting this issue!

Before we jump into the exceeded deadline, I'd like to point out that the TLS handshake is expected to take longer than usual, because of the attestation work necessary for our aTLS. The Coordinator gathers attestation evidence and presents it in the TLS server auth handshake part. I'll take a look if it actually has to be that long, but it is most likely not the source of the problem here.

The symptom of a deadline exceeded that resolves itself after some time sounds like a load balancer not being ready. I just measured the time it took from LB IP assignment to traffic being routed and arrived at around 15s, which is on the order of magnitude for our TLS handshake timeout (30s). On the other hand, I would only expect this to be a problem when scripted, not when pasting from the tutorial.

There might have been an issue with the evidence gathering, though. It would be interesting to observe this issue with debug logging on - you'd need to skip this step

kubectl apply -f https://github.com/edgelesssys/contrast/releases/latest/download/coordinator.yml

and instead

  1. Download the coordinator.yml file into the deployments folder.
  2. Add environment variables CONTRAST_LOG_LEVEL=debug and CONTRAST_LOG_SUBSYSTEMS=*) to the coordinator container.
  3. contrast generate deployment/
  4. kubectl apply -f deployment/coordinator.yml
  5. Contine with contrast set ....

Cheers, Markus

burgerdev commented 4 weeks ago

Sorry, I forgot to mention one detail: if you modify the coordinator resource like I suggested, its runtime policy is going to change, and the CLI will write something like this:

✔️ Generated workload policy annotations
✔️ Updated manifest manifest.json
time=2024-06-05T08:54:55.679+02:00 level=WARN msg="Found unexpected coordinator policy" name=coordinator hash=d3b11ada658908741a6afd0b2f05145749766d7802f4d782531f6f5c62a555a1

The unexpected hash needs to be passed to contrast set and contrast verify, otherwise they will reject the coordinator.

contrast set --coordinator-policy-hash=d3b11ada658908741a6afd0b2f05145749766d7802f4d782531f6f5c62a555a1 deployment/
blenessy commented 3 weeks ago

Thanks for the clear comments/instructions @burgerdev it allowed me to reproduce the problem cleanly and come up with the fix too (#544).

tl;dr: After deploying the coordinator and we know the IP, it can take ~70s until the coordinator is reachable.

To build intuition - here are the progress dots (each dot is .5 seconds) from 15 tests (with #544):

................................................................
..........................................
..................................................................
................................................................
................................................................
................................................................
................
..........................................................................................................................................
....................................................................
................................................................
................................................................
......
......................................................................................................
....................
..................................................................................

The median is around 30 seconds, but the deviation is big.

blenessy commented 2 weeks ago

FWIW. The 180s delay in #544 was needed today when I tested deployment to CentralIndia:

contrast set -c "${coordinator}:1313" deployment/
  Waiting for coordinator .....................................................................................................................................................................................................................................
✔️ Connected to coordinator
✔️ Manifest set successfully

(229 dots == 114.5 seconds)