divviup / janus

Experimental implementation of the Distributed Aggregation Protocol (DAP) specification.
Mozilla Public License 2.0

Bad latency polling futures when processing expensive aggregation jobs #2955

Closed by divergentdave 5 months ago

divergentdave commented 7 months ago

I'm seeing an aggregator process occasionally get rebooted by Kubernetes because liveness probe requests are timing out.

Liveness probe failed: Get "http://10.64.4.17:8000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This is happening when sending one aggregation job every ~10 seconds on average (coming from four leader replicas), each containing 20 reports, using Prio3SumVecField64MultiproofHmacSha256Aes128 with a length of 100,000. Average aggregation job request latency is 1.75s. CPU usage is around 10%-20%, and CFS is not throttling the process.

The port 8000 server runs in its own Tokio task, and the /healthz endpoint handler returns immediately. One possible cause of this issue is one or two large aggregation job requests monopolizing executor threads, keeping the port 8000 server task from being woken up in time to service its request. This is running on a VM with a single core (two hyperthreads), so it wouldn't take many slow-polling futures to starve the occasional liveness probe request. Profiling or tracing should indicate whether this is due to VDAF preparation, copying, serialization, or deserialization. If the issue is just a small handful of VDAF operations, those should be straightforward to put behind spawn_blocking(), but if the issue is pervasive memcpy()s, that would be harder to fix. Note that Tokio has optional internal runtime metrics, including a histogram of poll durations, but they are unstable and behind a --cfg flag for now.
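To make the spawn_blocking() idea concrete, here is a minimal sketch (not Janus code; `Report`, `PrepareOutput`, and `prepare_report` are placeholders) of moving a CPU-bound preparation loop onto Tokio's blocking pool so it can't hold an executor thread across a long poll:

```rust
use tokio::task;

// Hypothetical placeholder types standing in for Janus's report and
// VDAF preparation output; these are not the real APIs.
struct Report(Vec<u8>);
struct PrepareOutput(Vec<u8>);

// Stand-in for an expensive, CPU-bound VDAF preparation step.
fn prepare_report(report: Report) -> PrepareOutput {
    PrepareOutput(report.0)
}

// Run the CPU-heavy loop on Tokio's blocking thread pool so a large
// aggregation job cannot monopolize executor threads and starve other
// tasks, such as the /healthz handler.
async fn handle_aggregation_job(reports: Vec<Report>) -> Vec<PrepareOutput> {
    task::spawn_blocking(move || {
        reports.into_iter().map(prepare_report).collect()
    })
    .await
    .expect("blocking task panicked")
}
```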

divergentdave commented 7 months ago

I ran the same setup locally and saw that the latency of curl -i http://127.0.0.1:8000/healthz was typically 4ms, but tail latency was as bad as 0.5s-0.8s. Since my computer has four cores instead of one, I would expect the latency impact to be less severe than on the small cloud VM. I ran this under perf and found that 82.9% of samples fell inside PingPongTopology::helper_initialized(). Looking at the timeline in Hotspot, these aggregation job initialization requests seem to take around half a second (noticeably faster than in the cloud, due to CPU speed and network latency differences). Outside of this helper_initialized() call, a couple more percent goes to things like XOF evaluation and "bitvector decoding" inside PingPongTransition::evaluate(); the rest is stack traces that couldn't be recovered.
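As a cross-check on the perf attribution, an in-process wall-clock timer around the suspected call would confirm whether helper_initialized() alone accounts for the ~0.5s. A minimal sketch, assuming the tracing crate and with `expensive_step` standing in for the real call:

```rust
use std::time::Instant;

// Hypothetical helper (not Janus code): time a single CPU-heavy step and
// emit the elapsed time as a tracing event, to compare against the perf
// and Hotspot numbers.
fn time_step<T>(label: &str, expensive_step: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let output = expensive_step();
    tracing::debug!(
        step = label,
        elapsed_ms = start.elapsed().as_millis() as u64,
        "timed step"
    );
    output
}
```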

Based on this, moving selected expensive calls inside spawn_blocking() or onto a rayon thread pool should be a good mitigation (a sketch of the rayon variant follows). This may introduce some overhead of its own, but we can iterate on that if it becomes an issue.
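For the rayon variant, here is a minimal sketch (assumed crates: rayon and tokio; `expensive_vdaf_work` is a placeholder) of handing CPU-bound work to rayon's pool and receiving the result over a oneshot channel:

```rust
use tokio::sync::oneshot;

// Placeholder for a CPU-heavy step such as VDAF preparation; not a Janus API.
fn expensive_vdaf_work(input: Vec<u8>) -> Vec<u8> {
    input
}

// Run the CPU-bound work on rayon's global pool and hand the result back to
// the async caller over a oneshot channel, keeping Tokio executor threads
// free to poll other tasks.
async fn offload_to_rayon(input: Vec<u8>) -> Vec<u8> {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        let output = expensive_vdaf_work(input);
        // The receiver may have been dropped if the caller went away; ignore that case.
        let _ = tx.send(output);
    });
    rx.await.expect("rayon worker dropped the result channel")
}
```

One trade-off between the two approaches: spawn_blocking() is simpler but shares Tokio's blocking pool with other blocking work and can spawn many threads, while a rayon pool caps CPU-bound parallelism independently of the async runtime.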

divergentdave commented 5 months ago

Fixed in #2960.