@hostirosti has observed this in the Google instance used for international ENPA. ISRG has observed this in the instance we run for US ENPA. When we detect this problem, we see that the last thing logged by a stuck facilitator is usually (always?) that it was doing a PUT to S3. We also observe that the stuck facilitator can still respond to metrics scrapes after the thread trying to write to S3 gets stuck, suggesting the process is still running, but the thread writing to S3 can't make progress.
We haven't caught an occurrence of this in an EKS cluster, but we can't be sure it never happens there.
We obtained some core dumps from stuck `facilitator` processes; I've attached the backtraces from all threads to this comment. What we observe is thread #1 stuck under `<facilitator::transport::s3::S3Transport as facilitator::transport::Transport>::put`, which in turn is stuck on `tokio::runtime::Handle::block_on` (the stack is slightly different in Robert's core, which is from a different `facilitator` version).
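For context, here is a minimal sketch of the pattern the backtrace suggests. This is illustrative only, not `facilitator`'s actual code, and the names are hypothetical: a synchronous `put` that wraps an async S3 call by blocking the calling thread on a shared runtime handle. If the future handed to `block_on` can never make progress, the calling thread hangs exactly the way thread #1 appears to.

```rust
// Illustrative only: the shape of a synchronous put() built on top of an
// async client via Handle::block_on. Names here are hypothetical.
use tokio::runtime::{Handle, Runtime};

fn put_object_sync(handle: &Handle, payload: Vec<u8>) -> Result<(), String> {
    // block_on parks the calling (non-runtime) thread until the future
    // resolves. If the future never makes progress -- for example because
    // every runtime worker is itself blocked, or a lock it needs is held
    // elsewhere -- this call never returns.
    handle.block_on(async move {
        // A real implementation would issue a rusoto PutObject request here.
        let _ = payload;
        Ok(())
    })
}

fn main() -> Result<(), String> {
    let runtime = Runtime::new().map_err(|e| e.to_string())?;
    put_object_sync(runtime.handle(), b"batch contents".to_vec())
}
```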
The other threads seem to be Tokio workers waiting for work to do, though I'm not sure I'm reading the backtraces correctly. My feeble understanding of Tokio runtimes is that we should expect two thread pools: one dedicated to well-behaved tasks that yield regularly (via `.await`) and another dedicated to blocking operations (see Tokio's documentation for details). So it might be that threads 3 and 4 are the blocking and non-blocking Tokio thread pools.
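To make that concrete, here is a tiny example of the two pools described above, assuming a multi-threaded tokio 1.x runtime: async tasks run on the core worker threads, while `spawn_blocking` hands blocking work to the separate blocking pool.

```rust
// Sketch of Tokio's two pools: core workers for async tasks that yield at
// .await points, and a blocking pool for work that would otherwise pin a
// worker thread.
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Runs on a core worker thread and yields at the .await.
    let async_task = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
    });

    // Runs on the dedicated blocking pool, leaving core workers free.
    let blocking_task = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_millis(10));
    });

    let _ = tokio::join!(async_task, blocking_task);
}
```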
There is a further thread with some `crossbeam` frames in it. `Cargo.lock` suggests that only `slog-async` depends on `crossbeam`, so this could be a dedicated thread for logging.

Attached backtraces: stuck-rkubis-backtrace.txt, stuck-us-ca-backtrace.txt, stuck-us-wa-backtrace.txt.
I have two broad hypotheses on what could be wrong, and both could be true:

1. How we use our single Tokio runtime. We create one Tokio runtime in `facilitator` and clone handles to it for everything that might need to do async work. Usually that's the AWS API clients that use `rusoto`, and we use `Handle::block_on` to make that appear synchronous. We also use that runtime to run a `warp` server to service metrics scrapes. I think it's possible that we are blocking a Tokio thread while doing something like obtaining GCP credentials, either while waiting on the network or while waiting on a mutex. I'm still not sure why that would cause a deadlock, but it's worth looking at. Also, we don't tell Tokio how many threads to use when creating a runtime, so Tokio infers an appropriate thread count based on the number of available CPUs. In production, all our `facilitator` containers run with resource limits of about 1.5 CPU, so my guess is that from within the container it looks like there is only 1 core, and we may end up with a one-thread threadpool (see the runtime sketch after this list).

2. A deadlock in `src/gcp_oauth.rs`. We use `Arc<RwLock<>>`s in `gcp_oauth` so that we can efficiently share GCP credentials across lots of API clients. Accessing the AWS API from GKE requires GCP credentials, because you have to obtain an identity token for a service account before you can authenticate to an AWS OIDC identity provider. So maybe we have two threads both trying to get GCP credentials and deadlocking. However, I'm not sure why there would ever be two threads contending for that lock, given that the facilitator is largely single-threaded (see the second sketch after this list).
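If the thread-count theory holds, one possible mitigation (sketched below, assuming tokio 1.x and that we keep a single shared runtime) would be to configure the runtime explicitly rather than letting Tokio infer the worker count from the CPUs visible inside the container:

```rust
// Hedged sketch, not the current facilitator code: build the shared runtime
// with an explicit worker thread count so a 1.5-CPU container limit cannot
// leave us with a single worker.
use tokio::runtime::{Builder, Runtime};

fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(4) // chosen explicitly rather than inferred from CPUs
        .enable_all() // timer + I/O drivers, needed by rusoto and warp
        .build()
}
```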
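For the second hypothesis, here is a rough illustration of the sharing pattern in question (hypothetical names and a placeholder token type, not the real `gcp_oauth` code), with a comment marking where a second contender for the lock would stall:

```rust
// Illustrative sketch of sharing a cached GCP token behind Arc<RwLock<...>>.
// The names and the String token are placeholders; the real code fetches an
// OAuth/identity token from the GKE metadata service.
use std::sync::{Arc, RwLock};

#[derive(Clone, Default)]
struct SharedToken {
    cached: Arc<RwLock<Option<String>>>,
}

impl SharedToken {
    fn get(&self) -> String {
        // Fast path: read lock only.
        if let Some(token) = self.cached.read().unwrap().clone() {
            return token;
        }
        // Slow path: take the write lock and refresh. If the refresh call
        // blocked on something that in turn needed this lock (or a Tokio
        // worker that is unavailable), every other caller would pile up on
        // the locks above and the process would appear hung.
        let mut guard = self.cached.write().unwrap();
        let token = refresh_token();
        *guard = Some(token.clone());
        token
    }
}

fn refresh_token() -> String {
    // Placeholder for the network call to the metadata server.
    "identity-token".to_string()
}
```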
Finally, we're currently trying to flush this out with more logging. I've turned up the log level in our `us-ca-apple-intake` worker group to `debug`, in hopes that if we see another container get stuck like this, the detailed logs will show us what it was doing. I also found that we weren't capturing logs from `facilitator`'s dependencies, because `slog` doesn't work out of the box with `log` clients or `tracing`. #1284 resolves that, and if necessary we'll get that out to production to yield more info.
Interestingly, the impact of this bug is mitigated by the fact that we restart `facilitator` once an hour (so that it can notice new keys) and that Kubernetes autoscaling will frequently kill pods and spawn new ones. That's good for availability of the application, but bad for our ability to catch this bug. Besides turning up logging, we may wish to tune `worker-maximum-lifetime` in a locality in an effort to reproduce the bug. If I'm right about it being related to obtaining GCP credentials while hitting S3, then you'd need to run for a bit more than an hour, because otherwise the credentials obtained by the PubSub client the first time it tries to pull a task will likely be valid for the entire lifetime of the process.
Is the hourly restart something that was added recently / with autoscaling?
Although autoscaling might mitigate or obscure the issue: if the deployment is never scaled to zero and one of the older (earlier) pods gets stuck, it would stick around, since Kubernetes always stops the youngest pods first.
> Is the hourly restart something that was added recently / with autoscaling?
Auto-restart was added recently: it came in as part of the recent key rotation development work and went out with `facilitator` version 0.6.19.
Also, it seems that unfortunately the restart does not help with the deadlock issue: it is too "nice", in the sense that it only restarts jobs at the boundary of a work item, so a deadlock within a work item prevents the restart from ever happening.
OK, that would mean that the pod sticks around and autoscaling just adds pods as the queue depth grows when pods get stuck.
We have occasionally observed that `facilitator` `intake-batch` workers get stuck, inasmuch as they no longer do work but are still running. Our current hypothesis is that there is a bug in our code that manages GCP access and/or identity tokens, causing a deadlock.
This issue tracks finding and fixing the problem. I'm using it to compile observations and data about the bug.