The ARC controller appears to be struggling to scale effectively under higher workloads. Without delving too deep into the technical details, here's a brief overview of the issues I've identified:
Lack of Concurrency Handling: The controller reconciles events with a single worker (no concurrency), so performance degrades as event volume increases. As the system processes more events, filtering them becomes less efficient, eventually causing a bottleneck (see the sketch after this list).
Suboptimal Queue/Requeue Logic: The current requeue mechanism seems to be built on assumptions about system behavior, which leads to excessive requeueing. As volume grows, events are requeued multiple times and aren't fully processed until they've cycled through the entire queue several times, significantly delaying completion.
Inaccurate Deletion Logic: The controller inaccurately tracks "running" runners. Some runners that have already completed are still counted as "running" because the controller hasn't yet processed the corresponding events. This creates both logical and metric inconsistencies.
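To make the first two points concrete, here is a minimal sketch of the pattern I'm describing, assuming a standard controller-runtime reconciler; the import path, type names, and requeue interval below are my own approximation for illustration, not a copy of the actual source:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/actions/actions-runner-controller/apis/actions.github.com/v1alpha1"
)

// EphemeralRunnerReconciler is a stand-in for the real reconciler type.
type EphemeralRunnerReconciler struct{}

func (r *EphemeralRunnerReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// With no controller.Options supplied, MaxConcurrentReconciles defaults
	// to 1: a single worker drains one shared work queue for every
	// EphemeralRunner in the cluster.
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.EphemeralRunner{}).
		Complete(r)
}

func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...fetch the runner, create the JIT config/secret/pod, check pod
	// phase and GitHub registration, etc. ...

	// A polling-style requeue puts the object back at the tail of that
	// single queue, so under load each pass waits behind every other
	// queued event, which is consistent with the multi-minute deltas in
	// the log trace below.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```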
While the controller works well under lower loads, once we hit a range of 250-500 runners, we see performance degrade, with runners being consumed faster than they can be created. The only immediate solution has been to increase the minRunners threshold to 1000+, which buffers some demand but is far from optimal and is also quite costly.
I've attached a log trace for a single runner from start to finish. As you can see, there are significant delays in event processing (labeled "Delta"), which align directly with the issues above. These delays show the controller dragging as it works through each event:
2024-11-11T01:40:11.951100251Z Delta: 0s Creating new ephemeral runner registration and updating status with runner config
2024-11-11T01:40:11.951104971Z Delta: 0s Creating ephemeral runner JIT config
2024-11-11T01:40:12.345764305Z Delta: 0.394659s Created ephemeral runner JIT config
2024-11-11T01:40:12.345781359Z Delta: 0.000017s Updating ephemeral runner status with runnerId and runnerJITConfig
2024-11-11T01:40:12.366573082Z Delta: 0.020792s Updated ephemeral runner status with runnerId and runnerJITConfig
2024-11-11T01:42:47.089272009Z Delta: 2m34.722698s Creating new ephemeral runner secret for jitconfig.
2024-11-11T01:42:47.089280789Z Delta: 0s Creating new secret for ephemeral runner
2024-11-11T01:42:47.089283987Z Delta: 0.000003s Created new secret spec for ephemeral runner
2024-11-11T01:42:47.143239979Z Delta: 0.053957s Created ephemeral runner secret
2024-11-11T01:45:01.039514236Z Delta: 2m13.896274s Creating new EphemeralRunner pod.
2024-11-11T01:45:01.039521723Z Delta: 0s Creating new pod for ephemeral runner
2024-11-11T01:45:01.039826539Z Delta: 0.000305s Created new pod spec for ephemeral runner
2024-11-11T01:45:01.065468997Z Delta: 0.025642s Created ephemeral runner pod
2024-11-11T01:47:07.982516769Z Delta: 2m6.917047s Ephemeral runner container is still running
2024-11-11T01:47:07.982523164Z Delta: 0s Updating ephemeral runner status with pod phase
2024-11-11T01:47:07.997171886Z Delta: 0.014648s Updated ephemeral runner status with pod phase
2024-11-11T01:49:13.458332326Z Delta: 2m5.461160s Ephemeral runner container is still running
2024-11-11T02:26:17.150591530Z Delta: 37m3.692259s Ephemeral runner container is still running
2024-11-11T02:28:22.241367217Z Delta: 2m5.090776s Checking if runner exists in GitHub service
2024-11-11T02:28:22.440976881Z Delta: 0.199609s Runner does not exist in GitHub service
2024-11-11T02:28:22.441021853Z Delta: 0.000045s Ephemeral runner has finished since it does not exist in the service anymore
2024-11-11T02:28:22.441026966Z Delta: 0.000005s Updating ephemeral runner status to Finished
2024-11-11T02:28:22.456410443Z Delta: 0.015384s EphemeralRunner status is marked as Finished
2024-11-11T02:28:25.554531254Z Delta: 3.098121s Deleting finished ephemeral runner
2024-11-11T02:30:00.773093367Z Delta: 1m35.218562s Successfully removed runner registration finalizer
2024-11-11T02:31:08.168874157Z Delta: 1m7.395780s Finalizing ephemeral runner
2024-11-11T02:31:08.168893247Z Delta: 0.000019s Cleaning up the runner pod
2024-11-11T02:31:08.168954381Z Delta: 0.000061s Deleting the runner pod
2024-11-11T02:31:08.179298790Z Delta: 0.010344s Waiting for ephemeral runner owned resources to be deleted
2024-11-11T02:32:31.568016529Z Delta: 1m22.388718s Finalizing ephemeral runner
2024-11-11T02:32:31.568021471Z Delta: 0s Cleaning up the runner pod
2024-11-11T02:32:31.568024202Z Delta: 0.000003s Pod is deleted
2024-11-11T02:32:31.568026535Z Delta: 0.000002s Cleaning up the runner jitconfig secret
2024-11-11T02:32:31.613108938Z Delta: 0.045083s Deleting the jitconfig secret
2024-11-11T02:32:31.766720793Z Delta: 0.153611s Waiting for ephemeral runner owned resources to be deleted
2024-11-11T02:34:06.661574344Z Delta: 1m34.894854s Finalizing ephemeral runner
2024-11-11T02:34:06.661595024Z Delta: 0s Cleaning up the runner pod
2024-11-11T02:34:06.661613921Z Delta: 0.000019s Pod is deleted
2024-11-11T02:34:06.661616755Z Delta: 0.000003s Cleaning up the runner jitconfig secret
2024-11-11T02:34:06.664697120Z Delta: 0.003082s Secret is deleted
2024-11-11T02:34:06.664704343Z Delta: 0.000007s Cleaning up runner linked pods
2024-11-11T02:34:06.665456240Z Delta: 0.000749s Runner-linked pods are deleted
2024-11-11T02:34:06.665475678Z Delta: 0.000019s Cleaning up runner linked secrets
2024-11-11T02:34:06.717546247Z Delta: 0.052070s Runner-linked secrets are deleted
2024-11-11T02:34:06.717563460Z Delta: 0.000017s Removing finalizer
2024-11-11T02:34:06.741272493Z Delta: 0.023709s Successfully removed finalizer after cleanup
Even ignoring the deletion process (which may contribute to the queue depth), there are significant delays, often 5+ minutes between events such as pod creation, and this is particularly exacerbated during peak load hours.
While increasing minRunners provides some relief, it's not a long-term solution. We'll need to optimize the controller's concurrency, queue handling, and deletion logic to address these issues effectively.
In our PR we enabled concurrency and saw immediate improvements in performance and scalability. I still believe there is something seriously wrong with the requeue logic and with how queued events are marked finished; however, this change alone gets us out of the position we currently find ourselves in.
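For reference, the shape of the change is roughly the following, revisiting the SetupWithManager sketch above. This is what "enabling concurrency" means in controller-runtime terms rather than a copy of our PR, and the worker count of 20 is an arbitrary illustration (ideally it would be exposed as a flag or Helm value):

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	"github.com/actions/actions-runner-controller/apis/actions.github.com/v1alpha1"
)

func (r *EphemeralRunnerReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.EphemeralRunner{}).
		// Reconcile multiple EphemeralRunner objects in parallel instead of
		// serialising everything through one worker. controller-runtime
		// still guarantees that a given object is never reconciled by two
		// workers at the same time.
		WithOptions(controller.Options{MaxConcurrentReconciles: 20}).
		Complete(r)
}
```

Because the work queue deduplicates by object key, raising MaxConcurrentReconciles only parallelizes work across different runners; it does not introduce races on a single runner, which is why this change is relatively low-risk.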
This was tested in our PROD environment: we deployed the patched version at approximately 16:14, and within roughly five minutes the P99 job latency metrics began coming down to acceptable levels.