actions / runner-container-hooks

Runner Container Hooks for GitHub Actions

`Initialize containers` step consistently takes around ~14 minutes to complete (no image pull issues) #82

Closed: cloudbustinguk closed this issue 1 year ago

cloudbustinguk commented 1 year ago

Hey there,

Just looking for a bit of a pointer here, and to ask whether you've seen this behaviour before.

One of my customers is an early adopter of EKS-Anywhere (v1.23.15-eks-69f0cbf with CRI-O + containerd), currently on vSphere 7.x, and they have containerMode: kubernetes somewhat working using a RunnerDeployment on the current release of ARC (v0.27.4). The runner image is v2.304.0 and the hooks version is 0.3.2. The job container image is ubi8/ubi-minimal:8.8. The "somewhat" part is that every job, no matter the size of the job container, takes almost exactly 14 minutes to get past the Initialize containers step within the workflow, but then goes on to succeed very swiftly (just a "hello world" step next). It's perhaps prudent to note that this is their first attempt at using ARC on this cluster at all; it's still a PoC for them.
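For reference, the relevant part of the RunnerDeployment looks roughly like this (a redacted sketch; the storage request size is a placeholder and the field names are as I recall them from the ARC containerMode: kubernetes docs):

```shell
kubectl apply -f - <<'EOF'
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-XXXXXX-actions-runner-deployment
  namespace: github-actions-runners-eksa-XXXXXX
spec:
  replicas: 1
  template:
    spec:
      repository: XXXXXX/somerepo
      image: YYYYYYYY.jfrog.io/docker/custom-k8s-actions-runner:1.0.12
      imagePullPolicy: IfNotPresent
      # containerMode: kubernetes wires up the container hooks and the work volume
      containerMode: kubernetes
      workVolumeClaimTemplate:
        storageClassName: netapp-CCCC-ontap-nas-economy
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
EOF
```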

The RunnerDeployment's imagePullPolicy is IfNotPresent. In the case where the image is not present, I've debugged this on the nodes themselves, and I can see in tcpdump that a connection to their artifact store isn't actually attempted until ~14 minutes have elapsed. In the case where the image is already present, no pull is attempted of course, but the same duration elapses before the job ultimately succeeds.
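The check was essentially the following, run on the worker node while the job sat in Initialize containers (the artifact store hostname is a placeholder):

```shell
# Watch for any traffic towards the artifact store during the stall
sudo tcpdump -ni any host <artifact-store-host> and port 443
```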

The (two) k8s worker nodes were previously 2 vCPU and 8 GB, so I bumped resources to 8 vCPU and 32 GB, and there has been no change in outcome. I still see no evidence on the nodes or otherwise that a resource shortfall is in play here.

The CSI driver I am using is NetApp Trident, and PVCs get a PV and become Bound within a few seconds - no issues there.

Here is a redacted screenshot of a typical workflow. I can confirm that the image - pulled manually on the worker nodes - does indeed pull down within a few seconds (time crictl pull ...), so I am not convinced this is a network issue.

The screenshot doesn't show it, but line 10 (##[debug]/home/runner/externals/node16/bin/node /runner/k8s/index.js) is where the 14 minutes are spent - see the raw log counterpart further down, which does show it.

[screenshot: fchrome_UeDKOZxTEf]

Here is the raw log excerpt which shows the ~14m delay between the stages:

2023-06-04T12:41:59.4349638Z Complete job name: initial-test-job
2023-06-04T12:41:59.4371584Z ##[debug]Collect running processes for tracking orphan processes.
2023-06-04T12:41:59.4577813Z ##[debug]Finishing: Set up job
2023-06-04T12:41:59.4866818Z ##[debug]Evaluating condition for step: 'Initialize containers'
2023-06-04T12:41:59.4915702Z ##[debug]Evaluating: success()
2023-06-04T12:41:59.4918032Z ##[debug]Evaluating success:
2023-06-04T12:41:59.4934400Z ##[debug]=> true
2023-06-04T12:41:59.4938501Z ##[debug]Result: true
2023-06-04T12:41:59.4964191Z ##[debug]Starting: Initialize containers
2023-06-04T12:41:59.5069467Z ##[debug]Register post job cleanup for stopping/deleting containers.
2023-06-04T12:41:59.6496794Z ##[group]Run '/runner/k8s/index.js'
2023-06-04T12:41:59.6500884Z shell: /home/runner/externals/node16/bin/node {0}
2023-06-04T12:41:59.6501243Z ##[endgroup]
2023-06-04T12:41:59.8054336Z ##[debug]/home/runner/externals/node16/bin/node /runner/k8s/index.js

< --- note timestamps - all the time spent here --- >

2023-06-04T12:55:23.4137069Z ##[debug]Using image 'XXXXXXXX.jfrog.io/docker/ubi8/ubi:8.8' for job image
2023-06-04T12:55:23.4546632Z ##[debug]Job pod created, waiting for it to come online actions-runner-XXXXXX-actions-runner-deployment-fs-workflow
2023-06-04T12:55:54.5558565Z ##[debug]Job pod is ready for traffic
...

The processes on the runner itself are sleeping during this time, and strace (output not included for brevity's sake) shows a typical non-blocking I/O cycle of sleep, wake, check futex, yield and go back to sleep. See the runner processes sleeping below - a state they remain in for almost the entire 14-minute period:

runner@actions-runner-XXXXXX-actions-runner-deployment-lhbgp-2xxrg:/$ ps auxww
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner         1  0.0  0.0   4492  3260 ?        Ss   13:51   0:00 /bin/bash /usr/bin/entrypoint.sh
runner         7  0.0  0.0    224     4 ?        S    13:51   0:00 dumb-init bash
runner        10  0.0  0.0   4360  3288 ?        Ss   13:51   0:00 bash
runner        11  0.0  0.0   4360  3316 ?        S    13:51   0:00 /bin/bash ./run.sh
runner        99  0.0  0.0   4360  3372 ?        S    13:51   0:00 /bin/bash /runner/run-helper.sh
runner       103  0.4  0.3 13753912 101952 ?     Sl   13:51   0:02 /runner/bin/Runner.Listener run
runner       120  0.7  0.3 13750300 107344 ?     Sl   13:51   0:04 /runner/bin/Runner.Worker spawnclient 106 109
runner       158  1.0  0.2 928324 90280 ?        Sl   13:52   0:06 /runner/externals/node16/bin/node /runner/k8s/index.js
runner       235  0.3  0.0   4624  3776 pts/0    Ss   14:03   0:00 bash
runner       242  0.0  0.0   7064  1612 pts/0    R+   14:03   0:00 ps auxww
runner@actions-runner-XXXXXX-actions-runner-deployment-lhbgp-2xxrg:/$

As mentioned, as soon as the 14 minutes have elapsed, the workflow pod comes up within seconds, and the job moves on to successful completion.

Thanks for any pointers or ideas!

cloudbustinguk commented 1 year ago

Extra context in case it's helpful.

The run below is not the same run as in my initial post, but all runs exhibit the same output anyway.

During the "wait" period, the runner logs are effectively spinning on the following:

[WORKER 2023-06-04 15:11:59Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2023-06-04 15:11:59Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2023-06-04 15:11:59Z INFO HostContext] Well known directory 'Work': '/runner/_work'
[RUNNER 2023-06-04 15:12:02Z INFO JobDispatcher] Successfully renew job request 59, job is valid till 06/04/2023 15:22:02
[WORKER 2023-06-04 15:12:09Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2023-06-04 15:12:09Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2023-06-04 15:12:09Z INFO HostContext] Well known directory 'Work': '/runner/_work'
[WORKER 2023-06-04 15:12:19Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2023-06-04 15:12:19Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2023-06-04 15:12:19Z INFO HostContext] Well known directory 'Work': '/runner/_work'
[WORKER 2023-06-04 15:12:29Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2023-06-04 15:12:29Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2023-06-04 15:12:29Z INFO HostContext] Well known directory 'Work': '/runner/_work'
[WORKER 2023-06-04 15:12:39Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
...

.. before the time is up and the rest of the process continues:

...
[WORKER 2023-06-04 15:15:29Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2023-06-04 15:15:29Z INFO HostContext] Well known directory 'Work': '/runner/_work'
[WORKER 2023-06-04 15:15:31Z INFO JobServerQueue] Try to append 1 batches web console lines for record '1ab5e0cd-4ace-4d35-b32b-2935204705d0', success rate: 1/1.
[WORKER 2023-06-04 15:15:34Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[WORKER 2023-06-04 15:15:34Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[WORKER 2023-06-04 15:15:34Z INFO ProcessInvokerWrapper] Finished process 130 with exit code 0, and elapsed time 00:13:45.5155178.
[WORKER 2023-06-04 15:15:34Z INFO JobServerQueue] Try to append 1 batches web console lines for record '1ab5e0cd-4ace-4d35-b32b-2935204705d0', success rate: 1/1.
[WORKER 2023-06-04 15:15:34Z INFO ContainerHookManager] Response file for the hook script at '/runner/k8s/index.js' running command 'PrepareJob' was processed successfully
[WORKER 2023-06-04 15:15:34Z INFO ContainerHookManager] Response file for the hook script at '/runner/k8s/index.js' running command 'PrepareJob' was deleted successfully
[WORKER 2023-06-04 15:15:34Z INFO ContainerHookManager] Global variable 'ContainerHookState' updated successfully for 'PrepareJob' with data found in 'state' property of the response file.
[WORKER 2023-06-04 15:15:34Z INFO StepsRunner] Step result:
[WORKER 2023-06-04 15:15:34Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2023-06-04 15:15:34Z INFO ExecutionContext]   "action": "Pre Job Hook",
...
nikola-jokic commented 1 year ago

Hey @cloudbustinguk

Can you please post the output of kubectl describe on the job pod? I wouldn't expect the image pull to be causing issues, since it is done by the kubelet. From the debug log, what I have noticed is that the alpine check exits with code 1. We need that check to know whether we need to mount node for Alpine, and that is something that could be causing this issue. I'll try reproducing it using this image, but in the meantime, kubectl describe may help show which stages the job pod went through.
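Something along these lines should capture both pods (namespace and pod names are placeholders; the workflow pod is labelled runner-pod=<runner pod name>, so it can also be selected by label):

```shell
kubectl -n <runner-namespace> get pods
kubectl -n <runner-namespace> describe pod <runner-pod-name>
# the workflow/job pod carries a runner-pod=<runner-pod-name> label
kubectl -n <runner-namespace> describe pod -l runner-pod=<runner-pod-name>
```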

cloudbustinguk commented 1 year ago

> Hey @cloudbustinguk
>
> Can you please post the output of kubectl describe on the job pod? I wouldn't expect the image pull to be causing issues, since it is done by the kubelet. From the debug log, what I have noticed is that the alpine check exits with code 1. We need that check to know whether we need to mount node for Alpine, and that is something that could be causing this issue. I'll try reproducing it using this image, but in the meantime, kubectl describe may help show which stages the job pod went through.

Hi @nikola-jokic ,

Many thanks for getting back to me. Below I include the describe output for both the runner pod and the eventual (after ~14 m) workload/job pod.

Please let me know if you need anything else.

Cheers


Runner Pod

Name:         actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9
Namespace:    github-actions-runners-eksa-XXXXXX
Priority:     0
Node:         eks-a-node-1-workload1-md-0-85884886c9-q9j6q.dev.YYYYYYYY.com/10.203.85.80
Start Time:   Tue, 13 Jun 2023 16:59:39 +0000
Labels:       actions-runner=
              actions-runner-controller/inject-registration-token=true
              app.kubernetes.io/instance=actions-runner-XXXXXX
              app.kubernetes.io/name=actions-runner-deployment
              pod-template-hash=9fd7b8dc
              runner-deployment-name=actions-runner-XXXXXX-actions-runner-deployment
              runner-template-hash=5c57dd9bfc
Annotations:  actions-runner/github-api-creds-secret: actions-runner-XXXXXX
              actions-runner/id: 106
              sync-time: 2023-06-13T16:59:35Z
Status:       Running
IP:           192.168.2.164
IPs:
  IP:           192.168.2.164
Controlled By:  Runner/actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9
Containers:
  runner:
    Container ID:   containerd://59cc7f8b66dc52ac02416a99a5409afe091da922276a1abea209db0943ed7f61
    Image:          YYYYYYYY.jfrog.io/docker/custom-k8s-actions-runner:1.0.12
    Image ID:       YYYYYYYY.jfrog.io/docker/custom-k8s-actions-runner@sha256:cd92d3c03ee945b845a48c049791fcb7a01a06157fa67a1df2fcf0336baa6bec
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 13 Jun 2023 16:59:48 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:     2
      memory:  4Gi
    Environment:
      RUNNER_ORG:
      RUNNER_REPO:                             XXXXXX/somerepo
      RUNNER_ENTERPRISE:
      RUNNER_LABELS:                           self-hosted,XXXXXX
      RUNNER_GROUP:                            Default
      DOCKER_ENABLED:                          false
      DOCKERD_IN_RUNNER:                       false
      GITHUB_URL:                              https://github.YYYYYYYY.com/
      RUNNER_WORKDIR:                          /runner/_work
      RUNNER_EPHEMERAL:                        true
      RUNNER_STATUS_UPDATE_HOOK:               true
      GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT:  actions-runner-controller/v0.27.4
      ACTIONS_RUNNER_CONTAINER_HOOKS:          /runner/k8s/index.js
      ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER:    true
      ACTIONS_RUNNER_POD_NAME:                 actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9 (v1:metadata.name)
      ACTIONS_RUNNER_JOB_NAMESPACE:            github-actions-runners-eksa-XXXXXX (v1:metadata.namespace)
      ACTIONS_RUNNER_REQUIRE_SAME_NODE:        true
      RUNNER_NAME:                             actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9
      RUNNER_TOKEN:                            REDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTEDREDACTED
    Mounts:
      /gpfs/CCCC-eksa-dev from CCCC-eksa-dev (ro)
      /runner from runner (rw)
      /runner/_work from work (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6m2nf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  work:
    Type:          EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
    StorageClass:  netapp-CCCC-ontap-nas-economy
    Volume:
    Labels:            <none>
    Annotations:       <none>
    Capacity:
    Access Modes:
    VolumeMode:    Filesystem
  runner:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  CCCC-eksa-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /gpfs/CCCC-eksa-dev
    HostPathType:
  kube-api-access-6m2nf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Warning  FailedScheduling        13m   default-scheduler        0/4 nodes are available: 4 waiting for ephemeral volume controller to create the persistentvolumeclaim "actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9-work".
  Warning  FailedScheduling        13m   default-scheduler        0/4 nodes are available: 4 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled               13m   default-scheduler        Successfully assigned github-actions-runners-eksa-XXXXXX/actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9 to eks-a-node-1-workload1-md-0-85884886c9-q9j6q.dev.YYYYYYYY.com
  Normal   SuccessfulAttachVolume  13m   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-bcca1b33-9b52-472e-b893-989b52c9068e"
  Normal   Pulled                  13m   kubelet                  Container image "YYYYYYYY.jfrog.io/docker/custom-k8s-actions-runner:1.0.12" already present on machine
  Normal   Created                 13m   kubelet                  Created container runner
  Normal   Started                 13m   kubelet                  Started container runner

Workload Pod

Name:         actions-runner-XXXXXX-actions-runner-deployment-7h-workflow
Namespace:    github-actions-runners-eksa-XXXXXX
Priority:     0
Node:         eks-a-node-1-workload1-md-0-85884886c9-q9j6q.dev.YYYYYYYY.com/10.203.85.80
Start Time:   Tue, 13 Jun 2023 17:24:47 +0000
Labels:       runner-pod=actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9
Annotations:  <none>
Status:       Running
IP:           192.168.2.203
IPs:
  IP:  192.168.2.203
Containers:
  job:
    Container ID:  containerd://18687bc551a95d9619adb720703ed2d5d9c8b5032700c6d32e86c7c5bca81159
    Image:         YYYYYYYY.jfrog.io/docker/ubi8/ubi:8.8
    Image ID:      YYYYYYYY.jfrog.io/docker/ubi8/ubi@sha256:a7143118671dfc61aca46e8ab9e488500495a3c4c73a69577ca9386564614c13
    Port:          <none>
    Host Port:     <none>
    Command:
      tail
    Args:
      -f
      /dev/null
    State:          Running
      Started:      Tue, 13 Jun 2023 17:24:49 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /__e from work (rw,path="externals")
      /__w from work (rw)
      /github/home from work (rw,path="_temp/_github_home")
      /github/workflow from work (rw,path="_temp/_github_workflow")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xfbc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  work:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  actions-runner-XXXXXX-actions-runner-deployment-7hjrn-dwhk9-work
    ReadOnly:   false
  kube-api-access-8xfbc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulled   26s   kubelet  Container image "YYYYYYYY.jfrog.io/docker/ubi8/ubi:8.8" already present on machine
  Normal  Created  25s   kubelet  Created container job
  Normal  Started  25s   kubelet  Started container job
nikola-jokic commented 1 year ago

I think you have a huge delay in the k8s scheduler. Based on your output, it is really a cluster issue: the persistent volume could not be created on that node for 13 minutes, and only after 13 minutes was the volume attached. Only then does the action execute, within a minute. It seems to me that the scheduler could not find a node where this volume could be attached, and once it did, the action executed quickly.

cloudbustinguk commented 1 year ago

> I think you have a huge delay in the k8s scheduler. Based on your output, it is really a cluster issue: the persistent volume could not be created on that node for 13 minutes, and only after 13 minutes was the volume attached. Only then does the action execute, within a minute. It seems to me that the scheduler could not find a node where this volume could be attached, and once it did, the action executed quickly.

From what I can see, this isn't the case. The 13m is a very unfortunate coincidence here: all of those events occurred at the same time, which was about 13 minutes between deploying the runner itself and me actually, manually starting the Actions workflow (my apologies for the confusion!). The relative timestamps are to be read as "13 minutes ago".

The sequence as far as I understand it, and have just tested again to confirm:

  1. Scheduler waits for the CSI to bind the PVC / make the PV (a recent attempt took a few seconds)
  2. PV is bound and the pod can now be assigned to a node (a few seconds)
  3. Volume gets attached (effectively instantaneous after step 2)
  4. Container gets created (effectively instant ...)
  5. Container starts (effectively instant ...)

If I've somehow missed the point, do let me know, but I still maintain that the issue is as before.

nikola-jokic commented 1 year ago

Oh, I see... my bad...

I'll let you know as soon as I'm able to reproduce it.

cloudbustinguk commented 1 year ago

@nikola-jokic Just an update here as I re-check my facts. A new strace (this time filtered on file I/O syscalls) shows a curiously slow pattern of openat() calls for many of npm's packages. This leads me to scrutinise the underlying PV's CSI backend (NetApp Trident / NFS). I've done some basic performance tests, and I believe that read performance is incongruously slow compared to write. The mount options look OK on the nodes, so I'll follow up with the customer's on-site storage team to see if they're doing QoS on the SVM they gave me.
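Roughly what I ran, in case it helps anyone following along (the PID is whatever ps shows for /runner/k8s/index.js; the small-file loop is just a crude, arbitrary test):

```shell
# Attach to the hook's node process and watch file-related syscalls only
strace -f -tt -e trace=%file -p <pid-of-index.js-node-process>

# On the worker node: check the NFS mount options Trident applied
nfsstat -m

# Crude small-file write/read test on the work volume
cd /runner/_work
time bash -c 'for i in $(seq 1 2000); do echo x > smallfile.$i; done'
time cat smallfile.* > /dev/null
rm -f smallfile.*
```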

Will get back ASAP.

cloudbustinguk commented 1 year ago

@nikola-jokic Thanks for your patience. I've switched the storageClass away from NetApp (which I found out is backed by spinning disk; despite it having Flash Cache, and tuning NFS to avoid too much metadata/file I/O overhead, performance did not improve much at all - tens of seconds at most) to GPFS, and this problem completely disappeared.

As you and I discussed directly over mail, the main issue here is that the externals step sets up different versions of node, and this results in a storm of small-file I/O (which is anathema to NFS performance). As NetApp Trident CSI is more and more prevalent these days (especially for on-prem k8s solutions), this is going to be a sensitive area unless the NetApp is SSD-based or similar (and even then, small-file I/O will present a challenge depending on the use case).
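To get a feel for how much small-file I/O the externals copy generates, counting the files under the runner's externals directory is illustrative (path as on our runner image; counts will vary by runner version):

```shell
# How many files does the externals copy touch, and how big is it overall?
find /home/runner/externals -type f | wc -l
du -sh /home/runner/externals
```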

Hope this helps future users, and I thank you for your time and support!

nixlim commented 3 months ago

@cloudbustinguk Quick question, if I may - how did you get the runner to output the debug logging? I am facing an identical 14-minute problem on AKS with the PVC backed by blob storage. I would like to debug my runners but have so far failed to get the same output as you have shown in the screenshot.

cloudbustinguk commented 3 months ago

> @cloudbustinguk Quick question, if I may - how did you get the runner to output the debug logging? I am facing an identical 14-minute problem on AKS with the PVC backed by blob storage. I would like to debug my runners but have so far failed to get the same output as you have shown in the screenshot.

Hey there @nixlim! If you mean the first screenshot of the Actions job, then when you click Re-run all jobs on an already-run job's summary page, there is a checkbox for Enable debug logging. The later log excerpts are effectively those same logs, either downloaded from the cog wheel inside the job/steps view or taken from the self-hosted runner's directory (in our case).
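If you want debug output on every run rather than per re-run, you can also set the standard Actions debug secrets on the repo - a quick sketch with the gh CLI (repo name is a placeholder):

```shell
# Runner diagnostic logs and step debug logs, respectively
gh secret set ACTIONS_RUNNER_DEBUG --body true --repo <org>/<repo>
gh secret set ACTIONS_STEP_DEBUG --body true --repo <org>/<repo>
```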

In your case with blob storage, it's likely the IOPS are too low, given the number of node modules that get unpacked. Is it in scope for you to add a disk/SSD-based StorageClass and try it out?
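For example, to see what you already have available to point the work volume claim at (on AKS the built-in managed-csi / managed-csi-premium classes are Azure Disk-backed, if I recall correctly):

```shell
# List available storage classes and their provisioners
kubectl get storageclass
```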

nixlim commented 3 months ago

Awesome! Thank you - this entire post and your work have solved 5 days of OCD "Why does it take so long?" for me - 🙇🏻‍♂️

Yeah, we switched from Azure NFS to Fuse, which brought the time down to 3-4 minutes. We could switch to Disk as well, although I am not 100% sure whether it would bring the time down any further.

cloudbustinguk commented 3 months ago

> Awesome! Thank you - this entire post and your work have solved 5 days of OCD "Why does it take so long?" for me - 🙇🏻‍♂️
>
> Yeah, we switched from Azure NFS to Fuse, which brought the time down to 3-4 minutes. We could switch to Disk as well, although I am not 100% sure whether it would bring the time down any further.

You're most welcome! Glad it's sorted.

nixlim commented 3 months ago

For anyone looking into this and working with AKS - we solved the time-delay issue by customising the hook (among other things, removing the copying of externals during hook initialisation) and then baking the externals into both the workflow pod image and the runner pod image. This has allowed us to arrive at a scalable solution - pods now take seconds to get the job up and running.

anlesk commented 2 months ago

We are experiencing exactly the same issue, but with EFS on EKS. Pod initialization takes about 3-4 minutes, with no hints in the logs available on the runner as to what exactly is happening during that time. Once the workflow pod is created, the execution takes seconds to complete.

@nixlim can you share your modified hook templates and tell us which externals you moved to the runner, please?

nixlim commented 2 months ago

@anlesk Hey! So, basically what is happening is that the hook copies the /externals directory - that is, Node16 and Node20 - to the workflow pod. These are huge-ish, and you are effectively copying them over the network, since the copy actually goes to the PVC (mounted as a volume on both the runner and the job/workflow pod).

So, what we did:

  1. From the hook implementation we removed the copyExternals command - we simplified the TypeScript code for the hook, removed a bunch of stuff we didn't need, and then built our own TypeScript compilation of the hook. That is part of the modified runner image we use.
  2. We copied the /externals directory from the runner image into the image we use to build the workflow pod (see the sketch below) and provided the volumeMounts and volumes pointing to that location. The path is kept the same as in the original runner, but now it comes with the content already baked in.
  3. So now the /externals directory is baked into the image and does not need to be copied - it takes seconds to initialise.
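The image side is really just a multi-stage copy, roughly like this (a generic sketch, not our actual code; the runner image tag, base image and registry are assumptions):

```shell
cat > Dockerfile <<'EOF'
# Stage 1: any runner image that already contains the externals directory
FROM ghcr.io/actions/actions-runner:latest AS runner

# Stage 2: the job/workflow image, with externals baked in at the same path
FROM registry.access.redhat.com/ubi8/ubi:8.8
COPY --from=runner /home/runner/externals /home/runner/externals
EOF
docker build -t <registry>/job-image-with-externals:latest .
```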

Unfortunately, I cannot provide the actual code - I will get fired :)