The purpose of this project is to run GitHub Actions on-premises on our Slurm cluster.
flowchart LR
GitHubAPI[("GitHub API")]
ActionsRunners[("Allocator")]
Slurm[("Slurm Compute Resources")]
ActionsRunners --> | Poll Queued Jobs | GitHubAPI
ActionsRunners -->| Allocate Actions Runner| Slurm
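The allocator flow in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: it assumes the queued jobs have already been fetched from the GitHub REST API as a parsed JSON payload with a `jobs` list whose entries carry a `status` field, and the script name `run_runner.sh`, the `gha-` job-name prefix, and the resource defaults are all hypothetical.

```python
import shlex


def queued_jobs(jobs_payload):
    """Return the jobs in a parsed GitHub API response that are still queued."""
    return [j for j in jobs_payload.get("jobs", []) if j.get("status") == "queued"]


def sbatch_command(job, script="run_runner.sh", cpus=4, mem="8G", time_limit="02:00:00"):
    """Build an sbatch argv that allocates one Actions runner for one job.

    The script name and resource defaults are hypothetical placeholders.
    """
    return [
        "sbatch",
        f"--job-name=gha-{job['id']}",
        f"--cpus-per-task={cpus}",
        f"--mem={mem}",
        f"--time={time_limit}",
        script,
    ]


# Example payload shaped like a "list jobs" response (values are made up).
payload = {
    "jobs": [
        {"id": 101, "status": "queued"},
        {"id": 102, "status": "in_progress"},
    ]
}

commands = [sbatch_command(job) for job in queued_jobs(payload)]
for command in commands:
    # Print instead of executing so the sketch runs without a Slurm cluster.
    print(shlex.join(command))
```

In a real allocator the printed command would be passed to `subprocess.run`, and the loop would repeat on a polling interval.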
graph TD
A[Docker Rootless Daemon] -->| Creates | B[Docker Rootless Socket]
B -->| Creates | C[Custom Actions Runner Image]
C -.->| Calls | B
C --->| Mounts | B
C -->| Creates | E[CI Helper Containers]
E -.->| Calls | B
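Because CI Docker commands go through the same Docker socket as the runner, they also see the same filesystem: a path passed to Docker is resolved by the daemon behind that socket, not inside the runner container. You need to configure the working directory of your runners accordingly.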
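One way to read this requirement is that the runner's working directory should exist at the same absolute path on the host and inside the runner container, so that bind mounts requested by CI helper containers resolve to the same files. The sketch below builds such a `docker run` command. The base directory, image name, and per-job directory naming are hypothetical; only the `-v` flag and the in-container socket path `/var/run/docker.sock` are standard Docker conventions.

```python
import os


def runner_work_dir(base_dir, slurm_job_id):
    """Give each runner its own working directory, keyed by Slurm job ID."""
    return os.path.join(base_dir, f"runner-{slurm_job_id}")


def runner_docker_args(work_dir, socket_path, image):
    """Build a docker run argv that mounts the work dir at an identical path.

    Mounting work_dir to the same path inside the container means a path
    the runner hands to Docker is also valid on the daemon's side.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{socket_path}:/var/run/docker.sock",
        "-v", f"{work_dir}:{work_dir}",
        image,
    ]


# Hypothetical values for illustration only.
work_dir = runner_work_dir("/scratch/gha", "42")
args = runner_docker_args(work_dir, "/run/user/1000/docker.sock", "actions-runner:latest")
print(" ".join(args))
```

The runner inside the container would then need to be configured to use that same path as its work folder.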
After we were able to run the Actions runner image as a Slurm job using sbatch and a custom script, we ran into the issue of having to pull the Docker image for every job. Roughly 2 minutes passed between the script allocating resources and the job starting. When you are running 70+ jobs in a workflow, with some jobs depending on others, this time adds up fast.
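To make the cost concrete: jobs that depend on each other cannot overlap their startup delays, so the delay is paid once per link in the longest dependency chain. The sketch below computes that for a small, entirely hypothetical dependency graph, using the ~2 minute figure from above.

```python
from functools import lru_cache


def longest_chain(deps):
    """Return the number of jobs on the longest dependency chain.

    deps maps each job name to a tuple of the jobs it depends on.
    """
    @lru_cache(maxsize=None)
    def depth(job):
        return 1 + max((depth(d) for d in deps[job]), default=0)

    return max(depth(job) for job in deps)


# Hypothetical workflow: lint runs alone, the others form a chain of three.
deps = {
    "build": (),
    "test": ("build",),
    "package": ("test",),
    "lint": (),
}

PULL_MINUTES = 2  # approximate per-job startup delay quoted above

chain = longest_chain(deps)
overhead = chain * PULL_MINUTES
print(f"longest chain: {chain} jobs, added startup delay: {overhead} minutes")
```

Even with unlimited parallelism, this workflow would finish at least 6 minutes later than it would with no per-job image pull.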
Unfortunately, caching the image is not an elegant solution, because it would require mounting a filesystem directory into the Slurm job. To support multiple concurrent runners we would need multiple directories, plus a system to manage them, which would introduce the potential for starvation and deadlocks.
This led us to investigate a Docker pull-through cache.
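As a rough sketch of what the client side of a pull-through cache looks like, the snippet below generates a Docker daemon configuration that points at a registry mirror. `registry-mirrors` is the standard daemon setting for this. The mirror URL is a hypothetical placeholder, and the sketch assumes a registry is already running in proxy mode at that address. Note that Docker's registry mirrors apply to Docker Hub images.

```python
import json


def daemon_config(mirror_url):
    """Return a Docker daemon config dict that routes pulls through a mirror."""
    return {"registry-mirrors": [mirror_url]}


# Hypothetical mirror address; replace with the cache's real host and port.
config_text = json.dumps(daemon_config("http://registry-cache.internal:5000"), indent=2)
print(config_text)
```

For a rootless daemon, this JSON would typically be written to `~/.config/docker/daemon.json` before the daemon starts.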