actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.73k stars 1.12k forks source link

One of the job steps hangs and fail without any log messages #2243

Open omerlh opened 1 year ago

omerlh commented 1 year ago

Checks

Controller Version

v0.27.0

Helm Chart Version

0.22.0

CertManager Version

v1.9.1

Deployment Method

Helm

cert-manager installation

Using the official helm chart - https://charts.jetstack.io

Checks

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: default-runner
spec:
  template:
    spec:
      organization: goledge
      image: <our custom image, see dockerfile bellow>
      dockerdWithinRunnerContainer: true
      labels:
        - self-hosted
        - linux
      priorityClassName: low

FROM summerwind/actions-runner-dind@sha256:7c7a5dd8839644de4149da9150a19910930928bc9651ee40eec904567ae348ad

RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add - \
  && echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list \
  && sudo apt-get install ca-certificates curl gnupg lsb-release \
  && sudo mkdir -p /etc/apt/keyrings \
  && echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list \
  && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg \
  && sudo apt update && sudo apt install --no-install-recommends yarn docker-compose-plugin default-jdk amazon-ecr-credential-helper -y \
  && sudo apt-get install -y \
  && java -version \
  && wget https://github.com/bazelbuild/bazelisk/releases/download/v1.14.0/bazelisk-linux-amd64 -O /tmp/bazelisk && sudo mv /tmp/bazelisk /usr/local/bin/bazelisk && sudo chmod +x /usr/local/bin/bazelisk && sudo ln -s /usr/local/bin/bazelisk /usr/local/bin/bazel  \
  && bazelisk --version && bazel --version

### To Reproduce

```markdown
I am not sure how TBH

Describe the bug

Recently we noticed job failing without any clear indication about the why:

image

This is the hang, it usually fails after job timeout time. There are no logs anywhere, so I have no idea why it is failing

Describe the expected behavior

Job runs without failures and not hanging

Whole Controller Logs

https://gist.github.com/omerlh/0e958a197c0fc4ea1be83814fb17f444

Whole Runner Pod Logs

https://gist.github.com/omerlh/0e958a197c0fc4ea1be83814fb17f444

Additional Context

No response

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

omerlh commented 1 year ago

We suspect this might be related to Karpenter removing a node when the runner pod is running on, and lack of PDB does not block this operation. I saw the chart does not have one, maybe worth adding?