gitlab runner issue - Githubissues

lsm5 commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

I have followed the steps for enabling podman as a gitlab runner on gitlab.com. podman.socket is enabled and active yet new jobs consistently fail.

If I run systemctl --user status podman.socket and then retry the CI jobs, they pass. But the error comes back on the next hourly run.

Steps to reproduce the issue:

Enable podman as a gitlab runner on Fedora or CentOS Stream

Describe the results you received:

ERROR: Failed to remove network for build
ERROR: Preparation failed: Cannot connect to the Docker daemon at unix:///run/user/1000/podman/podman.sock. Is the docker daemon running? (docker.go:863:0s)
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: Cannot connect to the Docker daemon at unix:///run/user/1000/podman/podman.sock. Is the docker daemon running? (docker.go:863:0s)
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: Cannot connect to the Docker daemon at unix:///run/user/1000/podman/podman.sock. Is the docker daemon running? (docker.go:863:0s)
Will be retried in 3s ...
ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///run/user/1000/podman/podman.sock. Is the docker daemon running? (docker.go:863:0s)

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Latest available podman on Fedora 37 or CentOS 9 Stream

Additional environment details (AWS, VirtualBox, physical, etc.):

The runner instance is on GCE.

Luap99 commented 1 year ago

Can you check systemctl --user status podman.service, maybe the podman system service process itself is broken and not accepting new connections?

lsm5 commented 1 year ago

It's always been active whenever I've checked. Or, if I have to wildly speculate, it became active after running that command and started accepting jobs, which, so far, is the only explanation I have.

$ systemctl --user status podman.socket
● podman.socket - Podman API Socket
     Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; preset: disabled)
     Active: active (listening) since Fri 2022-09-30 14:42:47 UTC; 21min ago
      Until: Fri 2022-09-30 14:42:47 UTC; 21min ago
   Triggers: ● podman.service
       Docs: man:podman-system-service(1)
     Listen: /run/user/1000/podman/podman.sock (Stream)
     CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/podman.socket

Sep 30 14:42:47 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal systemd[1704]: Listening on podman.socket - Podman API Socket.

See the continuously failing jobs in the pipeline list at: https://gitlab.com/rhcontainerbot/pkg-builder/-/pipelines . The 2nd one from top which succeeded was a result of a manual retry, and the most recent one was automatically run only a few minutes after the manual rerun, so I guess the socket stayed active in that time interval.

Here's runner config info if it helps:

$ sudo cat /etc/gitlab-runner/config.toml
concurrent = 50
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "lmandvek-fedora-gitlab-runner"
  url = "https://gitlab.com"
  id = 17760853
  token = $TOKEN_REDACTED
  token_obtained_at = 2022-09-27T18:26:21Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  environment = ["FF_NETWORK_PER_BUILD=0"]
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "registry.gitlab.com/rhcontainerbot/pkg-builder:fedora"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
    host = "unix:///run/user/1000/podman/podman.sock"
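
A quick sanity check on the host value above is to confirm that the path it points at exists and really is a Unix socket. The following is a minimal sketch, not part of the runner or of podman; the helper names socket_path_from_host and is_unix_socket are made up for illustration.

```python
import os
import stat
from urllib.parse import urlparse


def socket_path_from_host(host: str) -> str:
    """Extract the filesystem path from a unix:// host value (hypothetical helper)."""
    parsed = urlparse(host)
    if parsed.scheme != "unix":
        raise ValueError(f"not a unix socket URL: {host!r}")
    return parsed.path


def is_unix_socket(path: str) -> bool:
    """Return True only if the path exists and is a socket file."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing path, missing parent directory, or no permission to stat it.
        return False
```

Running is_unix_socket(socket_path_from_host("unix:///run/user/1000/podman/podman.sock")) as the runner's user only shows that the socket file is present. It says nothing about whether the service behind it answers.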

Luap99 commented 1 year ago

Not the socket, the service! systemctl --user status podman.service. As long as the system service process is still running, systemd will not spawn a new service process. Maybe the podman system service process gets stuck.

lsm5 commented 1 year ago

ah whoops, i read your previous comment wrong, my bad.

$ systemctl --user status podman.service
○ podman.service - Podman API Service
     Loaded: loaded (/usr/lib/systemd/user/podman.service; disabled; preset: disabled)
     Active: inactive (dead) since Fri 2022-09-30 15:02:31 UTC; 11min ago
   Duration: 1min 6.588s
TriggeredBy: ● podman.socket
       Docs: man:podman-system-service(1)
    Process: 12768 ExecStart=/usr/bin/podman $LOGGING system service (code=exited, status=0/SUCCESS)
   Main PID: 12768 (code=exited, status=0/SUCCESS)
        CPU: 23.629s

Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: @ - - [30/Sep/2022:15:02:26 +0000] "DELETE /v1.41/containers/754fdda1700be63ff3f8d77689b5a2c5832d09a33c2f15d862509dca6e00153e?force=1&v=1 HTTP/1.1" 204 0 "" "Go-http-client/1.1"
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: time="2022-09-30T15:02:26Z" level=info msg="Request Failed(Internal Server Error): container 754fdda1700be63ff3f8d77689b5a2c5832d09a33c2f15d862509dca6e00153e does not exist in database: no such container"
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: @ - - [30/Sep/2022:15:02:26 +0000] "GET /v1.41/networks HTTP/1.1" 500 178 "" "Go-http-client/1.1"
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: 2022-09-30 15:02:26.376422149 +0000 UTC m=+61.273672352 container remove 177d5ba021688d30e59c21afdae84a7cca6314f31515f9397ba070b55fdd3e33 (image=registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-43b2dc3d, name>
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: @ - - [30/Sep/2022:15:02:26 +0000] "DELETE /v1.41/containers/177d5ba021688d30e59c21afdae84a7cca6314f31515f9397ba070b55fdd3e33?force=1&v=1 HTTP/1.1" 204 0 "" "Go-http-client/1.1"
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: 2022-09-30 15:02:26.582305812 +0000 UTC m=+61.479556035 container remove 1be38a82e621e1b53c1faf098bb5ee2b2283d6a0e0e113c7c09e647a75686dbe (image=registry.gitlab.com/rhcontainerbot/pkg-builder:fedora, name=runner-uf1gckrg-project-132>
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: @ - - [30/Sep/2022:15:02:26 +0000] "DELETE /v1.41/containers/1be38a82e621e1b53c1faf098bb5ee2b2283d6a0e0e113c7c09e647a75686dbe?force=1&v=1 HTTP/1.1" 204 0 "" "Go-http-client/1.1"
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: 2022-09-30 15:02:26.593532671 +0000 UTC m=+61.490782904 container remove e381c291e8490744c830688870eff6f8062b6c1f9b8f1cda09875d61dc61b9c5 (image=registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-43b2dc3d, name>
Sep 30 15:02:26 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal podman[12768]: @ - - [30/Sep/2022:15:02:26 +0000] "DELETE /v1.41/containers/e381c291e8490744c830688870eff6f8062b6c1f9b8f1cda09875d61dc61b9c5?force=1&v=1 HTTP/1.1" 204 0 "" "Go-http-client/1.1"
Sep 30 15:02:31 lmandvek-fedora-gitlab-runner.c.libpod-218412.internal systemd[1704]: podman.service: Consumed 23.629s CPU time.

lsm5 commented 1 year ago

do i need to enable the service and keep it enabled explicitly? The current docs don't mention it https://docs.gitlab.com/runner/executors/docker.html so maybe that needs to change?

Luap99 commented 1 year ago

No, the socket should start the service once a connection happens. The podman service process will then exit after 5 seconds without active connections. For the next connection systemd should start it again. It looks like it exited in your output, so I would expect the systemd socket to start it again.
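
One way to exercise this restart-on-demand behaviour is to probe the API, wait longer than the 5 second idle timeout so the service has time to exit, and probe again. The sketch below assumes the caller supplies a probe callable that raises OSError when the API cannot be reached. probe_across_idle and its parameters are illustrative names, not anything provided by podman or systemd.

```python
import time
from typing import Callable, List


def probe_across_idle(
    probe: Callable[[], None],
    rounds: int = 3,
    idle_seconds: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Call probe() several times with an idle gap before each repeat.

    idle_seconds defaults to slightly more than the 5 second idle
    timeout mentioned in the thread, so the service should have exited
    between probes and each later probe needs a fresh activation.
    """
    results: List[bool] = []
    for attempt in range(rounds):
        if attempt > 0:
            sleep(idle_seconds)  # let the service go idle and exit
        try:
            probe()
            results.append(True)
        except OSError:
            # Covers ConnectionRefusedError, FileNotFoundError and timeouts.
            results.append(False)
    return results
```

If the first probe succeeds and later ones fail, that would suggest re-activation after the idle exit is the part that breaks.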

Can you try to curl the socket manually and see if this works? Or just use podman-remote. If this works the gitlab runner is doing something weird.
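
For anyone who wants to script that manual check, here is a minimal sketch that talks to the socket directly using only the Python standard library. It assumes the service answers the Docker-compatible GET /_ping endpoint. The function name ping_api_socket is made up for this example.

```python
import socket


def ping_api_socket(sock_path: str, timeout: float = 10.0) -> str:
    """Send GET /_ping over a Unix socket and return the HTTP status line.

    Raises OSError (for example ConnectionRefusedError or
    FileNotFoundError) if nothing is listening on sock_path.
    """
    # The Host value is arbitrary because the request goes over a Unix socket.
    request = b"GET /_ping HTTP/1.1\r\nHost: d\r\nConnection: close\r\n\r\n"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        client.connect(sock_path)
        client.sendall(request)
        response = b""
        # Read only until the first line (the status line) is complete.
        while b"\r\n" not in response:
            chunk = client.recv(4096)
            if not chunk:
                break
            response += chunk
    return response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
```

Calling ping_api_socket("/run/user/1000/podman/podman.sock") as the same user that runs the jobs should return a status line if the socket activates the service. An exception or an empty string would point at the podman side and not at the runner.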

lsm5 commented 1 year ago

things seem to work better with enabling a system connection. I'll keep checking how runs go over the next few days. Thanks @Luap99

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

lsm5 commented 1 year ago

Closing this one as I'll likely be using @cevich's podman-in-podman method which is waiting on https://github.com/containers/podman/issues/16576

cevich commented 1 year ago

FWIW: I initially hit the "Failed to remove network for build" error too, and found that for gitlab-runner inside a container you pretty much have to use host-mode networking. Podman's networking differs too much from Docker's for the runner to handle.

containers / podman

gitlab runner issue #15997