docker / build-push-action

GitHub Action to build and push Docker images with Buildx
https://github.com/marketplace/actions/build-and-push-docker-images
Apache License 2.0

buildx failed with: ERROR: failed to solve: rpc error: code = Unknown desc = open (arm64) #1231

Open ssbarnea opened 1 month ago

ssbarnea commented 1 month ago

Description

I recently started to see some failures, which so far seem to happen only on the arm64 runner (we run amd64 and arm64 in parallel on different runners).

#27 exporting to docker image format
#27 exporting layers
#27 exporting layers 35.7s done
#27 exporting manifest sha256:98a526cdb689c3794974e1c755ceb5231df925e5e1446847e02dc748bddf1bc6 done
#27 exporting config sha256:07187ea785e7d24d2beab5a8fde12c3709fc73a9079a587fb78b7b5902e98e57 done
#27 sending tarball
#27 ...

#28 importing to docker
#28 loading layer abecb16ce073 11.31kB / 11.31kB 0.1s done
#28 ERROR: open /var/lib/docker/overlay2/4d64ef876800e0f0b614fb7dd9698ec08025a8ee71ef0215f512b45cc038b5d8/.tmp-committed1515084540: no such file or directory

#27 exporting to docker image format
#27 sending tarball 6.4s done
#27 ERROR: rpc error: code = Unknown desc = open /var/lib/docker/overlay2/4d64ef876800e0f0b614fb7dd9698ec08025a8ee71ef0215f512b45cc038b5d8/.tmp-committed1515084540: no such file or directory

Example:

Expected behaviour

Pass the build

Actual behaviour

Failing

Repository URL

No response

Workflow run URL

No response

YAML workflow

-

Workflow logs

No response

BuildKit logs

https://github.com/ansible/ansible-dev-tools/actions/runs/11069017162/job/30755940313?pr=382#step:6:3610

Additional info

This runner is using ubuntu 24.04 and docker info reports:

 docker info
Client: Docker Engine - Community
 Version:    27.2.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 27.2.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1016-aws
 Operating System: Ubuntu 24.04.1 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 2
 Total Memory: 3.742GiB
 Name: ip-10-0-1-209
 ID: d13c7458-d626-491f-b9d9-5f0e2a661a08
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

crazy-max commented 1 month ago

This runner is using ubuntu 24.04

Seems to be a self-hosted runner from what I see in run logs: https://github.com/ansible/ansible-dev-tools/actions/runs/11069017162/job/30755940313?pr=382#step:1:2

Current runner version: '2.319.1'
Runner name: 'devtools-arm64-runner'
Runner group name: 'Default'
Machine name: 'ip-10-0-1-209'
Testing runner upgrade compatibility

It's quite hard to figure out what's going on, as there are nested composite actions in this repo. I also can't see the changes on the related PR anymore: https://github.com/ansible/ansible-dev-tools/pull/382.

Can you create a small repro please? Thanks

ssbarnea commented 1 month ago

I will. This error does not always reproduce; it seems to be random, with roughly a 1-in-4 chance of happening. I know that this might be related to the machine itself, but the reality is that the error is quite opaque, not giving any hints on why this might happen or where to look for details. As this is a permanent runner, I can easily get access to the logs. Once the PR gets in, it will be easier to look at the workflow.
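Since the failure shows up in only about one run in four, a repro attempt needs several runs before a clean streak means anything. Below is a rough sketch of that arithmetic. It assumes the runs are independent and that the 1/4 estimate is accurate; both are assumptions, and `runs_needed` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that P(at least one failure in n runs) >= confidence.

    Assumes independent runs with a constant failure rate, which is an
    idealisation for a flaky self-hosted runner.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(no failure in n runs) = (1 - failure_rate) ** n, which must be
    # at most 1 - confidence; solve for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Failure rate of 1/4 taken from the estimate above.
    print(runs_needed(0.25, 0.95))
```

Under these assumptions, about 11 runs give a 95% chance of hitting the error at least once.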

crazy-max commented 1 month ago

not giving any hints on why this might happen or where to look for details. As this is a permanent runner, I can easily get access to the logs.

If you can give us the buildkit container logs that would help: https://docs.docker.com/build/ci/github-actions/configure-builder/#buildkit-container-logs

ssbarnea commented 1 month ago

I had another run with --debug and got the logs in the GHA console at https://github.com/ansible/ansible-dev-tools/actions/runs/11071410923/job/30763513199?pr=382#step:9:49 - I guess you should be able to read them? No need to export them in any way, I hope.

crazy-max commented 1 month ago

#28 importing to docker
#28 loading layer abecb16ce073 11.31kB / 11.31kB 0.1s done
#28 ERROR: open /var/lib/docker/overlay2/4d64ef876800e0f0b614fb7dd9698ec08025a8ee71ef0215f512b45cc038b5d8/.tmp-committed1123400293: no such file or directory

Hmm, this might be an issue with the Docker Engine on this self-hosted runner, as this happens when loading the image into the Docker store. Can you provide the Docker daemon logs?
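To line the build error up with the daemon logs, it may help to pull the overlay2 layer ID out of the BuildKit output and search the daemon logs for it. This is a minimal sketch: `missing_overlay_layers` is a hypothetical helper, and the pattern only covers the two error shapes quoted in this issue. Where the daemon logs live is also an assumption; on a systemd host they are usually available via `journalctl -u docker.service`.

```python
import re

# Matches BuildKit step errors such as:
#   #28 ERROR: open /var/lib/docker/overlay2/<id>/.tmp-committed123: no such file or directory
# and the "rpc error" variant reported by the exporting step.
MISSING_FILE = re.compile(
    r"^#(?P<step>\d+) ERROR: (?:rpc error: code = Unknown desc = )?"
    r"open (?P<path>\S+): no such file or directory$"
)


def missing_overlay_layers(log_text: str) -> list[tuple[int, str]]:
    """Return (step number, overlay2 layer ID) for each missing-file error."""
    found = []
    for line in log_text.splitlines():
        match = MISSING_FILE.match(line.strip())
        if not match:
            continue
        parts = match.group("path").split("/")
        # The layer ID is the path component right after "overlay2".
        if "overlay2" in parts[:-1]:
            layer_id = parts[parts.index("overlay2") + 1]
            found.append((int(match.group("step")), layer_id))
    return found


if __name__ == "__main__":
    sample = (
        "#28 importing to docker\n"
        "#28 loading layer abecb16ce073 11.31kB / 11.31kB 0.1s done\n"
        "#28 ERROR: open /var/lib/docker/overlay2/"
        "4d64ef876800e0f0b614fb7dd9698ec08025a8ee71ef0215f512b45cc038b5d8"
        "/.tmp-committed1123400293: no such file or directory\n"
    )
    for step, layer in missing_overlay_layers(sample):
        print(step, layer)
```

The layer ID it prints can then be used as a search term in the daemon logs around the time of the failed run.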

  Client: Docker Engine - Community
   Version:    27.2.1
   Context:    default
   Debug Mode: false
   Plugins:
    buildx: Docker Buildx (Docker Inc.)
      Version:  v0.16.2
      Path:     /usr/libexec/docker/cli-plugins/docker-buildx
    compose: Docker Compose (Docker Inc.)
      Version:  v2.29.2
      Path:     /usr/libexec/docker/cli-plugins/docker-compose

  Server:
   Containers: 1
    Running: 1
    Paused: 0
    Stopped: 0
   Images: 3
   Server Version: 27.2.1
   Storage Driver: overlay2
    Backing Filesystem: extfs
    Supports d_type: true
    Using metacopy: false
    Native Overlay Diff: true
    userxattr: false
   Logging Driver: json-file
   Cgroup Driver: systemd
   Cgroup Version: 2
   Plugins:
    Volume: local
    Network: bridge host ipvlan macvlan null overlay
    Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
   Swarm: inactive
   Runtimes: io.containerd.runc.v2 runc
   Default Runtime: runc
   Init Binary: docker-init
   containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
   runc version: v1.1.14-0-g2c9f560
   init version: de40ad0
   Security Options:
    apparmor
    seccomp
     Profile: builtin
    cgroupns
   Kernel Version: 6.8.0-1016-aws
   Operating System: Ubuntu 24.04.1 LTS
   OSType: linux
   Architecture: aarch64
   CPUs: 2
   Total Memory: 3.742GiB
   Name: ip-10-0-1-209
   ID: d13c7458-d626-491f-b9d9-5f0e2a661a08
   Docker Root Dir: /var/lib/docker
   Debug Mode: false
   Experimental: false
   Insecure Registries:
    127.0.0.0/8

And looking at docker info https://github.com/ansible/ansible-dev-tools/actions/runs/11071410923/job/30763513199?pr=382#step:6:132, I wonder if this has something to do with Native Overlay Diff: true. Or maybe disk space?
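A quick way to rule disk space in or out on the runner is to check both free bytes and free inodes on the filesystem that holds the Docker root dir (/var/lib/docker in the docker info output above). This is only a sketch: `storage_report` is a hypothetical helper, it relies on `os.statvfs` (POSIX only), and a healthy-looking result does not exclude a transient shortage during the build.

```python
import os
import shutil


def storage_report(path: str) -> dict:
    """Report free space and free inodes for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    stats = os.statvfs(path)  # POSIX only
    # Some filesystems report zero totals; return None rather than divide by zero.
    free_pct = 100.0 * usage.free / usage.total if usage.total else None
    inodes_free_pct = 100.0 * stats.f_favail / stats.f_files if stats.f_files else None
    return {
        "free_gib": usage.free / 2**30,
        "free_pct": free_pct,
        "inodes_free_pct": inodes_free_pct,
    }


if __name__ == "__main__":
    # Docker root dir from `docker info`; fall back to "/" if it is absent.
    root = "/var/lib/docker" if os.path.isdir("/var/lib/docker") else "/"
    for key, value in storage_report(root).items():
        print(f"{key}: {value}")
```

Running it immediately after a failed job, before any cleanup step, gives the most useful reading.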