docker / buildx

Docker CLI plugin for extended build capabilities with BuildKit

`docker buildx build` gets stuck if the buildkit container dies #556

Open ojab opened 3 years ago

ojab commented 3 years ago

buildx-0.5.1, moby/buildkit:buildx-stable-1 (be8e8392f56c), Docker version 20.10.5, build 55c4c88 on linux x86_64.

  1. Run `docker buildx build --platform=local -o . git://github.com/docker/buildx`.
  2. While the build is in progress, run `docker exec -ti buildx_buildkit_builder-builder0 kill -s QUIT 1`, where `buildx_buildkit_builder-builder0` is the name of the BuildKit container.
  3. `docker buildx build` hangs indefinitely (a combined repro sketch follows these steps).
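
For reference, a minimal sketch that chains the three steps above into one script. The builder name repro-builder is illustrative (any builder on the docker-container driver should do), and the container name follows the buildx_buildkit_<node> convention used above:

# repro.sh
# Create a dedicated container-driver builder and select it.
docker buildx create --name repro-builder --use

# Start the build in the background.
docker buildx build --platform=local -o . git://github.com/docker/buildx &
BUILD_PID=$!

# Give the build a moment to start, then kill PID 1 (buildkitd)
# inside the builder container.
sleep 10
docker exec buildx_buildkit_repro-builder0 kill -s QUIT 1

# Because of this bug the build never exits; this wait blocks forever.
wait "$BUILD_PID"
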
holograph commented 2 years ago

Happens with 0.7.1 as well: github.com/docker/buildx v0.7.1-docker 05846896d149da05f3d6fd1e7770da187b52a247; Docker version 20.10.12, build e91ed57

ceelian commented 2 years ago

I have a similar issue when doing a buildx bake: it hangs on random commands. I've already spent three days investigating, but it is very hard to track down.

At first I thought it was something with the hcl files, because when I changed something there and some things in the Dockerfile, it sometimes worked. But then I changed something again and the previous version stopped working, although it had worked a couple of minutes earlier.

So whether it works or not depends on the changes you make to the files, but it doesn't seem to matter what you change; it is more related to which files you touch and the order in which you change them. We have split the configuration across multiple files, but I could also reproduce it with a single hcl file and multiple Dockerfiles.

So it is probably related to some internal scheduling based on which files were changed (config, Dockerfile)? Could that be?

I also found out that it is very likely related, to some extent, to the contexts option in the .hcl files. We use it to link to other targets to implement a kind of build order. The problematic (but in our case necessary) Dockerfile command is COPY --from:

FROM bar AS foo
...
COPY --from=foo ...
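
For what it's worth, outside of bake the same wiring can be expressed with buildx's --build-context flag (available since buildx 0.8). The sketch below is a hypothetical CLI analogue of the contexts map, reusing image names from the demo project below, and it assumes a Dockerfile that actually references the named context (e.g. FROM thirdapp):

# Build the dependency first and load it into the local image store.
docker buildx build -f thirdapp.Dockerfile -t example.com/third:latest --load .

# Point the named context "thirdapp" at that image (docker-image:// source),
# roughly what contexts = { "thirdapp" = "target:third" } does in bake.
docker buildx build -f anotherapp.Dockerfile \
    --build-context thirdapp=docker-image://example.com/third:latest \
    -t example.com/another:latest --load .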

The overall demo project looks like this:

# docker-bake.hcl
target "inh" {
    context   = "."
    cache-from = [
        "type=local,src=/Users/ceelian/tmp/buildx_cache"
    ]
    cache-to = ["type=local,dest=/Users/ceelian/tmp/buildx_cache"]
}

target "base" {
    inherits = ["inh"]
    dockerfile = "baseapp.Dockerfile"
    tags       = ["example.com/base:latest"]
}

target "third" {
    inherits = ["inh"]
    dockerfile = "thirdapp.Dockerfile"
    tags       = ["example.com/third:latest"]
}

target "another" {
    inherits = ["inh"]
    contexts = {
        "thirdapp" = "target:third",
    }
    dockerfile = "anotherapp.Dockerfile"
    tags       = ["example.com/another:latest"]
}

target "app" {
    inherits = ["inh"]
    dockerfile = "Dockerfile"
    contexts = {
        "baseapp" = "target:base",
        "anotherapp" = "target:another",
    }
    tags       = ["example.com/testapp:latest"]
}

# anotherapp.Dockerfile
FROM alpine:3.15.3
RUN ["touch", "another.txt"]

# baseapp.Dockerfile
FROM python:3.10.4-alpine3.15
RUN ["touch", "hello.txt"]

# thirdapp.Dockerfile
FROM alpine:3.15.3
RUN ["touch", "third.txt"]

# Dockerfile
FROM baseapp
FROM anotherapp
FROM python:3.8.6-alpine

COPY --from=0 /hello.txt /hello.txt
COPY --from=1 /another.txt /another.txt
RUN echo "Hello world"

# The commands to start the build
$ docker buildx create --name mybuilder --node mybuilder0 \
  --platform linux/arm64,linux/riscv64,linux/ppc64le,linux/s390x,linux/mips64le,linux/mips64,linux/arm/v7,linux/arm/v6,linux/amd64 \
  --driver-opt env.BUILDKIT_STEP_LOG_MAX_SIZE=10000000 --driver-opt env.BUILDKIT_STEP_LOG_MAX_SPEED=10000000
$ docker buildx bake --load -f docker-bake.hcl --builder mybuilder  app

I also tried the --no-cache flag and even introduced an ARG CACHEBUST based on Sebastian's idea at https://www.freecodecamp.org/news/docker-cache-tutorial/, but in the end I couldn't reliably reproduce the error.
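
For anyone trying the same approach, the cache-busting pattern mentioned above usually looks roughly like this; the arg name CACHEBUST is just a convention, and the --set wiring for bake is my assumption, not something from this thread:

# In the Dockerfile, declare the arg just above the steps whose cache
# should be invalidated:
#   ARG CACHEBUST=1
#   RUN echo "bust: $CACHEBUST" && <expensive step>

# Plain build: pass a changing value so the layer cache misses.
docker buildx build --build-arg CACHEBUST="$(date +%s)" .

# With bake, the same thing per target via --set:
docker buildx bake -f docker-bake.hcl --set "app.args.CACHEBUST=$(date +%s)" app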

The next step is to try to work around the issue by combining the individual dependent images into a single multi-stage image, in the hope that without the contexts the COPY --from will no longer make the build process hang indefinitely.

If you need any logs, please just tell me how to get them and I can add them here.

My docker info:

docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.8.1)
  compose: Docker Compose (Docker Inc., v2.3.3)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 46
  Running: 41
  Paused: 0
  Stopped: 5
 Images: 99
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.104-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 31.31GiB
 Name: docker-desktop
 ID: 7UQR:JSMW:UI4R:HPKG:7VPM:T5TT:UKJO:UNX3:DBCT:W5XF:CL5L:VOIO
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
ceelian commented 2 years ago

If anyone faces the same issue, I "solved" it by removing all "contexts" from the bake-config.hcl files.

I recombined all the previously separated Dockerfiles into a few independent multi-stage Dockerfiles. That way I could remove the "dependencies" from the bake-config.hcl files, so I no longer needed the "contexts" sections and could remove all of them. Now the build seems quite reliable locally on amd64 and arm64, and on the CI system.

This doesn't solve the issue but it is a workaround.
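
For anyone wanting to try the same workaround, here is a rough sketch based on the demo project above; the stage layout is my assumption, not ceelian's actual files. Former standalone targets become named stages, so no bake contexts are needed to wire them together:

# Write a combined multi-stage Dockerfile and build it without contexts.
cat > Dockerfile.combined <<'EOF'
FROM python:3.10.4-alpine3.15 AS baseapp
RUN ["touch", "hello.txt"]

FROM alpine:3.15.3 AS anotherapp
RUN ["touch", "another.txt"]

FROM python:3.8.6-alpine
COPY --from=baseapp /hello.txt /hello.txt
COPY --from=anotherapp /another.txt /another.txt
RUN echo "Hello world"
EOF

docker buildx build -f Dockerfile.combined -t example.com/testapp:latest --load .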

ssbarnea commented 1 month ago

I am facing the same issue, and I want to find a way to reliably identify these leftover builder containers and prune them regularly. Any ideas? (One possible approach is sketched after the listing below.)

# docker ps -a
CONTAINER ID   IMAGE                           COMMAND                  CREATED      STATUS      PORTS     NAMES
2ee4bc562996   moby/buildkit:buildx-stable-1   "buildkitd --allow-i…"   2 days ago   Up 2 days             buildx_buildkit_builder-885de0ba-d701-4918-a9a8-ce82331cebc30
6505805d2d78   moby/buildkit:buildx-stable-1   "buildkitd --allow-i…"   5 days ago   Up 5 days             buildx_buildkit_builder-b0bcf721-0454-4a21-836f-6f03a4f4efb80
e6dd9f70086b   32aa1a493317                    "buildkitd --allow-i…"   9 days ago   Up 8 days             buildx_buildkit_builder-dded1ecb-4d51-4beb-944d-f8bf2e39653e0
475bfabb6ee3   32aa1a493317                    "buildkitd --allow-i…"   9 days ago   Up 8 days             buildx_buildkit_builder-c219ed3c-04a9-4417-b6ed-5474e43da7bc0
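
Not a fix for the underlying hang, but one possible cleanup approach, sketched under the assumption that leftover builders still show up in docker buildx ls and that their containers keep the buildx_buildkit_ name prefix visible above:

# Preferred: remove stale builders by name, which also removes their containers.
# Names come from `docker buildx ls`; "old-builder" here is illustrative.
docker buildx ls
docker buildx rm old-builder

# Catch-all: force-remove any remaining BuildKit containers by name prefix.
# This can leave stale entries in `docker buildx ls`, so prefer buildx rm.
# (-r is GNU xargs; drop it on BSD/macOS.)
docker ps -aq --filter "name=buildx_buildkit_" | xargs -r docker rm -f

# Optionally reclaim build cache held by the current builder.
docker buildx prune -f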