docker / buildx

Docker CLI plugin for extended build capabilities with BuildKit

`docker buildx inspect --bootstrap` ignores an error and exits successfully when one of two nodes fails #2730

Open moleus opened 1 month ago

moleus commented 1 month ago

Description

In our CI we set up a buildx builder with two nodes, arm64 and amd64, to build multi-arch Docker images. The CI runs `docker buildx inspect --bootstrap` to check that both nodes are up and ready. When one node fails (because of a timeout), the inspect command ignores the failure, prints the summary, and completes with exit code 0.

Expected behaviour

`docker buildx inspect --bootstrap` should exit with a non-zero exit code when one of the nodes fails to start. If the command failed properly, CI could catch the non-zero status code and retry bootstrapping the nodes.
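
For illustration, this is roughly the retry loop our CI would like to rely on. It is only a sketch (the builder name and retry count are made up, not taken from our setup), and it assumes the exit code of docker buildx inspect --bootstrap actually reflects a failed node:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func main() {
        // Hypothetical builder name and retry count, just for the example.
        const builder = "back-demo-buildx"

        for attempt := 1; attempt <= 3; attempt++ {
            cmd := exec.Command("docker", "buildx", "inspect", "--bootstrap", builder)
            cmd.Stdout = os.Stdout
            cmd.Stderr = os.Stderr

            // With the current behaviour Run() reports success even when a
            // node never comes up, so the retry below is never reached.
            if err := cmd.Run(); err == nil {
                return
            }
            fmt.Fprintf(os.Stderr, "bootstrap attempt %d failed, retrying\n", attempt)
        }
        os.Exit(1)
    }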

Actual behaviour

When a single node fails, driver.Boot returns an error, which is stored in the errCh channel. printer.Wait() never returns an error, so err is nil and len(errCh) == 1. The condition err == nil && len(errCh) == len(toBoot) is therefore false, and the last return statement executes with true, nil. For more details see my gist with a minimal reproducible example. https://github.com/docker/buildx/blob/65c475647313b3edc502dcdd48b8ce4128b60dee/builder/builder.go#L178-L206

printer.Wait(): https://github.com/docker/buildx/blob/65c475647313b3edc502dcdd48b8ce4128b60dee/util/progress/printer.go#L39-L45

pw.err is assigned here: https://github.com/docker/buildx/blob/65c475647313b3edc502dcdd48b8ce4128b60dee/util/progress/printer.go#L133

but UpdateFrom never sets that error, except in the ctx.Done case: https://github.com/docker/buildx/blob/171fcbeb69d67c90ba7f44f41a9e418f6a6ec1da/vendor/github.com/moby/buildkit/util/progress/progressui/display.go#L80-L80
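
To make the flow above concrete, here is a minimal self-contained sketch of the logic as I understand it (stand-in names, not the actual buildx code): two nodes are booted, one fails, the stand-in for printer.Wait() returns nil, and the len(errCh) == len(toBoot) check lets the error slip through:

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    // boot stands in for driver.Boot; the amd64 node times out.
    func boot(node string) error {
        if node == "amd64-node" {
            return errors.New("expected 1 replicas to be ready, got 0")
        }
        return nil
    }

    func bootAll(toBoot []string) (bool, error) {
        errCh := make(chan error, len(toBoot))
        var wg sync.WaitGroup
        for _, node := range toBoot {
            wg.Add(1)
            go func(node string) {
                defer wg.Done()
                if err := boot(node); err != nil {
                    errCh <- err
                }
            }(node)
        }
        wg.Wait()

        // Stand-in for printer.Wait(), which only reports a context error,
        // so err stays nil even though one node failed to boot.
        var err error

        if err == nil && len(errCh) == len(toBoot) { // 1 == 2, so this is false
            return false, <-errCh
        }
        return true, err // the boot error sitting in errCh is silently dropped
    }

    func main() {
        ok, err := bootAll([]string{"arm64-node", "amd64-node"})
        fmt.Println(ok, err) // prints: true <nil>
    }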

Buildx version

v0.14.0

Docker info

No response

Builders list

-

Configuration

Here is a link to a gist with code that reproduces the bug in the (b *Builder) Boot() function: https://gist.github.com/moleus/6bf1ad37f47dd86ed12f4c4345601e22

Build logs

+ docker buildx inspect --bootstrap
#1 [arm64-demo-jenkins-job-name internal] booting buildkit
#1 waiting for 1 pods to be ready
#1 ...
#2 [amd64-demo-jenkins-job-name internal] booting buildkit
#2 ...
#1 [arm64-demo-jenkins-job-name internal] booting buildkit
#1 waiting for 1 pods to be ready 93.3s done
#1 DONE 93.4s
#2 [amd64-demo-jenkins-job-name internal] booting buildkit
#2 waiting for 1 pods to be ready 109.5s done
#2 ERROR: expected 1 replicas to be ready, got 0
------
 > [amd64-demo-jenkins-job-name internal] booting buildkit:
------
Name:          back-demo-buildx        <- *comment*: this should not be printed; the command should exit with 1
Driver:        kubernetes
Last Activity: 2024-10-10 09:56:43 +0000 UTC
Nodes:
Name:                  amd64-demo-jenkins-job-name
Endpoint:              kubernetes:///back-demo-buildx?deployment=amd64-demo-jenkins-job-name&kubeconfig=
Driver Options:        loadbalance="random" namespace="demo-namespace" ...
Status:                inactive
BuildKit daemon flags: --oci-worker-gc-keepstorage 8000 --allow-insecure-entitlement=network.host
Platforms:             linux/amd64*
Name:                  arm64-demo-jenkins-job-name
Endpoint:              kubernetes:///back-demo-buildx?deployment=arm64-demo-jenkins-job-name&kubeconfig=
Driver Options:        loadbalance="random" namespace="demo-namespace" ...
Status:                inactive
BuildKit daemon flags: --oci-worker-gc-keepstorage 8000 --allow-insecure-entitlement=network.host
Platforms:             linux/arm64*

Additional info

I suggest replacing

    if err == nil && len(errCh) == len(toBoot) {  // errCh contains only one error -> condition fails
        return false, <-errCh
    }
    return true, err  // returns true, nil

with

    if err == nil && len(errCh) > 0 {
        return false, <-errCh
    }
    return true, err  // only reached when no node failed to boot

so that inspect fails when one of the nodes fails to start.
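
To spell out the difference for the two-node case from the logs above (a throwaway illustration, not part of the patch): the existing condition only surfaces an error when every booted node fails, while len(errCh) > 0 returns the first error as soon as any node fails.

    package main

    import "fmt"

    func main() {
        var err error    // printer.Wait() returned nil
        toBoot := 2      // two nodes were booted
        failedBoots := 1 // one of them timed out, so errCh holds one error

        oldCond := err == nil && failedBoots == toBoot // false: error ignored, exit code 0
        newCond := err == nil && failedBoots > 0       // true: error surfaced, non-zero exit

        fmt.Println(oldCond, newCond) // prints: false true
    }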