docker / compose

Define and run multi-container applications with Docker
https://docs.docker.com/compose/
Apache License 2.0

[BUG] Overlay network not found on worker node #11894

Open thormme opened 5 months ago

thormme commented 5 months ago

Description

Issue: Swarm worker hosts fail to attach to manager node overlay networks unless a container has been manually started and attached to the network using docker run --network swarm-overlay

Expected Behavior: The worker's container should attach to the overlay network automatically, and the network should then be visible in docker network ls.

$> docker network ls
8e3c351af333   bridge             bridge    local
0cbc0420c111   docker_gwbridge    bridge    local
x8gb7mz6s222   swarm-overlay      overlay   swarm
c09ad17a7321   host               host      local
keth4xuub123   ingress            overlay   swarm
d8baa27f3654   none               null      local

Workaround: The only solution I have found is to downgrade to an earlier version (2.21.0-1) of docker-compose-plugin

sudo apt list -a docker-compose-plugin
sudo apt install docker-compose-plugin=2.21.0-1~debian.11~bullseye

I believe this is the same issue as https://github.com/docker/compose/issues/11387 but I couldn't find any open bugs with the same issue.

Thanks for any help with this!

Steps To Reproduce

I created a custom overlay network on the swarm manager node.

...
  service:
    image: service-image
    container_name: service
    networks:
      - swarm-overlay
    restart: unless-stopped
...
networks:
  swarm-overlay:
    attachable: true
    driver: overlay

This correctly created the network and attached the relevant container to it.

I then joined a worker host to the swarm and attempted to connect a container to the overlay network.

...
  worker-service:
    image: worker-image
    container_name: worker-service
    networks:
      swarm-overlay:
        aliases:
          - host1-worker-service
    restart: unless-stopped
...
networks:
  swarm-overlay:
    external: true
    driver: overlay

Running docker compose up -d worker-service errors with:

Error response from daemon: network swarm-overlay not found

Compose Version

docker-compose-plugin/bullseye 2.27.1-1~debian.11~bullseye
Docker Compose version v2.27.1

Docker Environment

Client: Docker Engine - Community
 Version:    26.1.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 12
  Running: 5
  Paused: 0
  Stopped: 7
 Images: 31
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: 2brhg9vzj8m47oyo40ie5yj0u
  Is Manager: false
  Node Address: 1.2.3.4
  Manager Addresses:
   4.3.2.1:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d2d58213f83a351ca8f528a95fbd145f5654e957
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-28-cloud-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 13.42GiB
 Name: cloud-machine
 ID: 6c0ae974-1ba3-450a-ab03-d31b31c6097f
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Anything else?

No response

ndeloof commented 5 months ago

This isn't the same issue as #11387: here it is the Docker engine that reports the error: Error response from daemon: network swarm-overlay not found

Can you please confirm you can use docker run --network swarm-overlay ... to run an equivalent container on the worker node with this swarm setup?

jsunstrom commented 5 months ago

I'm running into this exact same issue using Docker Compose 2.27.0. I can confirm that I can use docker run -it --name alpine1 --network test-net alpine from the official documentation. I walked through the entire "Use an overlay network for standalone containers" tutorial and it worked as expected.

However, using docker compose files, I also get the error Error response from daemon: network <my network name here> not found message using docker compose up -d.

ambretanmay commented 5 months ago

I am having the exact same issue with Docker Compose version v2.27.1.

@ndeloof docker run --network swarm-overlay works and compose doesn't.

inql commented 5 months ago

btw is the downgrade workaround needed for both leader and worker node?

ambretanmay commented 5 months ago

@inql I have not tested this as our scripts set versions for all nodes.

michaelmcandrew commented 4 months ago

Hey there, also affected by this bug.

If you don't want to downgrade, another workaround is to create a container and attach it to the network. The network then appears in the list and docker compose no longer complains:

docker run -dit --name keep-alive --network <network_name> --restart=always alpine

Adding --restart=always will ensure that it survives restarts of the docker daemon, etc.
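
The keep-alive workaround can be scripted so that the placeholder container is only started when the network is not yet visible on the worker. This is a minimal sketch, not an official recipe: the function names and the keep-alive-<network> container name are illustrative, and it assumes the overlay network was created as attachable.

```shell
#!/bin/sh
# Sketch only: function names and the container name are illustrative.

# Succeeds if a network with exactly this name shows up in `docker network ls`.
# `--filter name=` matches substrings, so grep -x enforces an exact match.
network_visible() {
    docker network ls --filter "name=$1" --format '{{.Name}}' | grep -Fxq -- "$1"
}

# Starts a placeholder container on the network unless the network is already visible.
ensure_network_attached() {
    if network_visible "$1"; then
        return 0
    fi
    docker run -dit --name "keep-alive-$1" --network "$1" --restart=always alpine >/dev/null
}
```

On a worker this could be called as ensure_network_attached swarm-overlay && docker compose up -d.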

My versions in case it is useful:

$ docker version
Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:02:50 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:50 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker compose version
Docker Compose version v2.28.1

kulpsin commented 4 months ago

As noted above (sorry, I did not realise that @michaelmcandrew had already mentioned this), this comment at least confirms his findings: https://github.com/docker/compose/issues/11894#issuecomment-2206522846

I tested this issue and noticed that if a running container is already connected to the external overlay network (started with docker run ..., so that the network is visible in docker network ls), then Compose is able to connect to the external overlay network.

So, without knowing anything about internals, the problem might have something to do with not checking for available external overlay networks but instead checking just internal networks (visible with docker network ls).

So, as an additional workaround, you can first start a "dummy" container on the workers, for example:

$ docker compose up -d
Error response from daemon: network <overlay-network> not found
$ docker run -dit --rm --name dummy-network-container --network <overlay-network> alpine
43924b1b25ac73373aac9120b55ac46fc1de3435ce26485682e11d6c06671936
$ docker compose up -d
[+] Running 1/0
 ✔ Container worker-service  Started
$ _

I also checked downgrading, and on Ubuntu 22.04 it worked, so I think I will use the downgraded version for now:

sudo apt-get remove docker-compose-plugin && sudo apt-get install docker-compose-plugin=2.21.0-1~ubuntu.22.04~jammy
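
The two pinned versions quoted in this thread (2.21.0-1~debian.11~bullseye and 2.21.0-1~ubuntu.22.04~jammy) appear to follow the pattern <version>~<ID>.<VERSION_ID>~<VERSION_CODENAME>. Assuming that pattern holds for your distribution (check with apt list -a docker-compose-plugin), a small helper can build the string from the values in /etc/os-release. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: build the apt version string for docker-compose-plugin.
# Assumption: the pattern <version>~<ID>.<VERSION_ID>~<VERSION_CODENAME>,
# inferred from the two version strings quoted in this thread.
compose_pkg_version() {
    # $1 = package version (e.g. 2.21.0-1), $2 = ID, $3 = VERSION_ID, $4 = VERSION_CODENAME
    printf '%s~%s.%s~%s\n' "$1" "$2" "$3" "$4"
}
```

On the target host, sourcing /etc/os-release provides ID, VERSION_ID and VERSION_CODENAME, so the install becomes sudo apt-get install "docker-compose-plugin=$(compose_pkg_version 2.21.0-1 "$ID" "$VERSION_ID" "$VERSION_CODENAME")". Running sudo apt-mark hold docker-compose-plugin afterwards should keep a later upgrade from replacing the pinned version.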

$ docker version
Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:02:33 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:33 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker compose version
Docker Compose version v2.28.1

ndeloof commented 4 months ago

@kulpsin docker network ls indeed does not detect overlay networks created on another swarm node (not sure about the reason, but that's what we get with the engine API) until it is used by some container. So Docker Compose can't check network existence, but should detect swarm is enabled and ignore error (assuming container create will fail if there's an actual missing network). See https://github.com/docker/compose/blob/11d5ecdc75ab96214f35db4cdc0361ee080d1c07/pkg/compose/create.go#L1334-L1340

I'm not sure why this doesn't work as expected. I need to set up a test environment and try to reproduce this bug.
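
The intended behaviour described in this comment (look the network up and, if it is missing but swarm is active, skip the error and let container creation decide) can be sketched in shell. This only illustrates the decision; it is not the Go code in create.go, and the function name and messages are made up. It relies on docker info --format '{{.Swarm.LocalNodeState}}', which prints active on a node that is part of a swarm.

```shell
#!/bin/sh
# Illustration only: NOT Compose's implementation. Names and messages are hypothetical.

# Returns 0 if it is reasonable to proceed with an external network, 1 otherwise.
check_external_network() {
    net="$1"
    # 1. Is the network visible to the local engine (exact name match)?
    if docker network ls --filter "name=$net" --format '{{.Name}}' | grep -Fxq -- "$net"; then
        return 0
    fi
    # 2. Not visible: on a swarm node it may exist on another node, so do not fail yet.
    if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" = "active" ]; then
        echo "network $net not visible locally, swarm is active: deferring to the engine" >&2
        return 0
    fi
    # 3. No swarm: treat the network as missing.
    echo "network $net declared as external, but could not be found" >&2
    return 1
}
```

If the network really does not exist anywhere in the swarm, the later container create call is expected to fail, so skipping the early check does not hide the error.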

jhrotko commented 4 months ago

With the original compose.yml, Compose generates a network named swarm-netword-overlay_swarm-overlay:

Screenshot 2024-07-18 at 15 57 57

...and then the worker would not be able to find the external network as expected

Adding name: swarm-overlay to the network definition made docker compose up -d work for me on v2.28.1:

...
  service:
    image: service-image
    container_name: service
    networks:
      - swarm-overlay
    restart: unless-stopped
...
networks:
  swarm-overlay:
    name: swarm-overlay  # <-- added
    attachable: true
    driver: overlay
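
When a network definition has no name:, Compose prefixes the network key with the project name, which is why a worker looking up plain swarm-overlay can miss it. A minimal sketch for spotting such prefixed names on the manager; the function name is illustrative.

```shell
#!/bin/sh
# Sketch: list networks named "<something>_<key>", i.e. likely project-prefixed.
# The function name is illustrative.
find_prefixed_networks() {
    key="$1"
    docker network ls --format '{{.Name}}' | while IFS= read -r name; do
        case "$name" in
            *_"$key") printf '%s\n' "$name" ;;
        esac
    done
}
```

For example, find_prefixed_networks swarm-overlay would print a name such as swarm-netword-overlay_swarm-overlay if the manager's file created the network without name:.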

With name: set, docker network ls shows the following result:

Screenshot 2024-07-18 at 16 00 19

and now the worker is referencing the right network

Screenshot 2024-07-18 at 16 07 00

michaelmcandrew commented 4 months ago

To flesh out my steps to reproduce a bit more, since they are slightly different from the ones mentioned above, I created a swarm network on the lead node with docker network create --driver overlay test --attachable.

This network was not visible on the worker node (expected I think because nothing was connected).

However, I was not able to connect to it with the below networks section in a compose.yaml on the worker node.

networks:
  test:
    external: true

I created the following container on the worker node docker run -dit --name keep-alive --network test --restart=always alpine

I was then able to connect using the above networks section in a compose.yaml on the worker node.

Hope that helps with the reproduction!

tuxthepenguin84 commented 1 month ago

I created the following container on the worker node docker run -dit --name keep-alive --network test --restart=always alpine

Thanks this worked for me.

tuxthepenguin84 commented 1 month ago

Is this a bug in compose? I would expect some degree of feature parity between docker and docker compose.

ndeloof commented 1 month ago

@tuxthepenguin84 docker compose does some client-side validation before running containers, and as such checks that the target network exists. docker run has no preliminary validation and simply fails if the network is not found. Can you please confirm the issue persists with the latest version? AFAIK we had a fix for it.

tuxthepenguin84 commented 1 month ago

It appears to me the issue still persists, at least for me and my use case.

Docker Compose version v2.29.7

Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:41:00 2024
 OS/Arch:           linux/amd64
 Context:           default

[+] Running 3/3
 ✔ Container proxy2-nginx-exporter  Removed                                                                                                        0.5s
 ✔ Container proxy2                 Removed                                                                                                        1.8s
 ✔ Network proxy_default            Removed                                                                                                        0.4s
[+] Running 2/3
 ✔ Network proxy_default            Created                                                                                                        0.8s
 ⠸ Container proxy2                 Starting                                                                                                       2.3s
 ✔ Container proxy2-nginx-exporter  Started                                                                                                        2.0s
Error response from daemon: could not find a network matching network mode jf5y7525s7qqt0333lfolwruk: network jf5y7525s7qqt0333lfolwruk not found

[
    {
        "Name": "ai",
        "Id": "jf5y7525s7qqt0333lfolwruk",
        "Created": "2024-10-06T20:26:15.848600039Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.3.0/24",
                    "Gateway": "10.0.3.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4099"
        },
        "Labels": null
    }
]

The network is there.

services:
  proxy2:
    image: nginx:latest
    container_name: proxy2
    restart: unless-stopped
    networks: ['ai', 'collaboration', 'core', 'garage', 'health', 'iot', 'olivetin', 'media', 'metrics', 'proxy', 'security', 'sprinklers']
    ports:
      - 443:443
    volumes:
      - /containers/proxy/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - /containers/proxy/nginx/conf.d:/etc/nginx/conf.d:ro
      - /containers/proxy/dhparams.pem:/etc/ssl/dhparams.pem:ro
      - /certs/delchampsio/fullchain.pem:/etc/ssl/delchampsio/fullchain.pem:ro
      - /certs/delchampsio/privkey.pem:/etc/ssl/delchampsio/privkey.pem:ro
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro

  proxy2-nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    container_name: proxy2-nginx-exporter
    restart: unless-stopped
    ports:
      - 9113:9113
    command:
      - --nginx.scrape-uri=http://proxy2:8080/nginx_status

networks:
  ai:
    name: ai
    driver: overlay
    external: true
  collaboration:
    name: collaboration
    driver: overlay
    external: true
  core:
    name: core
    driver: overlay
    external: true
  garage:
    name: garage
    driver: overlay
    external: true
  health:
    name: health
    driver: overlay
    external: true
  iot:
    name: iot
    driver: overlay
    external: true
  olivetin:
    name: olivetin
    driver: overlay
    external: true
  media:
    name: media
    driver: overlay
    external: true
  metrics:
    name: metrics
    driver: overlay
    external: true
  proxy:
    name: proxy
    driver: overlay
    external: true
  security:
    name: security
    driver: overlay
    external: true
  sprinklers:
    name: sprinklers
    driver: overlay
    external: true

If I run the following and get a container up and running on that "missing" network, I can get the container started with compose:

docker run -dit --rm --name dummy-network-container --network ai alpine

Let me know if you need more info or want me to try something, I'm happy to help out and work on getting this fixed.

ndeloof commented 1 month ago

@tuxthepenguin84 could you please give the binary from https://github.com/docker/compose/pull/12233 a try (binaries are available at the bottom of https://github.com/docker/compose/actions/runs/11513518822)?

This adds some debug output to the network resolution logic that will help diagnose this issue. Run it as docker compose --verbose --progress=plain up

tuxthepenguin84 commented 1 month ago

Thanks I'll try that out and report back.

aek commented 1 week ago

@ndeloof I have the issue with the compose plugin version v2.27.0 running on Ubuntu Server 24.04 on ARM (aarch64).

Here is the output of testing the binary from #12233

/etc/salt/docker/test # /etc/salt/docker/docker-compose-linux-aarch64 --verbose --progress=plain up -d
DEBU[0000] search network "axel5" by name returned: 0   
DEBU[0000] search network "axel5" by ID succeeded       
DEBU[0000] networks matching name "axel5" after strict filtering: 0 
DEBU[0000] no match, swarm is enabled: true             
 Container test-dummy-1  Recreate
DEBU[0005] otel error                                    error="<nil>"
 Container test-dummy-1  Recreated
 Container test-dummy-1  Starting
 Container test-dummy-1  Started
DEBU[0010] otel error                                    error="<nil>"
DEBU[0010] otel error                                    error="<nil>"

This version properly creates the network

Here is my docker info output

/etc/salt/docker/test # docker info
Client:
 Version:    26.1.5
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 11
  Running: 6
  Paused: 0
  Stopped: 5
 Images: 13
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: mi4aclsip2vfc0fmdk0lizvoi
  Is Manager: false
  Node Address: 172.31.41.5
  Manager Addresses:
   172.31.45.225:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 57f17b0a6295a39009d861b89e3b3b87b005ca27
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1016-aws
 Operating System: Ubuntu 24.04.1 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 7.582GiB
 Name: ip-172-31-41-5
 ID: aebad7d3-d242-435a-a215-9e10a8a1a6b1
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Labels:
  salt-minion=dd6de55b-6f41-4cfd-924f-1231ed03995b
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I will try with the latest version and report back.

aek commented 1 week ago

My issue was that I have 2 versions of docker compose:

I fixed it by installing the latest from edge like this:

apk add docker-cli docker-cli-compose --repository=https://dl-cdn.alpinelinux.org/alpine/edge/community
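
A stale plugin binary can shadow a newer one, which matches the situation described here. A minimal sketch that lists every docker-compose plugin binary in a set of directories; the function name is hypothetical, and the default directory list is an assumption based on the CLI plugin locations Docker documents for Linux (one of them, /usr/libexec/docker/cli-plugins, appears in the docker info output earlier in this thread).

```shell
#!/bin/sh
# Sketch: print every executable docker-compose plugin found in the given directories.
# With no arguments, falls back to commonly documented plugin directories (assumption).
find_compose_plugins() {
    if [ "$#" -eq 0 ]; then
        set -- "$HOME/.docker/cli-plugins" \
            /usr/local/lib/docker/cli-plugins \
            /usr/local/libexec/docker/cli-plugins \
            /usr/lib/docker/cli-plugins \
            /usr/libexec/docker/cli-plugins
    fi
    for dir in "$@"; do
        if [ -x "$dir/docker-compose" ]; then
            printf '%s\n' "$dir/docker-compose"
        fi
    done
}
```

If find_compose_plugins prints more than one path, each binary can be run directly with the version argument to see which copy is outdated.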