[BUG] Can not delete Container over ssh if there is more than one

ugal1 commented 1 year ago

Description

Running "down" on a compose project over ssh fails when the project contains two or more services (a single service works).

Similar issue here : https://github.com/docker/compose/issues/9185

Steps To Reproduce

1: docker-compose.yml

version: '3.8'
services:

  busybox1:
    image: busybox
    command: "sleep 1d"

  busybox2:
    image: busybox
    command: "sleep 1d"

2: Run services, then stop

docker-compose -H ssh://USER@remote up -d && docker-compose -H ssh://USER@remote down

3: An error occurs

[+] Running 2/2
 - Container composev2-busybox2-1  Started                                                                            1.2s
 - Container composev2-busybox1-1  Started                                                                            2.8s
[+] Running 1/2
 - Container composev2-busybox2-1  Removed                                                                           10.5s
 - Container composev2-busybox1-1  Error while Removing                                                              11.8s
error during connect: Delete "http://docker/v1.41/containers/9ec831308cba7cd8c68a698c7e1db9985974033918f58def885d18ba9b9059e4?force=1": command [ssh -l USER -- docker-vm docker system dial-stdio] has exited with exit status 1, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

Compose Version

Docker version 20.10.12, build 20.10.12-0ubuntu4

Docker Compose version v2.14.0

Docker Environment

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 125
  Running: 6
  Paused: 0
  Stopped: 119
 Images: 90
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-56-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 30.99GiB
 Name: hugues-tribu
 ID: OLST:TX4F:WU5G:FY2S:6OSC:HPJ6:E33R:UP6P:OAQU:74T7:VPR2:UEBB
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Anything else?

No response

ugal1 commented 1 year ago

Some interesting insights here : https://github.com/docker/compose/issues/8856#issuecomment-1346200394

ugal1 commented 1 year ago

version: '3.8'
services:

  busybox1:
    image: busybox
    command: "sleep 1d"

  busybox2:
    image: busybox
    command: "sleep 1d"
    depends_on: [busybox1]

This works fine: the services stop correctly without the connection dropping. It is a "working workaround", but it makes no sense to have to do it this way.

ugal1 commented 1 year ago

:eyes:

ndeloof commented 1 year ago

This is probably caused by https://github.com/docker/cli/pull/3900, which is included in the latest release. Can you please confirm whether the issue persists with docker compose v2.17.3?

husjon commented 1 year ago

Hi, I'm seeing this occasionally as well with my own docker compose files using an SSH context. Even a simple compose file like this one fails every now and then: https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-compose-yaml

What I've found, or hadn't realized earlier, is that docker-compose opens a separate SSH connection per defined service (so for the attached compose file, that would be 16 individual connections).

The error I keep getting is shown below. The failing service usually differs from run to run (in this example, service-13 failed), but it always fails with stderr=kex_exchange_identification: read: Connection reset by peer. It can happen on both docker-compose up and docker-compose down.

error during connect: Post "http://docker.example.com/v1.42/containers/create?name=playground-service-13-1": command [ssh -- homeserver docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by 192.168.1.253 port 22

Information about my server: https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-information-homeserver

Information about my workstation: https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-information-workstation

@ndeloof I originally tested it with v2.17.3, updated today to v2.18.1 and it is unfortunately still happening.

Edit 1: After playing around with the workaround mentioned in https://github.com/docker/compose/issues/8856#issuecomment-1346200394 (setting each service to depend on the previous one), I find that the attached compose file works (with caveats). It works because docker-compose now only creates a new SSH connection once the previous one has finished.

Edit 2: I played around a bit further and found that even when setting parallelism with docker-compose --parallel 4, docker-compose still opens an SSH connection to the remote server for every service at once. With the linked docker-compose file of 16 services, it opens all 16 connections even though it should be waiting.

husjon commented 1 year ago

I have not been able to look further into this until now. Seen from sshd, the connections are being throttled because too many are made within a short window.

Jun 07 20:37:42 homeserver sshd[2268386]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[493]: exited MaxStartups throttling after 00:00:01, 1 connections dropped
Jun 07 20:37:42 homeserver sshd[493]: error: beginning MaxStartups throttling
Jun 07 20:37:42 homeserver sshd[493]: drop connection #11 from [192.168.1.201]:49814 on [192.168.1.253]:22 past MaxStartups
Jun 07 20:37:42 homeserver sshd[2268377]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[2268525]: Connection closed by 192.168.1.201 port 49706 [preauth]
Jun 07 20:37:42 homeserver sshd[2268388]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[2268526]: ssh_dispatch_run_fatal: Connection from 192.168.1.201 port 49712: Broken pipe [preauth]

This is caused by the throttling feature in OpenSSH (https://man.openbsd.org/sshd_config#MaxStartups).

After increasing MaxStartups in /etc/ssh/sshd_config the issue no longer happens, but at the cost of security.
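For anyone wanting to check whether the same throttling is hitting their own host, here is a minimal sketch that counts dropped connections in sshd log output. The helper name count_dropped is made up for illustration, and it assumes the log lines contain the phrase "past MaxStartups" as in the excerpt above; other OpenSSH versions may word the message differently.

```shell
# Hypothetical helper: count sshd log lines that report a connection
# dropped by MaxStartups throttling. Reads log text from stdin.
count_dropped() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # the count is 0, so "|| true" keeps the helper usable under "set -e".
  grep -c 'past MaxStartups' || true
}

# Example usage (assumes a systemd host; the unit may be "ssh" or "sshd"
# depending on the distribution):
#   journalctl -u ssh --since today | count_dropped
```

A non-zero count right after a failed docker-compose run is a strong hint that the errors come from sshd throttling rather than from Docker itself.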

d-ph commented 10 months ago

The default value of sshd's MaxStartups (i.e. allow up to 10 concurrent unauthenticated ssh connections, then randomly close extra ones until there are 100 concurrent attempts, at which point all extra ones are rejected outright) should honestly be mentioned in the docker compose docs. Everyone starts googling for a solution to this problem the moment they have more than 10 docker containers defined in their docker-compose.yml file.

I'm pasting my specific cli error so that google may index it and the next person looking for it doesn't spend hours finding it:
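To put numbers on that default, here is a rough sketch of the drop probability as a function of pending unauthenticated connections. It assumes the start:rate:full form of MaxStartups with the 10:30:100 default and the linear ramp described in the sshd_config man page. The function name drop_percent is invented for this example, and the integer arithmetic only approximates what sshd does.

```shell
# Hypothetical helper: approximate chance (in percent) that sshd refuses
# a new connection, given the number of unauthenticated connections
# already pending. Arguments: n [start] [rate] [full], defaulting to
# the 10:30:100 MaxStartups default.
drop_percent() {
  n=$1
  start=${2:-10}
  rate=${3:-30}
  full=${4:-100}
  if [ "$n" -lt "$start" ]; then
    echo 0      # below "start": nothing is dropped
  elif [ "$n" -ge "$full" ]; then
    echo 100    # at or above "full": everything is refused
  else
    # Linear ramp from "rate" percent at "start" up to 100 percent at "full".
    echo $(( rate + (100 - rate) * (n - start) / (full - start) ))
  fi
}

drop_percent 9    # prints 0
drop_percent 10   # prints 30
drop_percent 16   # prints 34
```

So with 16 services connecting at nearly the same time, each connection past the tenth has roughly a one-in-three chance of being reset, which fits the "fails every now and then, on a different service each time" behaviour reported earlier in the thread.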

unable to get image 'nginx:1.15-alpine': error during connect: Get "http://docker.example.com/v1.42/images/nginx:1.15-alpine/json": command [ssh -o ConnectTimeout=30 -l vagrant -- 192.168.33.100 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer

ndeloof commented 10 months ago

Docker Compose by nature runs multiple docker API calls concurrently and can indeed quickly reach the ssh limits. Docker's ssh support is actually implemented by running docker system dial-stdio on the remote host, and it doesn't offer multiplexing that would let multiple API calls share a single ssh connection. I don't expect we'll get any short-term workaround for this limitation.

d-ph commented 10 months ago

@ndeloof

Thanks for explaining this.

Do you know whether it would be possible for Docker Compose to respect the --parallel [e.g. 1] parameter and not initiate more ssh connections than that parameter allows? I'm just wondering what sysadmins could do if they are not able to increase the sshd limits in the sshd_config file.

ndeloof commented 10 months ago

Have you tried enabling ssh multiplexing on your client? This would allow the docker compose command to rely on a single ssh session to access the remote docker engine.

see https://github.com/docker/compose/issues/8191#issuecomment-1448646228 for context

ndeloof commented 10 months ago

I tried to have the docker CLI enable ssh multiplexing automatically again, but my PR fails for some non-obvious reason. Will need to wait for more eyes to help diagnose this :) https://github.com/docker/cli/pull/4699

husjon commented 10 months ago

@ndeloof I just tried with multiplexing enabled in my ssh config towards one of my docker nodes.

ControlPath ~/.ssh/controlmasters-%r@%h:%p
ControlMaster auto
ControlPersist 10m

Note: compared to the example from ssh multiplexing, I changed the path to ~/.ssh/controlmasters-%r@%h:%p since the example requires the folder to exist before it works.

~I see that the server only receives 1 ssh connection as expected~ ~Correction: it seems like it still receives each connection individually.~

Correction 2: I was too quick to jump to conclusions; I hadn't switched the context. Using the example compose file below, it seems to work fine using multiplexing.

https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-compose-yaml
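In case it helps others, the multiplexing settings can be scoped to just the docker host rather than applied to every ssh connection. The sketch below appends such a stanza to an ssh config file. The helper name add_mux_stanza is made up, and "homeserver" is only the host alias from the error output earlier in this thread; substitute your own.

```shell
# Hypothetical helper: append a per-host multiplexing stanza to an ssh
# client config file. Arguments: host alias, then the config file path
# (defaults to ~/.ssh/config).
add_mux_stanza() {
  host=$1
  config=${2:-$HOME/.ssh/config}
  # "~" and the %r/%h/%p tokens are written literally; ssh expands them
  # itself when it reads the config.
  cat >> "$config" <<EOF
Host $host
    ControlMaster auto
    ControlPath ~/.ssh/controlmasters-%r@%h:%p
    ControlPersist 10m
EOF
}

# Example usage:
#   add_mux_stanza homeserver
#   ssh homeserver true        # first connection becomes the master
#   ssh -O check homeserver    # asks ssh whether a master is running
```

If ssh -O check reports a running master, later docker commands against that host should reuse the existing connection instead of opening one per service.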

d-ph commented 10 months ago

@ndeloof

I reverted the /etc/sshd_config::MaxStartups config change, and enabled ssh multiplexing using the .ssh/config snippet that husjon mentioned, and I can confirm that enabling ssh multiplexing for my "remote" docker machine "ssh hosts" works. As an added bonus: the docker compose up -d seems to run marginally faster -- the docker containers reach the "Created" state all-at-once. This is most likely due to not having to go through the same ssh-handshaking over 10 times.

Bottom line is that the PR you proposed not only fixes the problem, but also makes things run in a more proper way (because one could argue that "spamming a remote sshd with logins" is not entirely proper, given its resemblance to a minor ddos attack).

Have a good day.

LaXiS96 commented 9 months ago

I'd like to add that the multiplexing trick does not work on Windows since its OpenSSH implementation does not support the feature. Windows users can therefore only resort to the MaxStartups server-side config. I don't know how Docker Desktop users are doing, but in my case with remote Linux Engines this issue is highly impactful as it consistently breaks multi-container deployments.

Since switching to SSH authentication from certificates (RIP RancherOS), I noticed a substantial slowdown in all docker commands (both direct CLI and compose) including server-side SSH logs like kex_exchange_identification: connection reset by peer. I don't know if it's a bug in Windows' OpenSSH implementation or Docker, but it's quite an inconvenience.

Is there a Docker-side code change that can resolve the situation? If so, can we expect to find it soon in an upcoming release? (maybe after #11165 is also fixed)
