Open ugal1 opened 1 year ago
Some interesting insights here: https://github.com/docker/compose/issues/8856#issuecomment-1346200394
version: '3.8'
services:
  busybox1:
    image: busybox
    command: "sleep 1d"
  busybox2:
    image: busybox
    command: "sleep 1d"
    depends_on: [busybox1]
Works fine: the services stop correctly without a disconnection. This is a "working workaround", but it makes no sense to have to do it this way.
This is probably caused by https://github.com/docker/cli/pull/3900, which is included in the latest release. Can you please confirm whether the issue persists with docker compose v2.17.3?
Hi, I'm seeing this occasionally as well with my own docker compose files using an SSH context. Even a compose file as simple as this one fails every now and then: https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-compose-yaml
What I've found, or didn't realize earlier, is that docker-compose uses a single SSH connection per defined service (so for the attached compose file, that would be 16 individual connections).
The error I keep getting is shown below. Which service fails usually varies between runs (in this example, service-13 failed).
It always fails with the following error: stderr=kex_exchange_identification: read: Connection reset by peer
It can happen on both docker-compose up and docker-compose down.
error during connect: Post "http://docker.example.com/v1.42/containers/create?name=playground-service-13-1": command [ssh -- homeserver docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by 192.168.1.253 port 22
Information about my server: https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-information-homeserver
Information about my workstation: https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-information-workstation
@ndeloof I originally tested it with v2.17.3, updated today to v2.18.1 and it is unfortunately still happening.
Edit 1: After playing around with the workaround mentioned in https://github.com/docker/compose/issues/8856#issuecomment-1346200394 (setting each service to depend on the previous service), I find that the attached compose file works (with caveats). The reason this works is that docker-compose now only creates a new SSH connection when the previous one is finished.
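To make the chaining workaround concrete, here is a minimal Python sketch that builds a compose-style structure in which every service depends on the one before it. The `service-N` names, the busybox image and the `chain_services` helper are placeholders for illustration only; this is not how the linked gist was generated.

```python
import json


def chain_services(names, image="busybox", command="sleep 1d"):
    """Build a compose-style mapping where each service depends on the previous one.

    The image and command defaults are placeholders taken from the small
    busybox example earlier in this thread.
    """
    services = {}
    previous = None
    for name in names:
        service = {"image": image, "command": command}
        if previous is not None:
            # Chain to the previous service so startup happens one at a time.
            service["depends_on"] = [previous]
        services[name] = service
        previous = name
    return {"services": services}


if __name__ == "__main__":
    # 16 hypothetical services, mirroring the size of the compose file above.
    compose = chain_services([f"service-{i}" for i in range(1, 17)])
    print(json.dumps(compose, indent=2))
```

The trade-off is the one already mentioned: startup becomes sequential, so this only hides the connection burst rather than removing its cause.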
Edit 2:
I played around a bit further and I'm finding that even when setting parallelism using docker-compose --parallel 4, docker-compose still opens an SSH connection to the remote server for every service at once, meaning that for the linked docker-compose file with 16 services, it opens all 16 connections even though it should be waiting.
I had not been able to look further into this until now. The issue as seen from sshd is that the connections are being throttled because too many connections are made within a small window.
Jun 07 20:37:42 homeserver sshd[2268386]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[493]: exited MaxStartups throttling after 00:00:01, 1 connections dropped
Jun 07 20:37:42 homeserver sshd[493]: error: beginning MaxStartups throttling
Jun 07 20:37:42 homeserver sshd[493]: drop connection #11 from [192.168.1.201]:49814 on [192.168.1.253]:22 past MaxStartups
Jun 07 20:37:42 homeserver sshd[2268377]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[2268525]: Connection closed by 192.168.1.201 port 49706 [preauth]
Jun 07 20:37:42 homeserver sshd[2268388]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[2268526]: ssh_dispatch_run_fatal: Connection from 192.168.1.201 port 49712: Broken pipe [preauth]
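For anyone who wants to check their own sshd log for the same symptom, here is a rough Python sketch that counts the "past MaxStartups" drop messages per client address. The regular expression is written against the log lines pasted above and is an assumption about their shape; other OpenSSH versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the "drop connection ... past MaxStartups" line above.
DROP_RE = re.compile(
    r"drop connection #\d+ from \[(?P<client>[^\]]+)\]:\d+ "
    r"on \[[^\]]+\]:\d+ past MaxStartups"
)

# Sample lines copied from the log excerpt in this comment.
SAMPLE_LOG = [
    "Jun 07 20:37:42 homeserver sshd[493]: exited MaxStartups throttling after 00:00:01, 1 connections dropped",
    "Jun 07 20:37:42 homeserver sshd[493]: error: beginning MaxStartups throttling",
    "Jun 07 20:37:42 homeserver sshd[493]: drop connection #11 from [192.168.1.201]:49814 on [192.168.1.253]:22 past MaxStartups",
]


def count_drops(lines):
    """Count dropped connections per client address."""
    drops = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            drops[match.group("client")] += 1
    return drops


if __name__ == "__main__":
    for client, count in count_drops(SAMPLE_LOG).items():
        print(f"{client}: {count} dropped connection(s)")
```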
This is caused by the throttling feature in OpenSSH (https://man.openbsd.org/sshd_config#MaxStartups).
After increasing MaxStartups in /etc/ssh/sshd_config, this issue no longer happens, but at the cost of security.
The default value of sshd's MaxStartups (i.e. allow up to 10 concurrent unauthenticated ssh connections, then randomly drop extra ones with increasing probability until there are 100 concurrent attempts, at which point all extra ones are rejected) should honestly be mentioned in the docker compose docs. Everyone starts googling for a solution to this problem the moment they have more than 10 docker containers defined in their docker-compose.yml file.
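As a rough illustration of that default, here is a small Python sketch of the start:rate:full rule as the sshd_config man page describes it: no drops below `start`, a drop probability of `rate` percent at `start`, rising linearly to 100% at `full`. The defaults 10:30:100 come from the man page; the function itself is my own approximation and not sshd's actual code.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=100):
    """Approximate chance that sshd refuses a new connection.

    `unauthenticated` is the number of connections that have not finished
    authenticating yet. Defaults follow the documented MaxStartups 10:30:100.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    # Linear interpolation between rate% at `start` and 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100


if __name__ == "__main__":
    for pending in (8, 10, 16, 55, 100):
        print(pending, round(drop_probability(pending), 3))
```

Under this model, a compose project that opens more than 10 connections at once already has a sizeable chance of losing at least one of them, which matches the "fails every now and then" behaviour reported above.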
I'm pasting my specific CLI error so that Google may index it and the next person looking for it doesn't spend hours finding it:
unable to get image 'nginx:1.15-alpine': error during connect: Get "http://docker.example.com/v1.42/images/nginx:1.15-alpine/json": command [ssh -o ConnectTimeout=30 -l vagrant -- 192.168.33.100 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Docker Compose by nature runs multiple Docker API calls concurrently and can indeed quickly reach the ssh limits.
Docker ssh support is actually implemented by running docker system dial-stdio on the remote host, and doesn't offer multiplexing that would let us run multiple API calls over a single ssh connection. I don't expect we'll get any short-term workaround for this limitation.
@ndeloof
Thanks for explaining this.
Do you know whether it would be possible for Docker Compose to respect the --parallel parameter (e.g. --parallel 1) and not initiate more ssh connections than that parameter allows? I'm just wondering what sysadmins could do if they didn't have the option of increasing the sshd limits in the sshd_config file.
Have you tried enabling ssh multiplexing on your client? This would allow the docker compose command to rely on a single ssh session to access the remote docker engine.
See https://github.com/docker/compose/issues/8191#issuecomment-1448646228 for context.
I tried to re-enable ssh multiplexing being set up automatically by the docker CLI, but my PR fails for some non-obvious reason. Will need to wait for more eyes to help diagnose this :) https://github.com/docker/cli/pull/4699
@ndeloof I just tried with multiplexing enabled in my ssh config towards one of my docker nodes.
ControlPath ~/.ssh/controlmasters-%r@%h:%p
ControlMaster auto
ControlPersist 10m
Note: compared to the ssh multiplexing example, I changed the path to ~/.ssh/controlmasters-%r@%h:%p since the example requires the folder to exist before it works.
~I see that the server only receives 1 ssh connection as expected~ ~Correction: it seems like it still receives each connection individually.~
Correction 2: I was too quick to jump to conclusions; I hadn't switched the context. Using the example compose file below, it seems to work fine with multiplexing.
https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-compose-yaml
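Since it is easy to test against the wrong setup, here is a small Python sketch for checking which multiplexing options the ssh client would actually apply to a host. It assumes `ssh -G <host>` is available (it prints the effective client configuration as one lowercase `key value` pair per line); `homeserver` is just the host alias from the error output earlier, and the helper names are made up for this example.

```python
import subprocess

MUX_KEYS = ("controlmaster", "controlpath", "controlpersist")


def parse_mux_options(ssh_g_output):
    """Extract the multiplexing-related options from `ssh -G` style output."""
    options = {}
    for line in ssh_g_output.splitlines():
        key, _, value = line.strip().partition(" ")
        if key.lower() in MUX_KEYS:
            options[key.lower()] = value
    return options


def effective_mux_options(host):
    """Ask the local ssh client for its effective configuration for `host`."""
    result = subprocess.run(
        ["ssh", "-G", host], capture_output=True, text=True, check=True
    )
    return parse_mux_options(result.stdout)


if __name__ == "__main__":
    # Parsing a hand-written sample, so this runs without ssh being installed.
    sample = "user husjon\ncontrolmaster auto\ncontrolpath /tmp/example-socket\n"
    print(parse_mux_options(sample))
    # On a real client: print(effective_mux_options("homeserver"))
```

If `controlmaster` does not show up as `auto` for the host your Docker context points at, the snippet above is probably not being picked up for that host.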
@ndeloof
I reverted the /etc/sshd_config::MaxStartups config change and enabled ssh multiplexing using the .ssh/config snippet that husjon mentioned, and I can confirm that enabling ssh multiplexing for my "remote" docker machine "ssh hosts" works. As an added bonus, docker compose up -d seems to run marginally faster: the docker containers reach the "Created" state all at once. This is most likely due to not having to go through the same ssh handshake over 10 times.
The bottom line is that the PR you proposed not only fixes the problem, but also makes things run in a more proper way (because one could argue that "spamming a remote sshd with logins" is not entirely proper, given its resemblance to a minor DDoS attack).
Have a good day.
I'd like to add that the multiplexing trick does not work on Windows, since its OpenSSH implementation does not support the feature. Windows users can therefore only resort to the MaxStartups server-side config.
I don't know how Docker Desktop users are doing, but in my case with remote Linux Engines this issue is highly impactful as it consistently breaks multi-container deployments.
Since switching to SSH authentication from certificates (RIP RancherOS), I noticed a substantial slowdown in all docker commands (both direct CLI and compose), along with server-side SSH logs like kex_exchange_identification: connection reset by peer. I don't know if it's a bug in Windows' OpenSSH implementation or in Docker, but it's quite an inconvenience.
Is there a Docker-side code change that can resolve the situation? If so, can we expect to find it soon in an upcoming release? (maybe after #11165 is also fixed)
Description
Trying to "down" a compose project over ssh fails. There need to be two or more services in the compose file (with a single service it works).
Similar issue here: https://github.com/docker/compose/issues/9185
Steps To Reproduce
1: docker-compose.yml
2: Run services, then stop
3: An error occurs
Compose Version
Docker Environment
Anything else?
No response