docker / machine

Machine management for a container-centric world
https://docs.docker.com/machine/
Apache License 2.0

RHEL 7.2 provisioning fails on cloud drivers #3180

Open ahmetb opened 8 years ago

ahmetb commented 8 years ago

It all started a couple of months ago, when we first noticed connectivity issues to port 2376 on Red Hat Enterprise Linux images in Azure, and this was outside docker-machine. However, I can clearly reproduce this error on Google Compute Engine as well, and since that driver is already released, I will provide the repro for it.

GCE repro steps

docker-machine -D create -d google  --google-project awesome-azure --google-machine-image 'https://www.googleapis.com/compute/v1/projects/rhel-cloud/global/images/rhel-7-v20160303' rhel-gce
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with redhat...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127
Error running SSH command: exit status 127

Machine creation fails while running a command over SSH, repeatedly reporting "Error running SSH command: exit status 127". Here is the debug output: https://gist.github.com/ahmetalpbalkan/7776a1fb01e36681797d
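
For context on the error above: in POSIX shells, exit status 127 conventionally means the command was not found. The sketch below only demonstrates that convention with a made-up command name; it does not show which command docker-machine actually failed to run on the VM, which the output above does not reveal.

```shell
# Run a deliberately nonexistent command (hypothetical name) in a child shell
# and capture the exit status the shell reports for "command not found".
sh -c 'no-such-command-3180' 2>/dev/null
status=$?
echo "exit status: $status"   # prints "exit status: 127" in POSIX shells
```

If the same status comes back over SSH, a tool missing from the image is one plausible explanation worth ruling out.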

When I SSH into the VM, I can see Docker running just fine:

➜  ~ dm ssh rhel-gce
[docker-user@rhel-gce ~]$ ps aux | grep docker
root      2001  0.1  0.6 545960 24816 ?        Ssl  00:06   0:00 /usr/bin/docker daemon -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver devicemapper --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label provider=google
[docker-user@rhel-gce ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)

The Docker client also connects to it just fine:

➜  ~ eval $(dm env rhel-gce)
➜  ~ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Azure repro steps

Oddly enough, Azure fails a bit differently. Again, Docker is installed and listening on port 2376, but this time I cannot reach the machine using dm env.

(with the new driver at https://github.com/docker/machine/pull/3159)
$ dm -D create -d azure --azure-image 'redhat:rhel:7.2:latest' rhel
Detecting the provisioner...
Provisioning with redhat...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Error creating machine: Error checking the host: Error checking and/or regenerating the certs: There was an error validating certificates for host "23.101.192.45:2376": dial tcp 23.101.192.45:2376: i/o timeout
$ dm env rhel
Error checking TLS connection: Error checking and/or regenerating the certs:
  There was an error validating certificates for host "23.101.192.45:2376":
  dial tcp 23.101.192.45:2376: i/o timeout
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which will stop running containers.
➜  ~ dm ssh rhel
Last login: Thu Mar 10 23:52:16 2016 from c-67-168-133-5.hsd1.wa.comcast.net
[docker-user@rhel ~]$ ps aux | grep docker
root      4099  0.1  0.6 477508 22440 ?        Ssl  23:52   0:00 /usr/bin/docker daemon -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver devicemapper --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label provider=azure

Findings so far

  1. It doesn't happen on Ubuntu, CoreOS, or CentOS, so it may be specific to RHEL.
  2. It's not just Azure; it happens on GCE too, but manifests differently.
  3. Apps running on other ports (e.g. port 80) work just fine. There's something about port 2376.
  4. This is not just a docker-machine issue. We can reproduce it with azure-docker-extension as well.
  5. On Azure, telnet to the IP on port 2376 works, but the docker client doesn't (it hangs).
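
Finding 5 suggests separating "the TCP port accepts connections" from "the TLS handshake completes". A minimal sketch of that two-step check follows. The helper names probe_tcp and probe_tls are hypothetical, the timeouts are arbitrary, and the certificate directory is assumed to be the one that docker-machine env exports as DOCKER_CERT_PATH. It also assumes bash, coreutils timeout, and the openssl CLI are available on the client.

```shell
# Step 1: can a TCP connection be opened at all? (roughly what telnet checks)
# Uses bash's /dev/tcp pseudo-device; timeout bounds the wait to 5 seconds.
probe_tcp() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Step 2: does a TLS handshake complete when presenting the client certs?
# $3 is the cert directory (assumed to hold ca.pem, cert.pem, key.pem).
# A hang is cut off by timeout, which then exits with status 124.
probe_tls() {
  timeout 10 openssl s_client -connect "$1:$2" \
    -CAfile "$3/ca.pem" -cert "$3/cert.pem" -key "$3/key.pem" \
    </dev/null >/dev/null 2>&1
}

# Example usage against the Azure host from the report (not run here):
#   probe_tcp 23.101.192.45 2376 && echo "tcp ok"
#   probe_tls 23.101.192.45 2376 "$DOCKER_CERT_PATH"; echo "tls status: $?"
```

If probe_tcp succeeds while probe_tls times out, that would match the "telnet works, docker client hangs" symptom and point at something between connection setup and the handshake rather than a closed port.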

ahmetb commented 8 years ago

Is there a way to get hold of a Red Hat representative to take a look at this issue? It is clearly broken and easily reproducible.

ahmetb commented 8 years ago

So there's no rhel people on githubz I guess ;-)

ahmetb commented 8 years ago

well nobody seems to be giving a damn. so closing.

runcom commented 8 years ago

:/ Ahmet, can you re-open this? I'll try to understand more, but I've never played with docker-machine.

sferich888 commented 8 years ago

Is this a duplicate of https://github.com/docker/machine/issues/2480? Does installing netstat on the image help resolve this?

If so, this might just be a case where RHEL on these cloud platforms does not have this package installed by default, and it would need to be added.
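
One way to test that theory is to check for the binary from inside the VM (for example after dm ssh rhel-gce). This is a sketch only: have_cmd is a hypothetical helper name, and the install hint assumes the image uses yum and that netstat comes from the net-tools package, as it does on stock RHEL 7.

```shell
# Succeeds when the named command can be found on PATH.
# `command -v` is the POSIX way to look a command up without running it.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd netstat; then
  echo "netstat: present"
else
  # Assumption: on RHEL 7 the fix would be `sudo yum install -y net-tools`.
  echo "netstat: missing"
fi
```

If netstat turns out to be missing, installing it and re-running the create step would confirm or rule out the duplicate.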