CollinChaffin opened this issue 6 years ago.
Thanks for the comprehensive report.
Just wanted to provide some additional info. I'm not ruling out an issue with OpenSSH-Win32, but there's a high probability that the problem is specific to boot2docker. If you turn an existing VM into a Docker Machine host (as opposed to running the boot2docker ramdisk), this is not an issue:
# Launch Windows PowerShell 5.1 via 'Run as Administrator' and...
Get-Command ssh
CommandType Name Version Source
----------- ---- ------- ------
Application ssh.exe 7.7.1.0 C:\Program Files\OpenSSH-Win64\ssh.exe
$DockerMachineDir = "C:\Program Files\Docker\Docker\Resources\bin"
if ($($env:Path -split ";") -notcontains $DockerMachineDir) {$env:Path = $DockerMachineDir + ';' + $env:Path}
docker-machine version
docker-machine.exe version 0.14.0, build 89b8332
# Install the MiniLab PowerShell Module (full disclosure, I wrote it)
Install-Module MiniLab
Import-Module MiniLab
# Deploy CentOS 7 VM from https://app.vagrantup.com/centos/boxes/7
$VMName = "CentOS7Test"
$DeployHyperVVagrantBoxSplatParams = @{
    VagrantBox             = "centos/7"
    VagrantProvider        = "hyperv"
    VMName                 = $VMName
    VMDestinationDirectory = "H:\VirtualMachines"
    Memory                 = 2048
    CPUs                   = 1
}
$DeployCentOS7Result = Deploy-HyperVVagrantBoxManually @DeployHyperVVagrantBoxSplatParams
$CentOSIP = $DeployCentOS7Result.VMIPAddress
# (NOTE: The 'Deploy-HyperVVagrantBoxManually' function above downloads 'vagrant_unsecure_key' and
# 'vagrant_unsecure_key.pub' from https://github.com/hashicorp/vagrant/tree/master/keys and places them under "$HOME\.ssh")
ssh -o "StrictHostKeyChecking=no" -o "IdentitiesOnly=yes" -i "$HOME\.ssh\vagrant_unsecure_key" -t vagrant@$CentOSIP "sudo yum install net-tools -y"
docker-machine create --driver generic --generic-ip-address=$CentOSIP --generic-ssh-user vagrant --generic-ssh-key "$("$HOME\.ssh\vagrant_unsecure_key" -replace "\\","/")" $VMName
docker-machine.exe env $VMName
docker-machine.exe env $VMName | Invoke-Expression
@pldmgg - Thanks for also looking at it! I'll remind you, though, that boot2docker IS THE BASIS of the Docker for Windows/Toolbox product and has been for years. Yes, you can take established containers, hack around, and run (some of) them on something other than boot2docker (I'm not convinced all of them would work), but that's somewhat irrelevant, so bear with me. Getting one established non-boot2docker container that you built elsewhere to accept a few SSH commands doesn't prove much: if you watch my reproduction of the issue, you will see that 80% of my SSH commands with the faulty OpenSSH also worked. It's the 20% that didn't that made the product unusable. You can also make this product work 100% of the time (which is why this has been ignored for so long) simply by not running the OpenSSH for Windows implementation, which to me proves boot2docker is just fine. There could still be a small, weird parsing issue in boot2docker with the SSH strings or something, but my tests already prove that something different is being sent specifically by OpenSSH compared with other implementations. So I'm sure you agree the problem absolutely cannot be ONLY in boot2docker, or you'd see it choke on other SSH clients, and we don't.
Also, it really is amazing how much this bug affects. For years, Kitematic NEVER even ran on Win7 with Docker Machine because of the "120 second timeout" and other timeout issues (also 100% reproducible). Three years on, when I figured this out, swapped my path, and hit Enter without so much as a reboot, I was for the very first time on these systems (with either the VirtualBox or even the VMware Workstation driver) able to move my mouse a few inches, close the Kitematic error screen I've seen for years, re-open it, and less than 10 seconds later have a live Kitematic GUI up and running. All because my path pointed to a specific version of an SSH client, of all things.
This is/was/has been a product-killing bug that hopefully can finally be debugged down to the code defects and mitigated, easing all the docker-machine on Windows issues related to it. I appreciate @pldmgg and anyone else helping to get to the root cause (I'm VERY surprised neither DEV team has even acknowledged the issue, considering its severity and ease of reproduction). It's probably most efficient now, though, to focus on the root cause in code rather than spending too much time determining what you can still pull off with the product while this bug is in place. :)
@CollinChaffin In my opinion you should close this issue, as it is actually an aggregation of many other possible issues entangled together. If the core symptom is
then I think I have traced down a possible reason for a significant portion of people encountering such an error, and the real core problem should be solved within docker-machine instead.
Using the problematic command from your video example (docker-machine --debug env), you can see that it actually executes ssh with multiple options behind the scenes:
ssh.exe \
-F /dev/null \
-o ConnectionAttempts=3 \
-o ConnectTimeout=10 \
-o ControlMaster=no \
-o LogLevel=quiet \
-o ControlPath=none \
-o PasswordAuthentication=no \
-o ServerAliveInterval=60 \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/dev/null \
docker@127.0.0.1 \
-o IdentitiesOnly=yes \
-i <key_path>\id_rsa -p <port>
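The same argument list can be rebuilt outside docker-machine to reproduce the call by hand. This is a minimal Python sketch. The helper name `docker_machine_ssh_argv` is hypothetical. Its options are copied from the debug output above, not from docker-machine's source code.

```python
from typing import List, Sequence


def docker_machine_ssh_argv(host: str, port: int, key_path: str,
                            user: str = "docker",
                            remote_cmd: Sequence[str] = ()) -> List[str]:
    """Rebuild the ssh command line seen in `docker-machine --debug env` output.

    Hypothetical helper: the option list mirrors the captured debug output,
    not docker-machine's own implementation.
    """
    options = [
        "ConnectionAttempts=3",
        "ConnectTimeout=10",
        "ControlMaster=no",
        "LogLevel=quiet",
        "ControlPath=none",
        "PasswordAuthentication=no",
        "ServerAliveInterval=60",
        "StrictHostKeyChecking=no",
        "UserKnownHostsFile=/dev/null",
    ]
    argv = ["ssh", "-F", "/dev/null"]
    for opt in options:
        argv += ["-o", opt]
    argv.append(f"{user}@{host}")
    # In the captured output these options come after the destination
    argv += ["-o", "IdentitiesOnly=yes", "-i", key_path, "-p", str(port)]
    argv += list(remote_cmd)
    return argv
```

To reproduce the call, pass the returned list to subprocess.run(argv) and inspect returncode. ssh reserves exit status 255 for its own failures, as opposed to a failing remote command.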
When Windows native OpenSSH is used, one specific combination of conditions causes the error: requesting public-key-only authentication with a key that gets rejected. The following command demonstrates the subset of options involved:
$ "\Program Files\OpenSSH-Win64\ssh.exe" -o PasswordAuthentication=no -o IdentitiesOnly=yes -i <key_path>\id_rsa docker@127.0.0.1 -p <port>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions for '<key_path>\\id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "<key_path>\\id_rsa": bad permissions
docker@127.0.0.1: Permission denied (publickey,password,keyboard-interactive).
$ echo %ERRORLEVEL%
255
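Exit status 255 only says that ssh itself failed. The stderr text shows why. This Python sketch separates the rejected-key case from a plain authentication failure. The function name and its labels are hypothetical, and the matched strings come from the output above. docker-machine passes -o LogLevel=quiet, so these messages may not appear in its own output. The check is meant for the reduced command run by hand.

```python
from typing import Optional

SSH_FAILURE = 255  # ssh exits with 255 when ssh itself (not the remote command) fails


def diagnose_ssh_failure(returncode: int, stderr: str) -> Optional[str]:
    """Classify a failed ssh run using the messages shown above.

    Hypothetical helper: returns None when ssh did not fail at its own level.
    """
    if returncode != SSH_FAILURE:
        return None
    if "bad permissions" in stderr or "UNPROTECTED PRIVATE KEY FILE" in stderr:
        return "private key ignored: file permissions are too open"
    if "Permission denied" in stderr:
        return "authentication rejected by server"
    return "ssh failed for another reason (connection, host key, ...)"
```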
This, I believe, is the error you have been encountering, and it was definitely the cause in my case. After fixing the key permissions on Windows, I can successfully perform docker-machine operations with Windows OpenSSH and am a happy user again.
OTOH, some other SSH implementations work fine only because they don't understand Windows ACLs at all. For example, Git Bash's SSH happily accepts the private key. This is not the fault of Windows OpenSSH. The responsibility for creating a working private key falls on Docker, which has traditionally had subpar support on Windows.
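On POSIX systems, OpenSSH rejects a private key owned by the current user if the key grants any group or other permission bits (mask 077). The Windows port enforces an equivalent rule through file ACLs. This Python sketch shows the POSIX mode test and builds the icacls commands commonly used to tighten a key on Windows. Both function names are hypothetical. The icacls recipe removes inherited entries and then grants only the current user read access. It is an assumption about what "fixing key permission" meant here, not the poster's exact steps.

```python
import os
import stat
from typing import List


def posix_key_too_open(path: str) -> bool:
    """True if group/other have any permission bits on the key (POSIX only).

    Mirrors OpenSSH's mode test; it does not check file ownership.
    """
    mode = os.stat(path).st_mode
    return bool(stat.S_IMODE(mode) & 0o077)


def icacls_tighten_commands(key_path: str, user: str) -> List[List[str]]:
    """Build icacls invocations that leave only `user` with read access.

    /inheritance:r removes inherited ACEs; /grant:r replaces any explicit
    grants for the user with read-only access.
    """
    return [
        ["icacls", key_path, "/inheritance:r"],
        ["icacls", key_path, "/grant:r", f"{user}:R"],
    ]
```

On Windows, each command list can be run with subprocess.run(cmd, check=True), using the account name from os.environ["USERNAME"].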
BTW, there is no use kicking and screaming now, as Docker Toolbox has officially been declared EOL. It is better to try helping the remaining poor souls (read: Windows 7 users) than to file bugs on stuff that definitely won't be fixed.
My detailed write-up and video reproduction showing this incompatibility should more than answer the template questions, including the versions. As you will see, this is a rather mysterious but very easily reproduced issue that specifically involves OpenSSH-Win32 and Docker. It has been plaguing Docker users for years, and nobody (that I can find) has ever been able to put their finger on the cause. Yesterday I opened this same issue over at Docker, so I will simply paste it here in the hope that, between the Docker and OpenSSH folks, someone can look through the code a bit more to determine just what is happening:
Yesterday I tweeted and posted a video of the root cause of this (and almost every other Docker-Machine on Windows) error I have encountered.
Below I have posted a recap of the primary issue I have experienced, recently and, frankly, for YEARS. This has been frustrating, since all these GitHub issues seem to keep being closed erroneously, leaving us to attempt a vast range of workarounds without ever determining and addressing the root cause.
I have personally wasted probably hundreds of hours troubleshooting this and various similar Docker on Windows issues, and in all that time I have never seen this resolution posted. It is certainly possible I missed it; if I did, please feel free to point me to a dated write-up showing this as the root cause and I will absolutely stand corrected. Otherwise, I believe this is the first time the actual root cause has been fully demonstrated along with a solution.
Also of note: please read to the end, because a common response may be that my findings apply only to recent releases. As I show below, this root cause has been in place for YEARS (I tested all the way back to the initial release of Docker and the results were the same!).
In my three decades in the industry, uncovering this one still felt pretty significant, but then again, when you've been banging your head against the wall for years on something, it usually does. :)
Background
There have been GitHub issues opened and subsequently closed dating all the way back to ISSUE #66, which was opened very soon after the initial release and was also closed. Many of these appear to be reproducible under this issue's cause or are related to it in some way:
And the Docker forums:
And let's not forget all of the Stack Overflow sites (I'm only posting a couple, but there are MANY):
This issue's root cause is also responsible for all of these Kitematic issues, including:
And if that is not enough, just run the two Google searches below; you can bet that most, if not all, of the hits with these "mystery" exit-255, timeout, and nondescriptive errors are also this issue:
Issue(s)
Management of Docker using Docker-Machine on Windows is impossible from the native PowerShell and CMD shells under certain conditions.
Management of Docker using Docker-Machine on any OS may fail with recurring SSH errors.
Cause
Incompatible SSH client implementation.
Tests were run on versions ranging from the current beta release,
OpenSSH_for_Windows_7.6p1, LibreSSL 2.6.4 (OpenSSH-Win32),
all the way back to OpenSSH_7.1p1 Microsoft Pragma Win32 port Oct 7 2015, OpenSSL 1.0.2d 9 Jul 2015.
All tested versions exhibited this behavior. Something in the call to "ip addr show", and possibly other operations from Docker-Machine, is being interpreted incorrectly, resulting in a terminating error. This prevents any successful Docker operations from either native shell in Windows.
This is becoming a bigger and bigger issue, since the incompatible client is now installed automatically with PowerShell and Chocolatey and added to the system path. Its presence and priority in the system path vary from system to system, which is why this issue does not appear to plague everyone on Windows and has been so difficult to troubleshoot.
Because the MinGW bash shell relies on the separately installed Git SSH client, the QuickStart Terminal (usually) works, which has also added to the difficulty in troubleshooting. However, once you move to the native shells to manage containers created with the QuickStart Terminal, the system path will quickly prioritize the problematic SSH client and cause failures.
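Which ssh a native shell runs depends only on which directory comes first in PATH. That is why identical machines can behave differently. This Python sketch lists every ssh candidate in search order, and the first entry is the one that will run. The helper name is hypothetical. On Windows, shutil.which uses PATHEXT so that "ssh" matches ssh.exe.

```python
import os
import shutil
from typing import List, Optional


def ssh_candidates(path: Optional[str] = None, name: str = "ssh") -> List[str]:
    """Return every executable `name` found along PATH, in search order.

    The first entry is the one a shell (or docker-machine) will actually run.
    Note: on Windows, some Python versions also check the current directory.
    """
    search = path if path is not None else os.environ.get("PATH", "")
    found: List[str] = []
    for directory in search.split(os.pathsep):
        if not directory:
            continue
        hit = shutil.which(name, path=directory)
        if hit and hit not in found:
            found.append(hit)
    return found
```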
Based on my testing, until this is fully resolved I would suggest the following as an accurate one-paragraph summary of the current status of this issue:
Note that the real solution (and I am immediately opening a GitHub issue there as well) is for the PowerShell team, which now maintains the OpenSSH for Windows project, to work with the Docker team to capture low-level debug output from OpenSSH for Docker-Machine's call to "ip addr show" (and other commands) and determine what on earth is being interpreted as a bad return. As I demonstrated, the command clearly returns successfully, with exactly the same output as the versions using OpenSSL instead of LibreSSL, and the connection via TLS succeeds in all versions.
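One way to capture that low-level output is to run the same ssh command with ssh's -v/-vv/-vvv verbosity flags and without docker-machine's -o LogLevel=quiet. This Python sketch rewrites an existing argument list that way. The function name is hypothetical. It removes the LogLevel option outright so the result does not depend on how ssh resolves a conflict between LogLevel and -v.

```python
from typing import List, Sequence


def with_ssh_debug(argv: Sequence[str], level: int = 3) -> List[str]:
    """Return a copy of an ssh argv with quiet logging removed and -v flags added.

    `argv[0]` is assumed to be the ssh executable; `level` is the number of v's.
    """
    if not 1 <= level <= 3:
        raise ValueError("ssh supports -v, -vv or -vvv")
    out: List[str] = [argv[0], "-" + "v" * level]
    i = 1
    while i < len(argv):
        # Skip "-o LogLevel=..." pairs so they cannot silence the debug output
        if (argv[i] == "-o" and i + 1 < len(argv)
                and argv[i + 1].lower().startswith("loglevel=")):
            i += 2
            continue
        out.append(argv[i])
        i += 1
    return out
```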
Issue Reproduction
Workaround
Until the root cause of the incompatibility with the OpenSSH-Win32 client can be addressed, there is only one workaround that has been successfully tested.
Both the SYSTEM and USER PATH environment variables must be edited so that any references to OpenSSH-Win32 (default path C:\Program Files\OpenSSH-Win64) are moved BELOW another compatible SSH client. For example, the Git-installed version is compatible, so if Git is installed, the path C:\Program Files\Git\usr\bin should be moved before the OpenSSH entry. After making this change, a reboot is recommended.
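The PATH edit above amounts to moving every OpenSSH entry after the other directories, including Git's usr\bin. This Python sketch operates on a PATH string. The function name and the case-insensitive substring match on "OpenSSH" are assumptions. The result still has to be written back to the persistent SYSTEM and USER values through the usual Windows environment-variable settings.

```python
from typing import List


def demote_entries(path_value: str, demote_marker: str, sep: str = ";") -> str:
    """Move PATH entries containing `demote_marker` (case-insensitive) to the end.

    Relative order within the kept and the demoted groups is preserved, so any
    other ssh directory (e.g. Git's usr\\bin) ends up ahead of OpenSSH.
    Empty entries are dropped.
    """
    kept: List[str] = []
    demoted: List[str] = []
    for entry in path_value.split(sep):
        if not entry:
            continue
        (demoted if demote_marker.lower() in entry.lower() else kept).append(entry)
    return sep.join(kept + demoted)
```

Assigning the result to os.environ["PATH"] changes only the current process and its children. This is why the persistent edit, followed by a reboot or at least a new shell, is still needed.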