Closed philipdouglas closed 6 years ago
Do you use the same identical batch file for all windows machines? Is there any kind of error emitted during the bootstrap process and have you run with debug logging to see if that produces any clues?
Do you use the same identical batch file for all windows machines?
Unless knife is doing something differently that I am unaware of, yes.
Is there any kind of error emitted during the bootstrap process and have you run with debug logging to see if that produces any clues?
I get the following error when it tries to run the truncated command at the end of the batch file:
hostname C:\Users\Username>
DEBUG: hostname[642275CA-20A1-4FA5-B1D1-19F9AE2D63EB] => :command_cleanup[cmd.exe /C "%TEMP%\bootstrap-9492-1453110190.bat"]
DEBUG: hostname[642275CA-20A1-4FA5-B1D1-19F9AE2D63EB] => :shell_close
ERROR: Failed to execute command on hostname return code 255
Even with verbose logging, there's no indication the batch script didn't write successfully.
I haven't been able to duplicate this, even after installing the Japanese language pack on a VM. If you are still able, would you mind pasting the contents of the .bat file to a gist? I don't believe there is anything sensitive in the stock bat files. If not, one thing of interest is the actual number of bytes in the file.
Thanks for looking into this.
Windows says it's 15,237 bytes. Here's the file (with the certificates and encrypted data bag secret redacted): https://gist.github.com/FreakyDug/2ebdf8d81e9c35566671
I tried removing the --secret-file so the batch file would be slightly different. It did write more of the certificate, but neither the file length nor the size stayed the same.
Could you tell me the output of running the command chcp from a console on this node?
C:\Users\USERNAME>chcp
Active code page: 932
I'm seeing the same issue: the bootstrap batch file cuts off at line 370, in the middle of echoing a certificate to file, and the process fails with:
ERROR: Failed to execute command on xx.xx.xx.xx return code 255
output of chcp is:
Active code page: 437
version information:
> knife --version
Chef: 12.5.1
> chef --version
Chef Development Kit Version: 0.10.0
chef-client version: 12.5.1
berks version: 4.0.1
kitchen version: 1.4.2
Target node is Server 2012R2.
Could both of you include the knife bootstrap commands you are using? It may help to see what knife args you are using. Thanks!
From my workstation I ran
knife bootstrap windows winrm xx.xx.xx.xx --winrm-user username --winrm-password 'password' --node-name mynode --run-list 'recipe[my_recipe_name]' --winrm-transport ssl --winrm-ssl-verify-mode verify_none
Just to make sure it wasn't anything with the certs or run lists, I ran knife node create, assigned a run list, installed the chef client manually on the server, created a client.rb file referencing the same certificates that I manually copied to the server, and ran chef-client. The chef-client run completed successfully and accepted all of the certificates.
I run:
knife bootstrap windows winrm hostname -x domain\username -N hostname -r '''role[base-win]''' --environment=production
So it sounds like it's not an issue with command length. Encoding is somewhat doubtful since @jedatwork has the normal 437, but that does not strictly mean it's not an issue for @FreakyDug. It does sound like both of you run into problems during the rendering of the trusted certs.
If you run with debug logging (-VV) it may be helpful to see the last few debug lines prior to:
DEBUG: hostname[642275CA-20A1-4FA5-B1D1-19F9AE2D63EB] => :command_cleanup[cmd.exe /C "%TEMP%\bootstrap-9492-1453110190.bat"]
DEBUG: hostname[642275CA-20A1-4FA5-B1D1-19F9AE2D63EB] => :shell_close
ERROR: Failed to execute command on hostname return code 255
Also, @FreakyDug or @jedatwork, if you are able to easily reproduce, shoot me an email (it's in my profile) and I'd be happy to set up a Skype/Hangout/Zoom.
Here's more of the debug logs, though it doesn't reveal much:
hostname C:\Users\username.DOMAIN>msiexec /qn /log "C:\Users\username.DOMAIN\AppData\Local\Temp\chef-client-msi12420.log" /i "C:\Users\username.DOMAIN\AppData\Local\Temp\chef-client-latest.msi"
hostname Successfully installed Chef Client package.
hostname A subdirectory or file C:\chef\trusted_certs already exists.
hostname Installation completed successfully
hostname Writing validation key...
hostname Validation key written.
hostname
hostname C:\Users\username.DOMAIN>mkdir C:\chef\trusted_certs
hostname
hostname C:\Users\username.DOMAIN>(
hostname echo.-----BEGIN CERTIFICATE-----
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.REDACTED
hostname echo.-----END CERTIFICATE-----
hostname ) 1>C:\chef/trusted_certs/chef.crt
hostname
hostname C:\Users\username.DOMAIN>
DEBUG: hostname[B82E221D-568E-45A5-8F0A-ED52511ADC43] => :command_cleanup[cmd.exe /C "%TEMP%\bootstrap-5128-1454323832.bat"]
DEBUG: hostname[B82E221D-568E-45A5-8F0A-ED52511ADC43] => :shell_close
ERROR: Failed to execute command on hostname return code 255
The main thing I'd like to understand from the debug logs is if the entire file is being "sent" to the machine and not being written or if the interruption is happening during transmission. I'm curious if there are debug logs showing the entirety of the trusted cert winrm command output.
Ah, I see. Yeah, the debug log shows it writing the whole batch file, including the whole certificate and the chef config and chef-client call at the end.
That's good info. I guess the next thing to try is to find the very first chunk that failed to make it into the file. Copy the command from the run_command debug logging and run that in a command window on the node. Then see if it fails or if it appears just partially appended, and troubleshoot from there.
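If a local copy of the intended script is available, the search for the first missing chunk could be sketched in Python as below. This is only an illustration: the function name, the sample lines, and the five-lines-per-chunk figure are hypothetical and are not taken from knife's actual chunking.

```python
def first_missing_chunk(expected_lines, actual_lines, lines_per_chunk):
    """Return the 1-based chunk number where the files first diverge, or None.

    lines_per_chunk is a hypothetical value; it is not knife's real chunk size.
    """
    for index, expected in enumerate(expected_lines):
        if index >= len(actual_lines) or actual_lines[index] != expected:
            return index // lines_per_chunk + 1
    return None


# Made-up data: 40 lines means 8 chunks of 5 lines each.
expected = [f"echo line {n}" for n in range(1, 41)]
truncated = expected[:35]  # only chunks 1 to 7 reached the file

print(first_missing_chunk(expected, truncated, lines_per_chunk=5))
```

Once the chunk number is known, the matching run_command line from the -VV output is the one to replay by hand on the node.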
I should have noticed this earlier but doing it manually revealed that the missing part is all of chunk 8. 1 to 7 get written correctly but there's nothing from 8.
When I ran the command for 8 manually cmd just printed "More?". I broke the command up and identified the problem line as:
cmd.exe /C echo Rendering "%TEMP%\bootstrap-3036-1454409234.bat" chunk 8 && >> "%TEMP%\bootstrap-3036-1454409234.bat" (echo.SET "PATH=%PATH%;C:\ruby\bin;C:\opscode\chef\bin;C:\opscode\chef\embedded\bin")
However, if you make it a standalone echo (without any &&s) it works fine:
cmd.exe /C echo >> "%TEMP%\bootstrap-3036-1454409234.bat" (echo.SET "PATH=%PATH%;C:\ruby\bin;C:\opscode\chef\bin;C:\opscode\chef\embedded\bin")
Both versions of the line work fine on my machine, and the only difference I can see on the node is that Windows prints "\" as "¥" because it's set to Japanese.
Any ideas?
Doh, I found the problem. There was a rogue " in the node's PATH variable. I don't know if that's something chef can protect itself against, but it would definitely be helpful if it could report that it failed to write the batch file fully.
Yeah, agreed. So that backslash issue was not a problem?
No, my hunch there was wrong.
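That fits with how the two code pages treat the character. As far as I know, the backslash is the same byte (0x5C) in code pages 437 and 932, and Japanese consoles only draw it as ¥. A quick check using Python's built-in cp437 and cp932 codecs, which are assumed here to be a reasonable stand-in for the console code pages:

```python
backslash = "\\"

# Encode the backslash in each code page and confirm it round-trips unchanged.
for codec in ("cp437", "cp932"):
    encoded = backslash.encode(codec)
    print(codec, encoded, encoded.decode(codec) == backslash)
```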
Just curious exactly how your path was quoted. I've done a similar hack but get a different error code. Did your path have a single pair of double quotes?
Yeah, for some reason, there was a single " at the end of the PATH. I was able to reproduce the "More?" problem by just running:
cmd.exe /C echo test && (echo.""")
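A rough way to see why an odd number of quotes produces the "More?" prompt is the Python sketch below. It is a deliberately simplified model and not cmd.exe's real parser: it assumes each double quote toggles an "inside quotes" state and that a closing parenthesis only ends a block when it appears outside quotes. The function name is hypothetical.

```python
def unclosed_parens(command):
    """Count '(' left open, ignoring parentheses that fall inside double quotes."""
    in_quotes = False
    depth = 0
    for char in command:
        if char == '"':
            in_quotes = not in_quotes      # every quote flips the state
        elif not in_quotes:
            if char == "(":
                depth += 1
            elif char == ")" and depth > 0:
                depth -= 1
    return depth


# Three quotes: the final ")" is treated as quoted text, so the block stays open.
print(unclosed_parens('echo test && (echo.""")'))
# Two quotes: the ")" is outside quotes and closes the block.
print(unclosed_parens('echo test && (echo."")'))
```

Under this model, a stray quote that arrives through %PATH% has the same effect on the SET "PATH=%PATH%;..." line in chunk 8, assuming the variable is expanded before the line is parsed.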
I ended up finding the same issue, the target node had an entry of
C:\Program Files\Amazon\cfn-bootstrap"
in the path and it was breaking the quoted string in the knife run. After removing it from the path, the client run seems to go on with no issues.
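A check for this kind of entry could look something like the Python sketch below. It assumes a plain split on ";" is good enough for spotting a stray quote, and the function name and the sample PATH value are made up for illustration (only the cfn-bootstrap entry comes from this thread).

```python
def entries_with_stray_quotes(path_value, separator=";"):
    """Return PATH entries that contain an odd number of double quotes."""
    return [entry for entry in path_value.split(separator)
            if entry.count('"') % 2 == 1]


# Hypothetical PATH value containing the entry reported above.
sample = r'C:\Windows\system32;C:\Program Files\Amazon\cfn-bootstrap";C:\ruby\bin'
print(entries_with_stray_quotes(sample))
```

On a node, the same function could be given os.environ.get("PATH", "") after importing os, before attempting the bootstrap.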
Definitely need to make the transmission of this file a smoother ride. Thanks a ton to both @FreakyDug and @jedatwork for the details provided
Quotes around entries in the Windows ENV PATH caused this issue for me as well. Removing the quotes from the entries fixed the issue.
Going to close this one as it's not really an issue with the software itself and we've narrowed down what causes this for posterity.
I am trying to bootstrap a Windows machine over WinRM using knife. The batch file that it writes to %TEMP% is being truncated at 370 lines, which is halfway through one of the certificates, and it obviously fails to execute.
I have bootstrapped 40ish machines, most of them Windows, without seeing this problem, and I'm not doing anything different. It's the first Windows 8 machine I've tried, but I've successfully bootstrapped Windows 10 and Server 2012 R2. The only other difference is that the machine has the Japanese language pack installed.
Any ideas what's going on?