audiolize / vagrant-softlayer

This is a Vagrant plugin that adds a SoftLayer provider to Vagrant, allowing Vagrant to control and provision SoftLayer CCI instances.
MIT License

sl.post_install script sometimes fails to run to completion #54

Closed lonniev closed 9 years ago

lonniev commented 9 years ago

@ju2wheels I am debugging provisioning of Windows guests on SoftLayer and am trying to get the default Windows image ready for vagrant use through the use of a post_install script.

The particular script I am using is at https://gist.github.com/lonniev/7d967b09add6ca1f3a8a. It elevates the session's privileges so that PowerShell scripts are allowed to run, and then launches a PowerShell script to do the real work.

My curiosity at the moment is why might SL fail to run the script to completion? I am seeing that the script (which runs to completion rapidly when run manually on a virgin guest after connecting through RDP) may stop midway and never complete.

Yesterday, the delays with DAL05 were atypically long. Is there some heartbeat monitoring between the host/remote vagrant session and the SL-local provisioning process where the post_install might be aborted if the vagrant session either gives up on the process or if the vagrant session has not responded to a "ping" from the SL session?

If not, I would expect that the SL guest would run the post_install to completion, quickly, without participating in any monitoring back to the vagrant session.

It may be that DAL05 had inside-to-outside network issues and the post_install script hung trying to pull in its Chocolatey installs. However, I would have expected to see something in the post_install log file about any (or most) abnormal terminations.

Not an issue yet, just a question.

lonniev commented 9 years ago

Exception calling "DownloadString" with "1" argument(s): "The underlying 
connection was closed: An unexpected error occurred on a send."
At line:1 char:1
+ iex ((new-object 
net.webclient).DownloadString('https://gist.githubusercontent.c ...
+ 
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : WebException

This is one example of the issues with post installation. Meanwhile, the waiting "vagrant rebuild" or "vagrant up --provider=softlayer" process stalls waiting for Boot to finish.

Because the post_install fails (for issues not having to do with the script itself), crucial capabilities like the vagrant user and ssh and rsync don't get added and so vagrant is never going to get status until it balks and gives up.

Is there a way that vagrant can get notified that the post install process failed so that the recovery can occur sooner? Is there a way to learn why SL internally has such difficulty in performing the post installs reliably?

ju2wheels commented 9 years ago

> Yesterday, the delays with DAL05 were atypically long. Is there some heartbeat monitoring between the host/remote vagrant session and the SL-local provisioning process where the post_install might be aborted if the vagrant session either gives up on the process or if the vagrant session has not responded to a "ping" from the SL session?

No, there's no heartbeat monitoring. After we issue an order we just keep polling for the transaction status of the machine and wait for completion of the order transaction and for the machine to be running. There's no push notification to vagrant-softlayer about the state; it's always a pull, with us asking the API. vagrant-softlayer continues to wait for the order transaction completion and server running status until the server is ready or one of the vagrant-softlayer timeouts has been reached.
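
The pull-based wait described above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: `wait_until_ready`, the `fetch_status` callable, and the `:pending`/`:running` symbols are hypothetical names standing in for whatever the plugin asks the SoftLayer API.

```ruby
# Hypothetical sketch of a pull-based wait loop; NOT vagrant-softlayer's real code.
# fetch_status is any callable that returns the current state (e.g. :pending, :running).
# sleeper is injectable so the loop can be exercised without real delays.
def wait_until_ready(fetch_status, timeout:, interval: 1, sleeper: ->(s) { sleep(s) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    # The caller always asks for the state; nothing is pushed to it.
    return :ready if fetch_status.call == :running
    # The only other way out is the caller's own timeout.
    return :timed_out if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleeper.call(interval)
  end
end
```

In a loop of this shape, a failed post_install is invisible: the status being polled never reports it, so the caller only finds out when its own timeout expires.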

> Is there a way that vagrant can get notified that the post install process failed so that the recovery can occur sooner? Is there a way to learn why SL internally has such difficulty in performing the post installs reliably?

A post_install failure is not something we can detect, so we cannot avoid waiting for the server to come up when the post_install fails (the only notification of that which I'm aware of is the notification emails).

If you are having problems with this, your best bet is to open a ticket with SL support. I haven't seen good docs for how the post_install behaves. In particular, some of the questions brought up here and through other issues were:

  1. Is the post_install blocking/non blocking? Is the blocking/non blocking behavior different for OS type (Win/Linux) or even between different versions of the OS?
  2. Is the post_install subject to a timeout on the SL API side? I.e., how long will the SL API side wait for the post_install, if it waits for it at all?

From my own testing with the Windows post_install a few months ago, I can tell you it fails just about 100% of the time on older Windows versions and sporadically on the newer ones. I had no luck with tickets that I opened to try to track down the issue. My edumacated guess as to why it's so sporadic (being that I have 0 working knowledge of the SL API internals) is that the post_install is being run after Windows patches/updates have been applied but before the machine has been restarted to let them finish, possibly leaving the .NET env and Windows in a wonky state.

lonniev commented 9 years ago

https://control.softlayer.com/support/tickets/17943221

lonniev commented 9 years ago

@ju2wheels if it is an automated execution of Windows Update that is shutting down the Windows instance while it is running the post_install script, would it be possible and effective to have a Windows box image that has Windows Update disabled?

If that would stop the apparent reboot during the post_install script, then it could be left to the user of the image to either call Windows Update manually or reenable the service once they want to be able to handle any reboots.

This may not be effective if (1) it isn't a Windows Update reboot that is causing the mysterious abort of the post_install or (2) SoftLayer forcibly runs a Windows Update on all its Windows instances whether or not the instance has the service enabled.

lonniev commented 9 years ago

I just checked the Windows Update history on one of the SL instances and it claims it was last updated "Never". That suggests that my hope that maybe Windows Update was a guilty troublemaker is pretty slim.

If not that, then why is the post installation script getting aborted so often?

ju2wheels commented 9 years ago

I don't think it's Windows Update but rather some SL internal update/normalization process. If you enable vagrant debug logging and watch the transaction states as it's going, IIRC there is a reboot at some point and then it does an update process (which I think is their attempt to install .NET components and other stuff that is not normally installed with base Windows installs, as some of the older Windows versions have more recent .NET versions installed).

lonniev commented 9 years ago

Perhaps I could create a new SL instance, let it run entirely to an idle state, save that as a clone, and then use that image as the effective base box. I'd still like the opportunity to run a post install script, but maybe SL would keep itself out of the process for that?

lonniev commented 9 years ago

Sorry to be an SL tyro: how does one find the GUID for a SoftLayer Standard Template Image? I hope to be able to find the GUID for the image I just made from the Windows Server that has been prepped for Vagrant use. However, I cannot find a way to get a GUID for the image, whether the resource is a Private Image, Public Image, or exported ObjectStore resource. Where is the GUID revealed?

ju2wheels commented 9 years ago

Install the SoftLayer Python CLI tool and use it to list your images:

pip install SoftLayer
sl image list

lonniev commented 9 years ago

Nothing is so easy. ;-)

I get to resolve this error now: https://github.com/softlayer/softlayer-python/issues/486

ju2wheels commented 9 years ago

Did it not add the sl or slcli command to your PATH? Or do you get that error using the CLI tool too? Not sure how that behaves on a Mac.

lonniev commented 9 years ago

python and pip were a hash of brew and macosx. I had to update, upgrade, and doctor brew. Then uninstall python, then install it, then link it. Then "sl" complained that it is obsolete and I should use "slcli" instead. Currently, slcli is complaining about the format of my .softlayer file, a file I haven't looked at for a year. Now trying to find out WTF it wants. ;-)

lonniev commented 9 years ago

slcli setup obviously regenerates the ~/.softlayer file. However, what is the right choice for "endpoint" and what happened to the setting for the domain name?

lonniev commented 9 years ago

Ah, I see what's happening: originally, I created ~/.softlayer to source in the SL_* environment variables and now this CLI app also wants to use the same filename. I can move mine to ~/.softlayer.env and let slcli have the .softlayer file.

Answering my own question, "public" is the right endpoint.

ju2wheels commented 9 years ago

I've only ever had username, api_key, and endpoint_url in mine; no idea what the domain name was for. You can back up your file and run slcli config setup, and it will drop in the right value for endpoint_url.

lonniev commented 9 years ago

Perhaps I made it up? I reference it in the Vagrantfile to pass the default domain name into the CCI provisioner.

Anyway, in the end, slcli image list reveals the GUIDs. Thank you, once again.

lonniev commented 9 years ago

@ju2wheels what is the relationship between cci.vm.box and sl.image_guid? How do we convince the provisioner that we want an image_guid and not a box image?

      #Note: If you use SL_GENERIC box you must set sl.image_guid or sl.operating_system/sl.disk_capacity, otherwise it is pre-set for you by the box

This might be better said that if you want to use a global image, then you can't set sl.operating_system and you must set the box to be SL_GENERIC.

However, this isn't enough. Something else is guiding the provisioner to think that sl.operating_system is set even though it is not explicitly set [in the Vagrantfile]. It may be that I am trying to destroy an image that was created with a box after modifying the Vagrantfile for the image use. Perhaps it is reading machine metadata and is complaining.

lonniev commented 9 years ago

It is not the case that an existing machine from a prior configuration causes the conflict. Some setting in the Vagrantfile is either incorrect or missing.

Is there any code that says, "hey, the user wants, say, winrm. Therefore, let's set sl.operating_system to :windows for them"? That would be bad.

lonniev commented 9 years ago

Explicitly setting sl.operating_system = nil allows the provisioning to carry on. Whether or not SL actually loads the requested global image remains to be seen.
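
For anyone following along, the combination described above might look something like this in a Vagrantfile. Treat it as a sketch under assumptions: the block variable is named `config` here (the thread's own Vagrantfile uses `cci`), the GUID string is a placeholder, and credentials and every other provider setting are omitted.

```ruby
# Sketch only: boot from an image GUID instead of a box-defined OS.
Vagrant.configure("2") do |config|
  # Generic box, so the box itself does not pre-set an operating system.
  config.vm.box = "SL_GENERIC"

  config.vm.provider :softlayer do |sl|
    # Placeholder value; substitute the GUID reported by the CLI tool.
    sl.image_guid = "YOUR-IMAGE-GUID"
    # Explicitly cleared, which is what let provisioning carry on in this thread.
    sl.operating_system = nil
  end
end
```

Why the explicit nil is needed is not established in this thread, so it is shown only because it worked here.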

lonniev commented 9 years ago

Great! That worked. Vagrant informed SL to provision the right image and it did it. The image has all the vagrantization performed on the bootstrapping instance with the intermittent post_install script.

Now, I need only get vagrant 1.7.3 to get the path-making fix for rsync. (Or that's all I know I need right now!)

lonniev commented 9 years ago

By the way, Windows Update definitely runs on the original image because the update history shows a boatload of Windows updates that are installed "today" and none others after the August 2014 build date in your image.