cloudbase / cloudbase-init

Cross-platform instance initialization
http://openstack.org
Apache License 2.0
421 stars 149 forks source link

Sometimes Windows installation failed #14

Open jiaweien opened 5 years ago

jiaweien commented 5 years ago

On the openstack platform, create the virtual machine using the Windows image template processed by sysprep(I used the windows 2012 data center Chinese). sometimes cloudbase-init reboot the Windows after sethostname when setup.exe is running, although the probability of that happening is very small. At that time the SYSTEM\Setup\Status\SysprepStatus\GeneralizationState's value is already 7, but the setup of oobe still running, this causes the Windows installation to fail to continue with the exception, and the restart remains the same. I apologize for not keeping the log file, but please refer to the time in log file of sysprep and cloudbase-init for details.

ader1990 commented 5 years ago

Hello,

Can you open a bug report on https://bugs.launchpad.net/cloudbase-init and use the git review workflow to push a fix for the bug? Thank you. Let me know if you need help with using 'git review'.

This repo is just a mirror of https://opendev.org/x/cloudbase-init.git and the bug tracking/code patches are done using launchpad and review.opendev.org

ader1990 commented 5 years ago

@jiaweien have you tried using the registry keys in HKEY_LOCAL_MACHINE\SYSTEM\Setup\Status\ChildCompletion to achieve the same result? I think the process checking can be very brittle in case of newer Windows versions.

jiaweien commented 5 years ago

I'm sorry.Since I had other work to do, I didn't try to modify the registry to test it. But, when windows installation fails, I had press the shift+F10, open the terminal(I also looked at the cloudbase-init and setup logs in this way and learned that there might be some problems with the execution timing of cloudbase-init), input regedit for open the registry. modify the HKEY_LOCAL_MACHINE\SYSTEM\Setup\Status\ChildCompletion/setup.exe value to 3, and reboot the windows, can be accessed the system. The chance of this happening again is extremely low, and I've been testing it ever since, but so far I haven't had one. If I run into it again, I'll upload the screenshot. 发自我的华为手机-------- 原始邮件 --------主题:Re: [cloudbase/cloudbase-init] Sometimes Windows installation failed (#14)发件人:Adrian Vladu 收件人:cloudbase/cloudbase-init 抄送:jiaweien 15321830303@163.com,Mention @jiaweien have you tried using the registry keys in HKEY_LOCAL_MACHINE\SYSTEM\Setup\Status\ChildCompletion to achieve the same result? I think the process checking can be very brittle in case of newer Windows versions.

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.

jiaweien commented 5 years ago

Hello    I'll do it later. Thanks for your remind. 发自我的华为手机-------- 原始邮件 --------主题:Re: [cloudbase/cloudbase-init] Sometimes Windows installation failed (#14)发件人:Adrian Vladu 收件人:cloudbase/cloudbase-init 抄送:jiaweien 15321830303@163.com,Author Hello, Can you open a bug report on https://bugs.launchpad.net/cloudbase-init and use the git review workflow to push a fix for the bug? Thank you. Let me know if you need help with using 'git review'.

—You are receiving this because you authored the thread.Reply to this email directly, view it on GitHub, or mute the thread.

jiaweien commented 4 years ago

@ader1990 Someone told me recently that he had the same problem,I haven't come up with a better solution yet except process checking. So , could you review it?

ader1990 commented 4 years ago

@jiaweien I understand the issue here, but we need to find a more reliable way to fix this problem, as looking for processes with a hardcoded executable (like setup,exe) can bring to an infinite loop or failure if the admin installed some service that has setup.exe in the path.

If I make a patch using HKEY_LOCAL_MACHINE\SYSTEM\Setup\Status\ChildCompletion to test the issue, could you test it?

jiaweien commented 4 years ago

@ader1990 Okay,I'll be test it later, and tell you the result when it's over.

jiaweien commented 4 years ago

@ader1990 During this period, I tested Windows server 08 R2 and Windows server 2012 R2 several times(both are in chinese). I checked whether the registry value of SYSTEM\Setup\Status\SetupFinalTasks was 3 to determine whether the initialization had been completed.So far, there have been no initialization failures.But I'm not sure if I haven't tested enough or if I've fixed the problem.I also included my changes in the patches I submitted,I wonder if that's what you suggest.

jiaweien commented 4 years ago

@ader1990 test-code cloudbase-init-log Just now, I got this log from my test code.In this initialization, when the GeneralizationState value is 7, there are approximately 3 seconds or so after the SetupFinalTasks value is not 3. This should indicate that the initialization work can be determined by checking the status of SetupFinalTasks.

ader1990 commented 4 years ago

On a Windows 10, this is what these registry keys contain:

Get-ChildItem -Path HKLM:\system\Setup\Status\

    Hive: HKEY_LOCAL_MACHINE\system\Setup\Status

Name                           Property
----                           --------
ChildCompletion                setup.exe       : 3
                               oobeldr.exe     : 3
                               SetupFinalTasks : 3
SysprepStatus                  GeneralizationState : 7
UnattendPasses                 oobeSystem : 2

I need to check if all Windows versions have these keys to make sure there is no regression when we check the child completion keys.

ader1990 commented 4 years ago

@jiaweien Here is a suggestion to improve the code (needs refactor or a better placement of code though).

I would prefer putting the ChildCompletion check just before the plugin reboot as from my understanding, you get this error only if you run the SetHostname Plugin in the normal cloudbase-init run. The cloudbase-init installer configures the SetHostname Plugin to run during the Sysprep specialize step (which is the recommended way).

Your current code will hang forever if cloudbase-init is installed using the default installer from cloudbase.it . The reason is that during the unattend phase, the value of the subkeys from ChildCompletion are: SetupFinalTasks = 0 and setup.exe = 1.

    def wait_for_boot_completion(self):
        try:
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                                "SYSTEM\\Setup\\Status\\SysprepStatus", 0,
                                winreg.KEY_READ) as key:
                while True:
                    gen_state = winreg.QueryValueEx(key,
                                                    "GeneralizationState")[0]
                    if gen_state == 7:
                        break
                    time.sleep(0.1)
                    LOG.info('Waiting for sysprep completion. '
                             'GeneralizationState: %d', gen_state)

            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                                "SYSTEM\\Setup\\Status\\ChildCompletion", 0,
                                winreg.KEY_READ) as key:
                steps = 0
                while steps < 1000:
                    setup_state = winreg.QueryValueEx(key,
                                                      "setup.exe")[0]
                    setup_final_state = winreg.QueryValueEx(key,
                                                            "SetupFinalTasks")[0]
                    if setup_state == 3 and setup_final_state == 3:
                        LOG.info('Sysprep entered desired mode for service run.')
                        break
                    if setup_state == 1 and setup_final_state == 0:
                        LOG.info('Sysprep entered desired mode for unattended run.')
                        break
                    time.sleep(0.1)
                    LOG.info('Waiting for sysprep completion. '
                             'setup_state: %(setup_state)d and setup_final_state: %(setup_final_state)d', {"setup_state": setup_state, "setup_final_state": setup_final_state})
                    steps = steps + 1
        except WindowsError as ex:
            if ex.winerror == 2:
                LOG.debug('Sysprep data not found in the registry, '
                          'skipping sysprep completion check.')
            else:
                raise ex
jiaweien commented 4 years ago

@ader1990 Maybe make the cloudbase-init installer configures the SetHostname Plugin to run during the Sysprep specialize step is the best choice. The purpose of my current code is to wait until the setup process and its child processes (which contain the parsing and execution of unattention.xml) are complete before starting the cloudbase-init task. Also, I'm not sure if the code you posted will cause the previous error. For example, in my previous screenshot, when the SYSTEM\Setup\Status\SysprepStatus\GeneralizationState is 7, there are nearly 3 seconds left for the Setup to be truly complete. If cloudbase-init executes to sethostname when setup_state == 1 and setup_final_state == 0 are set, the previous error may be thrown.

ader1990 commented 4 years ago

When using the cloudbase-init MSI installer, the installer has an option to sysprep the machine at the end of the installation. By default, it configures SetHostname Plugin to run during the Sysprep specialize step.

What my code does is to handle two scenarios:

  1. when cloudbase-init runs during specialize phase, it waits for the GeneralizationState=7, setup.exe = 1 and SetupFinalTasks = 0 (as the childcompletions are not ready at that moment). The reboot is performed by specialize step thorugh the special instruction of the RunSychronousComand - WillReboot https://github.com/cloudbase/cloudbase-init-installer/blob/master/CloudbaseInitSetup/Unattend.xml#L26
  2. when cloudbase-init runs at each boot as a service, it waits for GeneralizationState=7, setup.exe = 3 and SetupFinalTasks = 3 (as the childcompletions should be finished before performing other tasks like reboot, user scripts, etc).
jiaweien commented 4 years ago

I'm very sorry. I may have misunderstood you.I thought you wanted to use your code without refactoring. My idea is to complete the hostname setting by modifying unattend.xml, adding sethostname task in unattend.xml, and configuring willreboot as request to complete the hostname update. Is that what you said about changing the sethostname process? This way, users who want to customize unattend.xml (I used my own generated unattend.xml) will need to add the custom part to the unattend.xml in the cloudbase-init installer directory before proceeding to the last step, sysprep. I tried it, and it worked.

ader1990 commented 4 years ago

Hello @jiaweien,

Did you manage to solve this issue per my suggestion from above or did you use a special unattedn.xml? Do you still need some code change in Cloudbase-Init?

Thank you, Adrian Vladu

jiaweien commented 4 years ago

@ader1990 I'm sorry. I've been busy lately. I just tried to modify the architecture of the way what you said the last time , in the unattend.xml file before execution cloudbase-init added a step (my test scripts, not sethostname), it is can be done before the start of the cloubase-init (I reboot the machine in my test script, I saw the completion of sysprep log than cloudbase-init start logging time sooner). My idea is that, would be replace my script with sethostname, setting the hostname and restarting it before the other steps in cloudbase-init are executed. However, if this is true, it may be worth noting that sethostname may require a network to retrieve the hostname from the metadata.

jiaweien commented 3 years ago

Hello @ader1990 , A lot has been going on lately, so...Sorry for the late reply

You're right, my code will be hang forever when using the cloudbase-init MSI installer, the installer has an option to sysprep the machine at the end of the installation.. image image

And I also tested your code, I have tested it many times and it always has completed the initialization of the virtual machine. image image

But I still wonder if these changes only reduce the likelihood of initialization failure, not prevent it altogether.Because if I choose to use the cloudbase-init service to initialize the server, will there still be GeneralizationState=7, setup.exe = 1 and SetupFinalTasks = 0 at some point-in-time, and then restart the server directly after sethostname plugin completes?(But that's just my suspicion, and it didn't happen during my tests).

> if setup_state == 1 and setup_final_state == 0:
>      LOG.info('Sysprep entered desired mode for unattended run.') 
>      break

Also, I am not sure if I can add a judgment on the existence of the "Cloudbase-init-unattend.log" file to determine whether Cloudbase-init is initiated by the service or by unattended(Because I see that when Cloudbase-init is initiated by unattended. it will log to "Cloudbase-init-unattended. log"). image image

eg.

if GeneralizationState=7:
    while steps < 1000:
        if C:\Program Files\Cloudbase Solutions\Cloudbase-Init\log\cloudbase-init-unattend.log is exist:
            if setup.exe = 1 and SetupFinalTasks = 0:
                break
        else:
             if setup_state == 3 and setup_final_state == 3:
                break
        time.sleep(0.1)
        steps = steps + 1

Do you think this is better? Or that it may not be necessary at all. I'd like to hear your thoughts.

Thank you, jiaweien

ader1990 commented 3 years ago

Hello @jiaweien ,

You can have a check of the config options that are set in the unattend conf, like https://github.com/cloudbase/cloudbase-init-installer/blob/master/CloudbaseInitSetup/Actions/ConfFileActions.js#L86, but that is a brittle and might change in the future. The best course of action would still be the registry keys check, as those keys are reliable source of identifying in what stage cloudbase-init is run. Still, checking that stop_service_on_exit is false can reliably tell that cloudbase-init is not running in normal mode, under a Windows service, but this change cannot be included in the upstream codebase, whereas the code I shared can and should be 100% reliable.

In this case, I will create a patch with those checks and would be glad if you could test the new MSI.

Thank you, Adrian Vladu

jiaweien commented 3 years ago

Hello @ader1990 ,

Yes, I also noticed allow_reboot is false in cloudbase-init-unattend.conf when I was testing earlier. In my custom virtual machine image, I manually sysprep after cloudbase-init is installed(C:\windows\system32\sysprep.exe /generalize /oobe /shutdown), instead of selecting Cloudbase-init.msi to complete sysprep and initialize it to the virtual machine image template. This will allow me to have my own custom unattend step, and also not affect the execution of Cloudbase-init as a service. So I'm not worried about Cloudbase-init running in unattended-mode, just not known if Cloudbase-init running in service mode can completely avoid this problem?

I am looking forward to your new Cloudbase-init.msi and I will test it again then.

Thank you jiaweien