actions / runner-images

GitHub Actions runner images

Windows 2019 Sysprep forever #964

Closed davemedvitz closed 4 years ago

davemedvitz commented 4 years ago

Trying to build a 2019 image from win19/20200524.1, modified for bug #940.

The build reaches its last step, runs Sysprep, then continues to log

IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE

every ten seconds. For the last 11 hours.

We also tried going back to the previous release (not tagged as pre-release), win19/20200517.1. That release also does not build, due to bug #898.

Virtual environments affected

Expected behavior

Packer build to complete with a viable VM image

Actual behavior

The Windows2019 packer template was adjusted to remove the Install-IEWebDriver and Validate-SeleniumWebDrivers steps, then run. This is executed from cloud-init on a temporary build box.

log from serial console

[17042.562839] cloud-init[2079]: ==> vhd: Provisioning with Powershell...
[17042.570186] cloud-init[2079]: ==> vhd: Provisioning with powershell script: /tmp/powershell-provisioner037029381
[17042.570784] cloud-init[2079]:  vhd: IMAGE_STATE_COMPLETE
[17042.573014] cloud-init[2079]:  vhd: IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE
[17044.073317] cloud-init[2079]:  vhd: IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE
. . .
[56831.770690] cloud-init[2079]:  vhd: IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE
[56841.785452] cloud-init[2079]:  vhd: IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE

I have validated that the sysprep command contains the /generalize switch; that block hasn't been touched by any processing.

Darleev commented 4 years ago

Hello @davemedvitz, I found several similar issues:
https://github.com/hashicorp/packer/issues/8428
https://github.com/hashicorp/packer/issues/8929
We are able to build the image without any issue, so this looks like a temporary issue with Azure services. Which version of Packer did you use for the build? Could you please try to build the image with Packer version 1.5.6? We are looking forward to your reply.

davemedvitz commented 4 years ago

I am on 1.5.5. Will update, rerun, and report.

Thank you Dave

davemedvitz commented 4 years ago

Reran using packer 1.5.6 with the same results.

Having read the packer issues you mentioned, I've added boot diagnostics to the packer config and am running again, with the intent of determining whether sysprep is hung, whether a reboot was pending (as one of the suggestions indicated), or whether it is something else.

Dave

miketimofeev commented 4 years ago

@davemedvitz we have a similar issue with some tips that may be useful for you as well: https://github.com/actions/virtual-environments/issues/507

aledeniz commented 4 years ago

@Darleev I am on 1.5.5, I am experiencing the same intermittent issue reported by @davemedvitz.

Let's say 20% to 30% of my Windows Server 2019 builds are wasted because of the dreaded IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE issue. I have mitigated it with a couple of restarts before the sysprep. It is a pity, because otherwise this specific build wouldn't be flaky at all.

aledeniz commented 4 years ago

Also, @Darleev, you find a lot of closed issues because that stupid HashiBot is closing the tickets too early. Closing the tickets too early is not really going to make the issues disappear ..

aledeniz commented 4 years ago

@davemedvitz your ticket prompted me to look at my huge collection of build logs related to this issue, and I may have spotted something. I have put a "choco list -li" before the last restart before the sysprep, and when I get the dreaded issue, it reports one more application:

Microsoft Monitoring Agent|10.20.18038.0

I think those subscriptions are actually configured to push that to VMs via a VM extension. I wonder if I should put 2 reboots before the generalisation, or perhaps try to find a way to exclude the resource group where packer is generating the VM from being pushed those extensions (there may be other ones ..).

aledeniz commented 4 years ago

@davemedvitz 2 extensions got pushed during a build I started after reading your ticket. I can see both in the resource group and in the activity log blade (look for "'deployIfNotExists' Policy action" items). Now, this time these extensions were pushed pretty early (there are still a couple of reboots to go before reaching sysprep), fingers crossed .. this is what I see in the extension blade of the VM packer is building:

AzurePolicyforWindows    Microsoft.GuestConfiguration.ConfigurationforWindows    1.*    Provisioning succeeded
MicrosoftMonitoringAgent    Microsoft.EnterpriseCloud.Monitoring.MicrosoftMonitoringAgent    1.*    Provisioning succeeded

aledeniz commented 4 years ago

Failed .. looking at C:\windows\System32\Sysprep\Panther\setuperr.log, I can read:

2020-05-30 23:52:38, Error [0x0f0082] SYSPRP LaunchDll:Failure occurred while executing 'DscCore.dll,SysPrep_Cleanup', returned error code 0x2
2020-05-30 23:52:38, Error [0x0f0070] SYSPRP RunExternalDlls:An error occurred while running registry sysprep DLLs, halting sysprep execution. dwRet = 0x2[gle=0x00000006]
2020-05-30 23:52:38, Error [0x0f00ae] SYSPRP WinMain:Hit failure while processing sysprep cleanup external providers; hr = 0x80070002[gle=0x00000006]
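When the image state stays stuck, setuperr.log is usually the quickest way to see which Sysprep provider failed. Below is a minimal sketch that pulls the failing provider and its error code out of log lines shaped like the ones quoted above. The function name `Get-SysprepFailedProvider` is hypothetical, and the pattern is based only on the `LaunchDll` entry shown in this comment, so other failure types may need a different pattern.

```powershell
# Hypothetical helper (not part of Packer or Windows): extracts the failing
# provider and error code from setuperr.log lines.
function Get-SysprepFailedProvider {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line
    )
    process {
        # Matches entries such as:
        #   ... SYSPRP LaunchDll:Failure occurred while executing
        #   'DscCore.dll,SysPrep_Cleanup', returned error code 0x2
        $pattern = "Failure occurred while executing '([^']+)', returned error code (0x[0-9A-Fa-f]+)"
        if ($Line -match $pattern) {
            [pscustomobject]@{
                Provider  = $Matches[1]
                ErrorCode = $Matches[2]
            }
        }
    }
}

# Usage on the build VM (log path as quoted in this comment):
# Get-Content "$env:SystemRoot\System32\Sysprep\Panther\setuperr.log" | Get-SysprepFailedProvider
```

Taking lines from the pipeline rather than a file path keeps the function usable against a collection of saved build logs as well as on the VM itself.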

aledeniz commented 4 years ago

@davemedvitz my last provisioner is currently looking like this (I haven't yet looked into the removal of the Defender Windows feature ..):

{
    "type": "powershell",
    "inline": [
        "if (((Get-Item -LiteralPath 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -ErrorAction SilentlyContinue).PSObject.Properties -ne $null) -and ((Get-ItemProperty 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration').PSObject.Properties.Name -contains 'AgentId')) {Set-ItemProperty -path 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -Name 'AgentId' -value ''}",
        "if((Get-Service | ? Name -eq RdAgent).count -gt 0) {while ((Get-Service RdAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureTelemetryService).count -gt 0) {while ((Get-Service WindowsAzureTelemetryService).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureGuestAgent).count -gt 0) {while ((Get-Service WindowsAzureGuestAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if( Test-Path $Env:SystemRoot\\System32\\Sysprep\\unattend.xml ){ rm $Env:SystemRoot\\System32\\Sysprep\\unattend.xml -Force}",
        "& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm",
        "$attempts=0; while($true) { if ($attempts -gt 10) {break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break } }"            ]
}

caveat: this is a work in progress!

E.g. note that if the last while loop breaks because $attempts is greater than 10, the resulting image is not generalised. I am wondering whether to fail the build or attempt to rearm after a reboot.
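One way to handle the caveat above is to fail the provisioner when the image never reaches IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE, so that Packer aborts instead of capturing a non-generalised image. A minimal sketch follows. The function name `Wait-ImageGeneralized` and its parameters are hypothetical; the registry path and state names are the ones used in the provisioner above. Unlike the inline version, this sketch counts every check (not only the ones that return IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE), and the default limits are placeholders to tune. The state reader is passed in as a script block so the loop can be exercised without touching the registry.

```powershell
# Hypothetical helper: polls the image state and throws if the image is
# never generalised within the allowed number of checks.
function Wait-ImageGeneralized {
    param(
        # Script block returning the current ImageState string.
        [scriptblock]$GetState = {
            (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Setup\State').ImageState
        },
        [int]$MaxAttempts = 10,   # placeholder value, tune for your build
        [int]$DelaySeconds = 10   # placeholder value, tune for your build
    )

    $target = 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE'
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $state = & $GetState
        Write-Output $state          # keep the state visible in the Packer log
        if ($state -eq $target) { return }
        Start-Sleep -Seconds $DelaySeconds
    }
    throw "Image state never reached $target after $MaxAttempts checks."
}

# Usage in a provisioner script, after Sysprep.exe has been started:
# try { Wait-ImageGeneralized } catch { Write-Output $_.Exception.Message; exit 1 }
```

The explicit `exit 1` relies on the Packer PowerShell provisioner treating a non-zero exit code as a failure, which is its default unless `valid_exit_codes` is changed.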

For the Hashibot masters: my view is that the above deserves its own Packer provisioner. It's nonsense that thousands of Packer users must figure out among themselves how to do that again and again. It is by far the flakiest step of the Packer Windows builds. Microsoft is your partner, sit down or have a call with them, and get us a provisioner to generalize our images.

aledeniz commented 4 years ago

@davemedvitz I am working to create an image by close of play today, and the above has worked well enough for me to complete my specific task (well, at least I hope so, I haven't yet tested the image :) ). The problem is that in the last generalised image, I didn't have the Microsoft Monitoring Agent extension pushed by some subscription policy (in my scenario, when the build reports 7 applications not managed by choco it has always generalised correctly, while I have mixed results when the applications are 8, e.g. when the Microsoft Monitoring Agent has been deployed by an extension executed by a policy).

The script above does nicely report IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE, which can be useful in automation scenarios (for the Hashibot masters: it's still improvable, but it is better than the one you published in your documentation; and you still owe us a provisioner to take care of the generalisation):

    azure-arm: 7 applications not managed with Chocolatey.
    azure-arm:
    azure-arm: Did you know Pro / Business automatically syncs with Programs and
    azure-arm:  Features? Learn more about Package Synchronizer at
    azure-arm:  https://chocolatey.org/compare
==> azure-arm: Restarting Machine
==> azure-arm: Waiting for machine to restart...
==> azure-arm: A system shutdown is in progress.(1115)
==> azure-arm: #< CLIXML
    azure-arm: pkrvm9mpv3q8wus restarted.
==> azure-arm: <Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04"><Obj S="progress" RefId="0"><TN RefId="0"><T>System.Management.Automation.PSCustomObject</T><T>System.Object</T></TN><MS><I64 N="SourceId">1</I64><PR N="Record"><AV>Preparing modules for first use.</AV><AI>0</AI><Nil /><PI>-1</PI><PC>-1</PC><T>Completed</T><SR>-1</SR><SD> </SD></PR></MS></Obj></Objs>
==> azure-arm: Machine successfully restarted, moving on
==> azure-arm: Pausing 1m0s before the next provisioner...
==> azure-arm: Provisioning with Powershell...
==> azure-arm: Provisioning with powershell script: C:\Users\rioloa\AppData\Local\Temp\powershell-provisioner727123571
    azure-arm: IMAGE_STATE_COMPLETE
    azure-arm: IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE

When I next look at this, I will try to re-arm after uninstalling Defender, without a reboot, something like this (perhaps better to target $env:ProgramData for the rearm.do file, and yes, it does warrant its own script):

{
    "type": "powershell",
    "inline": [
        "if (((Get-Item -LiteralPath 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -ErrorAction SilentlyContinue).PSObject.Properties -ne $null) -and ((Get-ItemProperty 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration').PSObject.Properties.Name -contains 'AgentId')) {Set-ItemProperty -path 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -Name 'AgentId' -value ''}",
        "if((Get-Service | ? Name -eq RdAgent).count -gt 0) {while ((Get-Service RdAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureTelemetryService).count -gt 0) {while ((Get-Service WindowsAzureTelemetryService).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureGuestAgent).count -gt 0) {while ((Get-Service WindowsAzureGuestAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if( Test-Path $Env:SystemRoot\\System32\\Sysprep\\unattend.xml ){ rm $Env:SystemRoot\\System32\\Sysprep\\unattend.xml -Force}",
        "& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm",
        "$attempts=0; while($true) { if ($attempts -gt 10) {Uninstall-WindowsFeature -Name Windows-Defender;New-Item -Path 'C:\\ProgramData' -Name rearm.do -ItemType File -Force; break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break } }",
        "if(Test-Path 'C:\\ProgramData\\rearm.do' -PathType Leaf){& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm}",
        "if(Test-Path 'C:\\ProgramData\\rearm.do' -PathType Leaf){$attempts=0; while($true) { if ($attempts -gt 10) {break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break }}}"
    ],
    "pause_before": "1m"
}

Failing the above, then I will try to figure out how to move the second re-arm after a shutdown (I used a file as flag for that), but I will have to make the reboot and the provisioner conditional (e.g. if the first generalisation worked fine).

Please note that AFAIK an image can be rearmed only thrice (and I suspect Microsoft may have already fired the first shot, as I am using one of their marketplace images as baseline).

aledeniz commented 4 years ago

@davemedvitz I had configured packer to perform its builds on a specific resource group (I gave it contributor rights only to 2 resource groups, the one for the builds and the one for the images).

I have now also updated the policy, excluding the resource group for the builds. The policy name is something like "Enable Monitoring in Azure Security Center".

aledeniz commented 4 years ago

@davemedvitz in the end, it seems that excluding the resource group from the policy assignment did the deed. To recap: I have 2 resource groups in a subscription where Packer is authorised at contributor level, let's call them Build and Image. Packer builds in the Build resource group, and then stores the images in the Image resource group (from there, I then add them to a shared image gallery, but that is not significant for the issue at hand). I have excluded the Build resource group from the following policies:

"Deploy prerequisites to enable Guest Configuration Policy on Windows VMs."
"$SubscriptionName: Enable Monitoring in Azure Security Center."

This seems to have solved the issue in my scenario (so it was likely some sort of interaction between Windows Defender and the Monitoring Agent).
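The exclusion described above can also be scripted rather than done in the portal. The following is a rough sketch using the Az PowerShell module (Az.Resources), assuming the assignment is made at subscription scope. Every name and ID in it is a placeholder rather than a value from this thread, and parameter and property names may differ between module versions, so check them against the version you have installed.

```powershell
# Sketch only: requires the Az.Resources module and an authenticated
# session (Connect-AzAccount). All values below are placeholders.
$subscriptionId = '<subscription-id>'
$buildGroupName = 'Build'                      # resource group where Packer builds the temporary VM
$assignmentName = '<policy-assignment-name>'   # assignment to exclude the build group from

$buildGroup = Get-AzResourceGroup -Name $buildGroupName

# Assumption: -NotScope sets the whole exclusion list for the assignment,
# so add any exclusions that already exist to this array as well.
Set-AzPolicyAssignment `
    -Name $assignmentName `
    -Scope "/subscriptions/$subscriptionId" `
    -NotScope @($buildGroup.ResourceId)
```

Repeat the call for each assignment that pushes extensions to new VMs, then confirm in the activity log that no 'deployIfNotExists' actions fire against the build resource group during the next build.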

I stand by my previous statement to the Hashibot masters, that my generalising script, albeit improvable, is better than the one they show in their documentation (let alone the Microsoft documentation, which states that the rearming may end in an infinite loop; apparently how to write while($true) exit conditions is a lost art these days). Here is a version with a hardcoded C:\ProgramData for the flag file (this can be done in a million better ways ..):

{
    "type": "powershell",
    "inline": [
        "if (((Get-Item -LiteralPath 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -ErrorAction SilentlyContinue).PSObject.Properties -ne $null) -and ((Get-ItemProperty 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration').PSObject.Properties.Name -contains 'AgentId')) {Set-ItemProperty -path 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -Name 'AgentId' -value ''}",
        "if((Get-Service | ? Name -eq RdAgent).count -gt 0) {while ((Get-Service RdAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureTelemetryService).count -gt 0) {while ((Get-Service WindowsAzureTelemetryService).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureGuestAgent).count -gt 0) {while ((Get-Service WindowsAzureGuestAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if( Test-Path $Env:SystemRoot\\System32\\Sysprep\\unattend.xml ){ rm $Env:SystemRoot\\System32\\Sysprep\\unattend.xml -Force}",
        "& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm",
        "$attempts=0; while($true) { if ($attempts -gt 10) {Uninstall-WindowsFeature -Name Windows-Defender;New-Item -Path 'C:\\ProgramData' -Name rearm.do -ItemType File -Force; break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break } }",
        "if(Test-Path 'C:\\ProgramData\\rearm.do' -PathType Leaf){& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm}",
        "if(Test-Path 'C:\\ProgramData\\rearm.do' -PathType Leaf){$attempts=0; while($true) { if ($attempts -gt 10) {break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break }}}"
    ],
    "pause_before": "1m"
}

And this is an untested (not for long) version for Windows 10:

{
    "type": "powershell",
    "inline": [
        "if (((Get-Item -LiteralPath 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -ErrorAction SilentlyContinue).PSObject.Properties -ne $null) -and ((Get-ItemProperty 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration').PSObject.Properties.Name -contains 'AgentId')) {Set-ItemProperty -path 'HKLM:\\SOFTWARE\\Microsoft\\DesiredStateConfiguration' -Name 'AgentId' -value ''}",
        "if((Get-Service | ? Name -eq RdAgent).count -gt 0) {while ((Get-Service RdAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureTelemetryService).count -gt 0) {while ((Get-Service WindowsAzureTelemetryService).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if((Get-Service | ? Name -eq WindowsAzureGuestAgent).count -gt 0) {while ((Get-Service WindowsAzureGuestAgent).Status -ne 'Running') { Start-Sleep -s 5 }}",
        "if( Test-Path $Env:SystemRoot\\System32\\Sysprep\\unattend.xml ){ rm $Env:SystemRoot\\System32\\Sysprep\\unattend.xml -Force}",
        "& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm",
        "$attempts=0; while($true) { if ($attempts -gt 10) {Disable-WindowsOptionalFeature -Online -NoRestart -FeatureName Windows-Defender-ApplicationGuard;Disable-WindowsOptionalFeature -Online -NoRestart -FeatureName Windows-Defender-Default-Definitions;New-Item -Path 'C:\\Install' -Name rearm.do -ItemType File -Force; break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break } }",
        "if(Test-Path 'C:\\Install\\rearm.do' -PathType Leaf){& $env:SystemRoot\\System32\\Sysprep\\Sysprep.exe /oobe /generalize /quiet /quit /mode:vm}",
        "if(Test-Path 'C:\\Install\\rearm.do' -PathType Leaf){$attempts=0; while($true) { if ($attempts -gt 10) {break}; $imageState = Get-ItemProperty HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Setup\\State | Select ImageState; if($imageState.ImageState -ne 'IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE') { Write-Output $imageState.ImageState; if($imageState.ImageState -eq 'IMAGE_STATE_SPECIALIZE_RESEAL_TO_OOBE'){$attempts++}; Start-Sleep -s 10  } else { Write-Output $imageState.ImageState; break }}}"
    ],
    "pause_before": "1m"
}

Last but not least, the above provisioners should really be in a script, not inline (and ideally, tested). And again, even better, we should have a specialised provisioner provided by HashiCorp & Microsoft for the generalisation of Windows and Linux.

Darleev commented 4 years ago

@aledeniz @davemedvitz thank you for the meaningful reply and the investigation details. Unfortunately, we are not able to reproduce the issue on our side. Everything works fine in our internal scripts. We use the following script to generate the image on the Azure side. Below is an example of how to start the script:

GenerateResourcesAndImage -SubscriptionId ******* -ResourceGroupName "ResourceGroupTest" -ImageGenerationRepositoryRoot "C:\RepositoryRoot" -ImageType Windows2019 -AzureLocation "East US" -GithubFeedToken *****************

Please note that you should have the owner or contributor role on the subscription to run the script without any issues. Could you please try using it for image generation?

Also, we don't manage Packer provisioners; we use them as users, so unfortunately we can't help diagnose this issue any deeper.

Darleev commented 4 years ago

@aledeniz @davemedvitz I'm going to close the issue until further clarifications from your side. In case of any questions, feel free to contact us.

davemedvitz commented 4 years ago

I recognize this was closed, and I apologize for not providing this update earlier; I had a priority task come up.

We did, partially, identify this issue. We had begun applying security policies via Azure Policy, in particular the ASC Default and the NIST 800-53 R4 initiatives. These were being applied to the Windows machine during the build. Once we excluded these initiatives from the resource group where the build was being performed, we no longer had this issue.

I appreciate the time that was spent on this issue, and hope this info can be of use to others.

-Dave

Darleev commented 4 years ago

@davemedvitz Hello, Thank you for the provided solution and the investigation details. In case of any questions, feel free to contact us, we will be glad to assist you.

benvbr commented 4 years ago

I stumbled upon this issue as well. Removing the following policies (or configuring an exclusion for them), as mentioned by @aledeniz, seems to resolve the issue:

"Deploy prerequisites to enable Guest Configuration Policy on Windows VMs."
"$SubscriptionName: Enable Monitoring in Azure Security Center."