Azure / sap-automation

This is the repository supporting the SAP deployment automation framework on Azure
MIT License

[BUG] Control plane deployment fails: polling after CreateOrUpdate: context deadline exceeded #609

Closed lpalovsky closed 3 weeks ago

lpalovsky commented 1 month ago

Describe the bug

A new deployment of the control plane fails with the error: polling after CreateOrUpdate: context deadline exceeded

Full message:

{"@level":"error","@message":"Error: creating/updating Extension (Subscription: \"<redacted>\"\nResource Group Name: \"<redacted>\"\nVirtual Machine Name: \"LAB-SECE-DEP05_labsecedep05deploy00\"\nExtension Name: \"configure_deployer\"): polling after CreateOrUpdate: context deadline exceeded","@module":"terraform.ui","@timestamp":"2024-07-31T07:10:21.327808Z","diagnostic":{"severity":"error","summary":"creating/updating Extension (Subscription: \"<redacted>\"\nResource Group Name: \"<redacted>\"\nVirtual Machine Name: \"LAB-SECE-DEP05_labsecedep05deploy00\"\nExtension Name: \"configure_deployer\"): polling after CreateOrUpdate: context deadline exceeded","detail":"","address":"module.sap_deployer.azurerm_virtual_machine_extension.configure[0]","range":{"filename":"../../terraform-units/modules/sap_deployer/vm-deployer.tf","start":{"line":218,"column":58,"byte":14456},"end":{"line":218,"column":59,"byte":14457}},"snippet":{"context":"resource \"azurerm_virtual_machine_extension\" \"configure\"","code":"resource \"azurerm_virtual_machine_extension\" \"configure\" {","start_line":218,"highlight_start_offset":57,"highlight_end_offset":58,"values":[]}},"type":"diagnostic"}

The error above occurs somewhere in the middle of the deployment, but the process continues and eventually fails with:

{"@level":"error","@message":"Error: A resource with the ID \"/subscriptions/<redacted>/resourceGroups/<redacted>/providers/Microsoft.Compute/virtualMachines/LAB-SECE-DEP05_labsecedep05deploy00/extensions/configure_deployer\" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for \"azurerm_virtual_machine_extension\" for more information.","@module":"terraform.ui","@timestamp":"2024-07-31T07:41:11.348134Z",
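The second error says the extension now exists in Azure but not in the Terraform state, so it has to be imported before Terraform can manage it. A minimal sketch of how that import command could be assembled is shown here. The resource address and the resource ID layout are copied from the two error messages above; the function name `extension_import_cmd` and its argument order are hypothetical. The function only prints the command, because `terraform import` has to be run from the same working directory, backend, and variables that the deployment scripts use, which this thread does not show.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) a terraform import command for the VM extension
# that Azure reports as already existing.
# extension_import_cmd is a hypothetical helper, not part of SDAF.

extension_import_cmd() {
  local subscription="$1" resource_group="$2" vm_name="$3" extension_name="$4"
  # Resource address as reported in the "address" field of the first error.
  local address='module.sap_deployer.azurerm_virtual_machine_extension.configure[0]'
  # Resource ID layout as reported in the "already exists" error.
  local id="/subscriptions/${subscription}/resourceGroups/${resource_group}/providers/Microsoft.Compute/virtualMachines/${vm_name}/extensions/${extension_name}"
  # Single quotes in the output keep the shell from interpreting "[0]".
  printf "terraform import '%s' '%s'\n" "$address" "$id"
}

# Placeholder values; review the printed command before running it.
extension_import_cmd "<subscription-id>" "<resource-group>" \
  "LAB-SECE-DEP05_labsecedep05deploy00" "configure_deployer"
```

An alternative, if the half-created extension is not worth keeping, is to remove it with `az vm extension delete` and let the next run recreate it. Whether that is safe here depends on what the extension script had already done on the VM.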

Setup of the local machine:

OS: SLES ver. 15SP3

az cli:

az --version
azure-cli                         2.38.2 *
core                              2.38.2 *
telemetry                          1.0.6 *
Dependencies:
msal                            1.18.0b1
azure-mgmt-resource             21.1.0b1

terraform:

azureadm@localhost:~> terraform -v
Terraform v1.9.3
on linux_amd64

SDAF version: master branch

To reproduce

Steps to reproduce the behavior:

  1. General preparation according to the official guide
  2. Executed:
    ${SAP_AUTOMATION_REPO_PATH}/deploy/scripts/deploy_controlplane.sh --deployer_parameter_file "${deployer_parameter_file}" --library_parameter_file "${library_parameter_file}" --subscription "${ARM_SUBSCRIPTION_ID}" --spn_id "${ARM_CLIENT_ID}" --spn_secret "${ARM_CLIENT_SECRET}" --tenant_id "${ARM_TENANT_ID}"  --auto-approve
  3. See error
lpalovsky commented 1 month ago

I just realized that after the failure, if I open the VM overview in the Azure web UI, it says: 'Virtual machine agent status not ready'


devanshjainms commented 1 month ago

Hi @lpalovsky , can you try rebooting the machine? And then also rerun the Control Plane Deployment pipeline?

lpalovsky commented 1 month ago

> Hi @lpalovsky , can you try rebooting the machine? And then also rerun the Control Plane Deployment pipeline?

Hello @devanshjainms. Thanks for the reply. I tried to redeploy everything from scratch again today, but this time I used SDAF release v3.11.0.3. (We agreed with @hdamecharla to use this as a stable baseline.)

I ran into the same situation: the deployment ran for about an hour and then failed with the same error as above, and the yellow banner appeared in the web UI. After a VM restart the banner disappeared, but the deployment script still fails with a similar timeout error:

{"@level":"error","@message":"Error: updating Linux Virtual Machine (Subscription: \"<REDACTED>\"\nResource Group Name: \"<REDACTED>\"\nVirtual Machine Name: \"LAB-SECE-DEP05_labsecedep05deploy00\"): polling after Update: context deadline exceeded"

Thanks a lot for your help.

Lumir

lpalovsky commented 1 month ago

Hello, just an update with an observation I made: at the beginning of the deployment the VM is fine, but after the Library is deployed, the waagent service goes into the 'dead' state. It can be started again normally, but the process still seems to hang.

● waagent.service - Azure Linux Agent
     Loaded: loaded (/usr/lib/systemd/system/waagent.service; enabled; vendor preset: disabled)
     Active: inactive (dead) since Thu 2024-08-08 08:01:19 UTC; 35min ago
  Condition: start condition failed at Thu 2024-08-08 08:01:19 UTC; 35min ago
   Main PID: 2837 (code=exited, status=0/SUCCESS)

Aug 08 07:56:37 labsecedep05deploy00 sudo[4229]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 08 07:58:00 labsecedep05deploy00 sudo[4368]:     root : PWD=/var/lib/waagent/custom-script/download/0 ; USER=root ; COMMAND=…b-release
Aug 08 07:58:00 labsecedep05deploy00 sudo[4368]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 08 07:58:27 labsecedep05deploy00 sudo[4936]:     root : PWD=/var/lib/waagent/custom-script/download/0 ; USER=root ; COMMAND=…ive patch
Aug 08 07:58:27 labsecedep05deploy00 sudo[4936]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 08 08:01:19 labsecedep05deploy00 systemd[1]: Stopping Azure Linux Agent...
Aug 08 08:01:19 labsecedep05deploy00 python3[2837]: 2024-08-08T08:01:19.795971Z INFO Daemon Daemon Agent WALinuxAgent-2.8.0.11 fo…2.8.0.11
Aug 08 08:01:19 labsecedep05deploy00 systemd[1]: waagent.service: Succeeded.
Aug 08 08:01:19 labsecedep05deploy00 systemd[1]: Stopped Azure Linux Agent.
Aug 08 08:01:19 labsecedep05deploy00 systemd[1]: Condition check resulted in Azure Linux Agent being skipped.
Hint: Some lines were ellipsized, use -l to show in full.
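Since the agent stops partway through the deployment, it may help to confirm it is running again before rerunning anything. The following is a rough sketch of a generic wait loop, assuming a systemd-based deployer VM as shown in the output above. `wait_until` is a hypothetical helper name; the timeout and interval values in the usage comment are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: retry a check command until it succeeds or a timeout is reached.
# wait_until is a hypothetical helper, not part of SDAF or the Azure agent.

wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  # Run the remaining arguments as the check command.
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # gave up: the check never succeeded within the timeout
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# On the deployer VM, for example:
#   sudo systemctl start waagent
#   wait_until 300 10 systemctl is-active --quiet waagent || echo "waagent still not active"
```

Note that an active service only shows the agent process is up. The 'Virtual machine agent status not ready' banner reflects what Azure sees, so the portal status is still worth checking separately.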

As @KimForss pointed out on today's call, it might be an infra glitch, so I will try to deploy in a different region. I will come back with an update.

It looks like I should be able to finish the deployer VM configuration using the configure_deployer.sh script.

lpalovsky commented 3 weeks ago

Hello. I am closing the issue since I no longer experience it. I have since deployed at least 2 environments without problems. It might have been a temporary infra issue.

devanshjainms commented 3 weeks ago

Thanks for the update, @lpalovsky.