SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds
GNU General Public License v3.0

Pre-deployment of HANA scale-out cluster with no stand-by nodes fails with Error "ParentResourceNotFound" #913

Open abravosuse opened 9 months ago

abravosuse commented 9 months ago

Used cloud platform: Azure

Used SLES4SAP version: SLES15SP4

Used client machine OS: openSUSE Leap 15.2

Expected behaviour vs observed behaviour
Expected behavior: deployment of a HANA scale-out cluster (with no standby nodes).
Observed behavior: the deployment fails.

How to reproduce (the consolidated commands are shown after the list):

  1. Switch to the azure folder
  2. Create the terraform.tfvars file based on terraform.tfvars.example (content pasted below)
  3. Set up the Azure account
  4. Initialize Terraform: terraform init
  5. Create and switch to the terraform workspace hsonsb: terraform workspace new hsonsb
  6. Execute the deployment: terraform apply -auto-approve
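
For reference, the same steps as a single shell session (a minimal sketch: az login stands in for whatever Azure authentication you normally use, and the commands assume you start from the repository root):

# reproduce the failing deployment
cd azure
cp terraform.tfvars.example terraform.tfvars   # then edit it with the values shown below
az login                                       # or any other way of setting up the Azure credentials
terraform init
terraform workspace new hsonsb
terraform apply -auto-approve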

Used terraform.tfvars

resource_group_name = "<my_rg>"
vnet_address_range = "10.130.0.0/16"
subnet_address_range = "10.130.1.0/24"
admin_user = "cloudadmin"
reg_code = "<my_internal_code>"
reg_email = "alberto.bravo@suse.com"
os_image = "SUSE:sles-sap-15-sp4-byos:gen2:latest"
public_key  = "~/.ssh/id_rsa_cloud.pub"
private_key = "~/.ssh/id_rsa_cloud"
cluster_ssh_pub = "salt://sshkeys/cluster.id_rsa.pub"
cluster_ssh_key = "salt://sshkeys/cluster.id_rsa"
ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:v9/"
provisioning_log_level = "debug"
pre_deployment = true
cleanup_secrets = true
bastion_enabled = false
hana_name = "vmhsonsb"
hana_count = "4"
hana_scale_out_enabled = true
hana_scale_out_standby_count = 0
hana_scale_out_shared_storage_type = "anf"
anf_pool_size                      = "15"
anf_pool_service_level             = "Ultra"
hana_scale_out_anf_quota_shared    = "2000"
storage_account_name = "<my_storage_account_name>"
storage_account_key = "<my_storage_account_key>"
hana_inst_master = "//<my_storage_account_name>.file.core.windows.net/hana/51055267"
hana_ha_enabled = true
hana_ips = ["10.130.1.11", "10.130.1.12", "10.130.1.13", "10.130.1.14"]
hana_cluster_vip = "10.130.1.15"
hana_sid = "SC1"
hana_instance_number = "30"
hana_master_password = "<my_password>"
hana_primary_site = "NBG"
hana_secondary_site = "WDF"
hana_cluster_fencing_mechanism = "sbd"
iscsi_name = "vmiscsihsonsb"
iscsi_srv_ip = "10.130.1.4"
hana_data_disks_configuration = {
disks_type       = "Premium_LRS,Premium_LRS,Premium_LRS,Premium_LRS,Premium_LRS"
disks_size       = "64,64,64,64,32,64"
caching          = "ReadOnly,ReadOnly,ReadOnly,ReadOnly,None"
writeaccelerator = "false,false,false,false,false"
luns             = "0,1#2,3#4#5"
names            = "data#log#usrsap#backup"
lv_sizes         = "100#100#30#60"
paths            = "/hana/data#/hana/log#/usr/sap#/hana/backup"
}

Logs

Full log files salt-os-setup.log, salt-predeployment.log and salt-result.log will be delivered via PM if needed. The deployment ends with the following messages:

Error: creating Volume: (Name "vmhsonsb-netapp-volume-shared-2" / Capacity Pool Name "netapp-pool-hsonsb" / Net App Account Name "netapp-acc-hsonsb" / Resource Group "<my_rg>"): netapp.VolumesClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ParentResourceNotFound" Message="Failed to perform 'write' on resource(s) of type 'netAppAccounts/capacityPools/volumes', because the parent resource '/subscriptions/<subscription_id>/resourceGroups/<my_rg>/providers/Microsoft.NetApp/netAppAccounts/netapp-acc-hsonsb/capacityPools/netapp-pool-hsonsb' could not be found."

   with module.hana_node.azurerm_netapp_volume.hana-netapp-volume-shared[1],
   on modules/hana_node/main.tf line 339, in resource "azurerm_netapp_volume" "hana-netapp-volume-shared":
  339: resource "azurerm_netapp_volume" "hana-netapp-volume-shared" {

Error: creating Volume: (Name "vmhsonsb-netapp-volume-shared-1" / Capacity Pool Name "netapp-pool-hsonsb" / Net App Account Name "netapp-acc-hsonsb" / Resource Group "<my_rg>"): netapp.VolumesClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ParentResourceNotFound" Message="Failed to perform 'write' on resource(s) of type 'netAppAccounts/capacityPools/volumes', because the parent resource '/subscriptions/<subscription_id>/resourceGroups/<my_rg>/providers/Microsoft.NetApp/netAppAccounts/netapp-acc-hsonsb/capacityPools/netapp-pool-hsonsb' could not be found."

   with module.hana_node.azurerm_netapp_volume.hana-netapp-volume-shared[0],
   on modules/hana_node/main.tf line 339, in resource "azurerm_netapp_volume" "hana-netapp-volume-shared":
  339: resource "azurerm_netapp_volume" "hana-netapp-volume-shared" {

 Error: remote-exec provisioner error

   with module.hana_node.module.hana_majority_maker.module.majority_maker_provision.null_resource.provision[0],
   on ../generic_modules/salt_provisioner/main.tf line 78, in resource "null_resource" "provision":
   78:   provisioner "remote-exec" {

 error executing "/tmp/terraform_1129644356.sh": Process exited with status 1
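
One hedged way to check whether the capacity pool actually exists at the time the volumes are created is the Azure CLI (the account and pool names below are taken from the error messages above; replace <my_rg> with the real resource group):

# verify the parent resources the failing volumes point to
az netappfiles account show --resource-group <my_rg> --name netapp-acc-hsonsb
az netappfiles pool list --resource-group <my_rg> --account-name netapp-acc-hsonsb --output table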
yeoldegrove commented 9 months ago

@abravosuse From what I have experienced in the past, the creation of the NetApp resources can take time and is prone to timing race conditions... I could see the same behavior that you see: the NetApp volumes are being created even though the NetApp pool does not exist yet, because the modules only receive locally computed names and Terraform therefore does not know it has to wait for the pool.

Could you try the following patch? It passes the actual outputs/names of the resources (after they are created) to the HANA/Netweaver modules, which gives Terraform an implicit dependency on the account and pool.

--- main.tf 2023-11-17 11:18:37.678249072 +0100
+++ main.tf.new 2023-11-17 11:18:25.115136635 +0100
@@ -219,9 +219,9 @@
   virtual_host_ips            = local.netweaver_virtual_ips
   iscsi_srv_ip                = join("", module.iscsi_server.iscsi_ip)
   # ANF specific
-  anf_account_name           = local.anf_account_name
-  anf_pool_name              = local.anf_pool_name
-  anf_pool_service_level     = var.anf_pool_service_level
+  anf_account_name           = azurerm_netapp_account.mynetapp-acc.0.name
+  anf_pool_name              = azurerm_netapp_pool.mynetapp-pool.0.name
+  anf_pool_service_level     = azurerm_netapp_pool.mynetapp-pool.0.service_level
   netweaver_anf_quota_sapmnt = var.netweaver_anf_quota_sapmnt
   # only used by azure fence agent (native fencing)
   subscription_id           = data.azurerm_subscription.current.subscription_id
@@ -255,9 +255,9 @@
   os_image                      = local.hana_os_image
   iscsi_srv_ip                  = join("", module.iscsi_server.iscsi_ip)
   # ANF specific
-  anf_account_name                = local.anf_account_name
-  anf_pool_name                   = local.anf_pool_name
-  anf_pool_service_level          = var.anf_pool_service_level
+  anf_account_name                = azurerm_netapp_account.mynetapp-acc.0.name
+  anf_pool_name                   = azurerm_netapp_pool.mynetapp-pool.0.name
+  anf_pool_service_level          = azurerm_netapp_pool.mynetapp-pool.0.service_level
   hana_scale_out_anf_quota_data   = var.hana_scale_out_anf_quota_data
   hana_scale_out_anf_quota_log    = var.hana_scale_out_anf_quota_log
   hana_scale_out_anf_quota_backup = var.hana_scale_out_anf_quota_backup
abravosuse commented 9 months ago

Thank you @yeoldegrove! Just to be on the safe side: applying the patch consists of updating lines 219-221 and 255-257 in the file azure/main.tf as indicated above, correct?

yeoldegrove commented 9 months ago

@abravosuse Yeah, just delete the lines marked with - and add the ones marked with +. The lines should be unique, though. Another way would be putting the patch in a file and simply running patch < patch1.patch, as shown below.
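
For example (a minimal sketch: patch1.patch is a hypothetical file name for the diff above, and the commands are run from the azure/ directory so that the main.tf named in the patch header is found):

cd azure
# save the diff from the previous comment as patch1.patch, then:
patch < patch1.patch   # applies the changes to main.tf
terraform plan         # review the plan before re-applying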

abravosuse commented 7 months ago

I have followed your suggestion above, @yeoldegrove, and the deployment now fails with the following errors:

│ Error: remote-exec provisioner error
│
│   with module.hana_node.module.hana_provision.null_resource.provision[2],
│   on ../generic_modules/salt_provisioner/main.tf line 78, in resource "null_resource" "provision":
│   78:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_709979505.sh": Process exited with status 1
╵
╷
│ Error: remote-exec provisioner error
│
│   with module.hana_node.module.hana_provision.null_resource.provision[1],
│   on ../generic_modules/salt_provisioner/main.tf line 78, in resource "null_resource" "provision":
│   78:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_1308350619.sh": Process exited with status
│ 1
╵
╷
│ Error: remote-exec provisioner error
│
│   with module.hana_node.module.hana_provision.null_resource.provision[0],
│   on ../generic_modules/salt_provisioner/main.tf line 78, in resource "null_resource" "provision":
│   78:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_2089967319.sh": Process exited with status
│ 1

These are errors from the Salt provisioner, so I guess I could get more details about them on the individual hosts. But which ones?

Thank you!

yeoldegrove commented 7 months ago

@abravosuse I would need the /var/log/salt-* files from hana01, hana02 and hana03. Or you could give me access to the hosts ;)
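
One way to collect them (a hedged sketch: cloudadmin and ~/.ssh/id_rsa_cloud come from the terraform.tfvars above, hana01/hana02/hana03 stand for the actual node names or IPs, and passwordless sudo for the admin user is assumed):

# gather the Salt logs from each HANA node into the current directory
for host in hana01 hana02 hana03; do
  ssh -i ~/.ssh/id_rsa_cloud cloudadmin@"$host" 'sudo tar czf /tmp/salt-logs.tgz /var/log/salt-*'
  scp -i ~/.ssh/id_rsa_cloud cloudadmin@"$host":/tmp/salt-logs.tgz "salt-logs-$host.tgz"
done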

abravosuse commented 7 months ago

@yeoldegrove please send me your public SSH key and I will grant you access to the HANA hosts...
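
If it helps, granting that access could look like this (a minimal sketch: yeoldegrove_id_rsa.pub is a hypothetical file holding the provided public key, and the node names are placeholders as above):

# append the provided public key to the admin user's authorized_keys on each HANA node
for host in hana01 hana02 hana03; do
  ssh -i ~/.ssh/id_rsa_cloud cloudadmin@"$host" 'cat >> ~/.ssh/authorized_keys' < yeoldegrove_id_rsa.pub
done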