SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds

SAP HANA 2 NODE + HA Cluster DEPLOY does not work on AWS #832

Closed. picoroma closed this issue 2 years ago.

picoroma commented 2 years ago

Used cloud platform: AWS

Used SLES4SAP version: SLES15SP2

Used client machine OS: Windows 10

Expected behaviour vs observed behaviour: I have successfully deployed a single-node HANA system, with or without the monitoring option. When I add a second HANA node (hana_count = "2"), the installation does not finish: HANA is installed neither on node 1 nor on node 2. If I also add the parameter hana_ha_enabled = true, the HANA installation works fine, but the HA cluster is not installed correctly.

My questions are: which options are mandatory to install two or more HANA nodes without an HA cluster? And which parameters are mandatory for a HANA multi-node deployment with an HA cluster?

Thanks.

yeoldegrove commented 2 years ago

@picoroma How exactly did you proceed? Did you first build up a single-node system (hana_count = 1), then change the parameter to 2 and run terraform apply a second time? --> This is not going to work that easily. Or did you run a clean new build-up with hana_count = 2 and hana_ha_enabled = true? --> These are the two parameters that control whether the deployment is HA (>1 and true).
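For reference, a minimal terraform.tfvars sketch of such a clean HA build-up; only hana_count and hana_ha_enabled are taken from this thread, and every other setting your tfvars needs (region, image, instance type, ...) is omitted here:

```hcl
# Sketch of the relevant terraform.tfvars lines for a clean HA deployment.
hana_count      = 2      # more than one node is the first prerequisite for HA
hana_ha_enabled = true   # deploy the HA cluster on top of HANA system replication
```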

picoroma commented 2 years ago

@yeoldegrove: I usually destroy a deployment before running a new one. So the sequence I usually run is: terraform plan, terraform apply, terraform destroy, and then a new terraform apply (sometimes without a new plan).

When I used hana_count = 2 and hana_ha_enabled = true, the two HANA nodes were installed correctly (with HANA system replication), but I got an error on the OS cluster: the SUSE cluster was not deployed correctly. Do I need to set some other parameter on AWS? I have performed another deployment; the errors seem to be related to the monitoring of the HANA cluster. I attach the latest part of the deployment output with the error: SUSE-HA-DEPLOY-ERROR.txt

yeoldegrove commented 2 years ago

@picoroma Thanks for the log... In the meantime I was able to reproduce your error.

At the moment I suspect the issue to be related to the change of instance type in https://github.com/SUSE/ha-sap-terraform-deployments/pull/822.

old instance type (xen)

ip-10-0-0-5:~ # python3 -c "from crmsh import utils; print(utils.detect_cloud());"
amazon-web-services

new instance type (nitro/kvm)

ip-10-0-0-5:~ # python3 -c "from crmsh import utils; print(utils.detect_cloud());"
None

The above Python code (https://github.com/ClusterLabs/crmsh/blob/347f815c6565d0f8d8d5472a5640cfc1ce78ccb5/crmsh/utils.py#L2054) is used by https://github.com/SUSE/salt-shaptools/blob/835d199a6117b0b5657f14ae8fc296af7709f382/salt/modules/crmshmod.py#L707 and https://github.com/SUSE/salt-shaptools/blob/835d199a6117b0b5657f14ae8fc296af7709f382/salt/states/crmshmod.py#L595, which are in turn used by e.g. https://github.com/SUSE/saphanabootstrap-formula/blob/038ee4d6b542365e790c47e942efabedc196fa72/templates/cluster_resources.j2#L4 to decide which cloud is used.

As you can see above, this code is currently broken... I will try to fix it and/or come up with a workaround. One workaround would be going back to the old instance types... but these were abandoned because we had other issues with them (see the PR).

picoroma commented 2 years ago

Can I "force" to use temporary the OLD Instance Type ? IN case which Type Of instance I have to choose, for example ? WRONG: hana_instancetype = "r6i.xlarge" RIGHT hana_instancetype = "????" Can I use r5.2xlarge or r5.4xlarge In the meantime a workaroundis provided ?

yeoldegrove commented 2 years ago

You could try using the old instance types listed here: https://github.com/SUSE/ha-sap-terraform-deployments/pull/822/files#diff-c4686714aa47252c9b02d1319b932187b5d7e2182279ecf8f69935a469a3469dL211 e.g. hana_instancetype = r3.8xlarge. But... #822 came for a reason, and after a reboot your nodes might not come up.
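As a sketch, the temporary change in terraform.tfvars would look roughly like this (r3.8xlarge is the example type named above; whether it matches your sizing needs is a separate question):

```hcl
# Possible temporary workaround: pin an old Xen-based instance type from before #822.
# hana_instancetype = "r6i.xlarge"   # Nitro-based type where the cloud detection currently fails
hana_instancetype = "r3.8xlarge"     # Xen-based type; note the reboot caveat mentioned above
```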

yeoldegrove commented 2 years ago

I proposed a fix here: https://github.com/ClusterLabs/crmsh/pull/952 Let's see how fast we can get this merged.

yeoldegrove commented 2 years ago

@picoroma https://github.com/SUSE/salt-shaptools/pull/87 is a workaround that is merged and available with ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v8"
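For clarity, the same setting as a terraform.tfvars line (the repository URL is the one quoted above; the rest of the configuration is assumed unchanged):

```hcl
# Use the v8 repository so the deployment pulls the patched salt-shaptools package.
ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v8"
```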

Please try out if this fixes it for you.

In the meantime we're working on getting the crmsh fix available in SLES.

yeoldegrove commented 2 years ago

Until the 8.0.1 release, you have to use the develop branch to use the workaround.

picoroma commented 2 years ago

No. I tried again. This time HANA was not installed at all on the 2nd node, and even the cluster features are missing on the 2nd node. HANA installed correctly only on the 1st node. Monitoring is installed, but the exporter is missing on the 2nd node.

The deployment finishes with this error:

module.hana_node.module.hana_provision.null_resource.provision[0]: Creation complete after 24m23s [id=656053581]

│ Error: remote-exec provisioner error
│
│   with module.hana_node.module.hana_provision.null_resource.provision[1],
│   on ..\generic_modules\salt_provisioner\main.tf line 78, in resource "null_resource" "provision":
│   78: provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_213658815.sh": Process exited with status 1

On the 2nd node, the salt-result.log file reports:

Summary for local
Succeeded: 37 (changed=31)
Failed: 0
Total states run: 37
Total run time: 214.288 s
Mon Mar 21 08:49:52 UTC 2022::vmhana02::[INFO] predeployment done
local: Data failed to compile:
Rendering SLS 'base:hana.monitoring' failed: while constructing a mapping
in "", line 12, column 1
found conflicting ID 'install_python_pip'
in "", line 79, column 1
Mon Mar 21 08:50:01 UTC 2022::vmhana02::[ERROR] deployment failed
vmhana02:/var/log #

picoroma commented 2 years ago

I performed another attempt with log level = info and monitoring disabled.

This is the output of the deployment:

module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [ERROR ] b'nameserver vmhana02:30001 not responding.'
module.hana_node.module.hana_provision.null_resource.provision[1]: Still creating... [54m44s elapsed]
module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [INFO ] b'adding site ...'
module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [INFO ] b'collecting information ...'
module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [INFO ] b'unable to contact primary site host vmhana01:40002. internal error,location=vmhana01:40002. Trying old-style port (port offset +100)...vmhana01:40002'
module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [INFO ] b'unable to contact primary site; to vmhana01:30102; original error: internal error,location=vmhana01:30102; '
module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [INFO ] b'failed. trace file nameserver_vmhana02.00000.000.trc may contain more error details.'
module.hana_node.module.hana_provision.null_resource.provision[1] (remote-exec): [ERROR ] b'nameserver vmhana02:30001 not responding.'
module.hana_node.module.hana_provision.null_resource.provision[1]: Still creating... [54m54s elapsed]

yeoldegrove commented 2 years ago

@picoroma https://github.com/SUSE/ha-sap-terraform-deployments/releases/tag/8.0.1 has just been released, including fixes for this issue. It would be cool if you could confirm that it works now.

picoroma commented 2 years ago

I tried with 2 nodes, HA enabled and monitoring enabled. I still get an error. The 2nd node is not managed: no filesystem is attached and HANA is not installed. HANA was installed only on the 1st node. The deployment finishes with this message:

module.hana_node.module.hana_provision.null_resource.provision[0] (remote-exec): Succeeded: 48 (changed=31)
module.hana_node.module.hana_provision.null_resource.provision[0] (remote-exec): Failed: 0
module.hana_node.module.hana_provision.null_resource.provision[0] (remote-exec): -------------
module.hana_node.module.hana_provision.null_resource.provision[0] (remote-exec): Total states run: 48
module.hana_node.module.hana_provision.null_resource.provision[0] (remote-exec): Total run time: 764.523 s
module.hana_node.module.hana_provision.null_resource.provision[0] (remote-exec): Wed Mar 23 15:26:41 UTC 2022::vmhana01::[INFO] deployment done
module.hana_node.module.hana_provision.null_resource.provision[0]: Creation complete after 25m33s [id=209775395]
╷
│ Error: remote-exec provisioner error
│
│   with module.hana_node.module.hana_provision.null_resource.provision[1],
│   on ..\generic_modules\salt_provisioner\main.tf line 65, in resource "null_resource" "provision":
│   65: provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_971219650.sh": Process exited with status 1

I can send the salt*.log files if needed, but I still see errors with the SUSE modules, for example:

Please check if the URIs defined for this repository are pointing to a valid repository.
Skipping repository 'SLE-Product-SLES_SAP15-SP2-Updates' because of the above error.
Repository 'SLE-Module-Server-Applications15-SP2-Pool' is invalid.
[Server_Applications_Module_x86_64:SLE-Module-Server-Applications15-SP2-Pool|plugin:/susecloud?credentials=Server_Applications_Module_x86_64&path=/repo/SUSE/Products/SLE-Module-Server-Applications/15-SP2/x86_64/product/] Valid metadata not found at specified URL
History:

Please check if the URIs defined for this repository are pointing to a valid repository.
Skipping repository 'SLE-Module-Server-Applications15-SP2-Pool' because of the above error.
Repository 'SLE-Module-Server-Applications15-SP2-Updates' is invalid.
[Server_Applications_Module_x86_64:SLE-Module-Server-Applications15-SP2-Updates|plugin:/susecloud?credentials=Server_Applications_Module_x86_64&path=/repo/SUSE/Updates/SLE-Module-Server-Applications/15-SP2/x86_64/update/] Valid metadata not found at specified URL
History:

yeoldegrove commented 2 years ago

@picoroma Your latest reported issues are most likely related to the SUSEConnect or registercloudguest infrastructure (or code).

Which image are you using exactly, and is it PAYG or BYOL? A short test from my side (just now) did not show any issues with os_image = "suse-sles-sap-15-sp2" (which is the default PAYG image) and aws_region = "us-east-2". Another experience from my side is that it depends on the cloud provider, time of day, and availability zone when you hit these kinds of issues.
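For comparison, the corresponding terraform.tfvars lines from that quick test would look roughly like this (both values are quoted from the comment above; everything else in the file is assumed):

```hcl
os_image   = "suse-sles-sap-15-sp2"  # default PAYG image
aws_region = "us-east-2"
```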

picoroma commented 2 years ago

I think this can be related to the OS image I'm using, which is a BYOL one. I had a similar issue even with the AWS Launch Wizard script for SAP. I opened a case with AWS and they said:

I have received an update from our internal team confirming that the problem was due to recent SUSE change in the registration of BYOS AMIs : https://www.suse.com/c/byos-instances-and-the-suse-public-cloud-update-infrastructure/

Our team has informed me that AWS Launch Wizard service will rollout the fix to handle SUSE updates in registration of BYOS AMI’s to all regions by 3/4. I hope that this is helpful.

I do not know if this helps you to troubleshoot. Anyway, I will perform a new deployment using the PAYG OS and give you feedback.

yeoldegrove commented 2 years ago

@picoroma Closing this. We're happy to investigate/reopen if you still have this issue.