SUSE / ha-sap-terraform-deployments

Automated SAP/HA Deployments in Public/Private Clouds
GNU General Public License v3.0

GCP - Terraform Deployed Successfully but fencing Stopped #904

Closed busetde closed 1 year ago

busetde commented 1 year ago

Used cloud platform GCP

Used SLES4SAP version SLES15 SP4 for SAP Applications

Used client machine OS macOS

Expected behaviour vs observed behaviour: Deployment with HANA instances only was successful

(screenshot)

Expected behaviour: fencing started, so there are no Inactive Resources. Observed behaviour: fencing shown as Stopped under Inactive Resources.

Check with crm_mon -rfn1

Screenshot

(screenshot)
busetde commented 1 year ago

@mr-stringer - kindly advise here...

mr-stringer commented 1 year ago

@busetde I did a GCP deployment last week which was fine, so I think the automation is working.

Did the fencing fail immediately or after some time?

Have you tried to cleanup the fencing resource?

Can you attach your terraform.tfvars and the output of grep 'rsc_gcp_stonith_HDB' /var/log/pacemaker/pacemaker.log?

busetde commented 1 year ago

@mr-stringer

As per the screenshot above, the deployment completed successfully but fencing is stopped.

Did the fencing fail immediately or after some time? After the deployment completed, I accessed the HANA instance and ran crm_mon -rfn1, which showed the fencing already Stopped.

Have you tried to cleanup the fencing resource? Running crm resource cleanup moves the fencing resources to "Active Resources", but after only a few seconds crm_mon -rfn1 shows the fencing resources back under Inactive Resources.

Can you attach your terraform.tfvars and the output of grep 'rsc_gcp_stonith_HDB' /var/log/pacemaker/pacemaker.log? Attached are the terraform.tfvars and the grep of pacemaker.log.

Thanks - Budi

mr-stringer commented 1 year ago

Thanks for the update.

Could I trouble you to send the output of crm configure show and /var/log/messages?

busetde commented 1 year ago

@mr-stringer Below is the output of crm configure show and /var/log/messages (only including salt*.log and sapconf.log); I hope that's enough. Let me know if there's anything else needed...

mr-stringer commented 1 year ago

Hi, @busetde. I'm a little confused by what I'm seeing in terms of data. Your screenshot shows the hostnames as default-budi01 and default-budi02, but the files you sent all had the hostnames as default-vmhana01 and default-vmhana02. For now, I'm going to assume you tried a second deploy with different hostnames and sent me files from that. If that is not correct, let me know.

It looks like the stonith primitive has been deployed OK.

# This stonith resource and location will be duplicated for each node in the cluster
primitive rsc_gcp_stonith_HDB_HDB00_default-vmhana01 stonith:fence_gce \
    params plug=default-vmhana01 pcmk_host_map="default-vmhana01:default-vmhana01" \
    meta target-role=Started
# This stonith resource and location will be duplicated for each node in the cluster
primitive rsc_gcp_stonith_HDB_HDB00_default-vmhana02 stonith:fence_gce \
    params plug=default-vmhana02 pcmk_host_map="default-vmhana02:default-vmhana02" \
    meta target-role=Started

fence_gce uses the googleapiclient and oauth2client libraries to interact with the GCE API to fence nodes. As part of the automated deployment, we ask for a GCP credential file; the account specified in this file should have permission to download the SAP media from the storage account and the ability to fence nodes. The file is specified in the terraform variable gcp_credentials_file and this gets copied to /root/gcp_credentials_file.

Therefore, I'm thinking it's possible that the credentials used may have the permission to download from the storage account but not the ability to fence nodes.
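If it helps, here is a rough way to check which service account the file holds and which roles that account has in the project. This is only a sketch, assuming the gcloud CLI and jq are available on the node, that the credentials file ends up at /root/google_credentials.json as referenced later in this thread, and with <project-id> and <email-from-above> as placeholders.

# Show which service account the credentials file belongs to
jq -r .client_email /root/google_credentials.json
# List the roles bound to that service account in the project
gcloud projects get-iam-policy <project-id> \
    --flatten="bindings[].members" \
    --format="table(bindings.role)" \
    --filter="bindings.members:serviceAccount:<email-from-above>"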

We can test this. fence_gce is a Python script, so we can easily call it manually. Can you try the following?

Log in to node default-vmhana01 and, as root, run fence_gce -n default-vmhana02 -o status. On my system, I get the message Status: ON and rc=0.

If that doesn't work, it suggests that there is a problem with the API, the googleapiclient or permissions. If it does work, you should try to fence the second node with the command: fence_gce -n default-vmhana02 -o reboot. This time you should get a message of Success: Rebooted and an rc=0.
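Put together, the two checks look roughly like this when run as root on default-vmhana01 (just a sketch of the steps described above, with the return code printed explicitly; note that the second command really does reboot default-vmhana02):

# Ask the fence agent for the power status of the peer node
fence_gce -n default-vmhana02 -o status ; echo "rc=$?"     # expect "Status: ON" and rc=0
# If the status check works, test a real fence action (this reboots default-vmhana02)
fence_gce -n default-vmhana02 -o reboot ; echo "rc=$?"     # expect "Success: Rebooted" and rc=0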

If the commands don't run, you should check whether /root/google_credentials.json has the same content as the file specified in gcp_credentials_file and that the user/account specified in that file has sufficient permission to fence the nodes.

I hope this helps. Let me know how you get on.

Steve.

busetde commented 1 year ago

Hi @mr-stringer - sorry for the confusion. Yes, the first deployment was already destroyed, and I then disabled #hana_name.

Looks like there's a problem with the API. Running fence_gce -n default-vmhana02 -o status --zone=asia-southeast2-b gives an error that the 'compute.instance.get' permission is required.

Running fence_gce -n default-vmhana02 -o status --zone=asia-southeast2-b --serviceaccount=/root/google_credentials.json succeeds with Status: ON.

Any idea what user / service account is used by the first fence command?

Running fence_gce -n default-vmhana02 -o reboot --zone=asia-southeast2-b --serviceaccount=/root/google_credentials.json got Success: Rebooted, but there's no rc=0 shown.

Running crm_mon -rfn1 shows the error below:

Failed Fencing Actions:
  * reboot of default-vmhana02 failed: delegate=default-vmhana01, client=pacemaker-controld.30100, origin=default-vmhana01, last-failed='2023-03-10 14:09:06Z' 

Please kindly advise on what to do next?

mr-stringer commented 1 year ago

By any chance is there more than one set of credentials in the file?

busetde commented 1 year ago

@mr-stringer - I ran cat /root/google_credentials.json and it shows only one set of credentials; hence, when running fence_gce -n default-vmhana02 -o reboot --zone=asia-southeast2-b --serviceaccount=/root/google_credentials.json, I got Success: Rebooted.

Any advice, please?

mr-stringer commented 1 year ago

Hi, this is odd as I haven't seen this before.

We could probably fix this by adding serviceaccount as a parameter in the stonith primitive, but then that leaks the location of the credentials to anyone who can run the crm command.

Could you try adding export GOOGLE_APPLICATION_CREDENTIALS="/root/google_credentials.json" to /root/.bashrc? You'll need to do this on both nodes and reboot them.
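In other words, something like the following on each node, as root, followed by a reboot of that node (a minimal sketch of the change described above):

# Append the variable to root's .bashrc (run on both default-vmhana01 and default-vmhana02)
echo 'export GOOGLE_APPLICATION_CREDENTIALS="/root/google_credentials.json"' >> /root/.bashrc
# Then reboot the node, letting the cluster settle before moving on to the second node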

Let me know how you get on.

busetde commented 1 year ago

@mr-stringer - added as per the suggestion.

On default-vmhana01, running `fence_gce -n default-vmhana02 -o reboot`:

(screenshot)

On default-vmhana01, running `crm_mon -rfn1`:

(screenshot)

The screenshot still shows Inactive Resources...

Please kindly advise....

mr-stringer commented 1 year ago

To be clear, did you add GOOGLE_APPLICATION_CREDENTIALS to /root/.bashrc to both hosts and reboot both hosts?

busetde commented 1 year ago

To be clear, did you add GOOGLE_APPLICATION_CREDENTIALS to /root/.bashrc to both hosts and reboot both hosts?

@mr-stringer - Already added, as below, for default-vmhana01:

(screenshot)

and for default-vmhana02:

(screenshot)

Please kindly advise

mr-stringer commented 1 year ago

I should have a little time today to build a system based on your terraform tfvars to attempt to replicate the issue. I'll keep you posted.

busetde commented 1 year ago

@mr-stringer - Would it be better to look at my environment together?

mr-stringer commented 1 year ago

Hi Budi, sorry for the delay in replying. I've been looking into this issue in some more detail.

It appears that the current automation relies not on the service account file but on the IAM configuration of the instance. It may be possible for us to override this behaviour in the future, but it is not currently planned.

I see that you have two options.

1) Apply the required permissions to the deployed instances - see this link

2) Alter the cluster configuration to pass the serviceaccount parameter to the stonith primitive with the value /root/google_credentials.json - see this link (a sketch of both options follows below)
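As a rough illustration of both options (resource names taken from the crm configure show output above; the exact crmsh syntax and the IAM role required may vary, and the project ID and service account email are placeholders, so treat this as a sketch rather than a tested recipe):

# Option 1: grant the instances' service account the compute permissions needed for fencing,
# e.g. the Compute Instance Admin (v1) role
gcloud projects add-iam-policy-binding <project-id> \
    --member="serviceAccount:<instance-sa-email>" \
    --role="roles/compute.instanceAdmin.v1"

# Option 2: add the serviceaccount parameter to each stonith primitive, then clean up the resources
crm resource param rsc_gcp_stonith_HDB_HDB00_default-vmhana01 set serviceaccount /root/google_credentials.json
crm resource param rsc_gcp_stonith_HDB_HDB00_default-vmhana02 set serviceaccount /root/google_credentials.json
crm resource cleanup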

busetde commented 1 year ago

Hi @mr-stringer

Thanks for the hint on IAM. The issue was caused by the required IAM role not being available in the configuration. It's working now...

(screenshot)

Thanks - Budi