@mr-stringer - kindly advise here...
@busetde I did a GCP deployment last week which was fine, so I think the automation is working.
Did the fencing fail immediately or after some time?
Have you tried to cleanup the fencing resource?
Can you attach your terraform.tfvars and the output of grep 'rsc_gcp_stonith_HDB' /var/log/pacemaker/pacemaker.log?
@mr-stringer
As per the screenshot above, the deployment completed successfully but fencing is stopped.
Did the fencing fail immediately or after some time?
After the deployment completed, I accessed the HANA instance and ran crm_mon -rfn1, which showed the fencing was already Stopped.
Have you tried to cleanup the fencing resource?
Using crm resource cleanup I was able to move the fencing resources to "Active Resources", but after a few seconds, running crm_mon -rfn1 shows the fencing resources back under Inactive Resources.
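Roughly, the sequence looks like this (a sketch; no resource name is given, so cleanup applies to all resources):
# Clean up the failed fencing resources
crm resource cleanup
# Immediately afterwards they appear under "Active Resources"
crm_mon -rfn1
# A few seconds later they are back under "Inactive Resources"
sleep 10
crm_mon -rfn1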
Can you attach your terraform.tfvars and the output of grep 'rsc_gcp_stonith_HDB' /var/log/pacemaker/pacemaker.log?
Attached are the terraform.tfvars and the grep of pacemaker.log.
Thanks - Budi
Thanks for the update.
Could I trouble you to send the output of crm configure show and /var/log/messages?
@mr-stringer
Below is the output of crm configure show and /var/log/messages (I only included salt*.log and sapconf.log); hope that's enough.
Let me know if there's anything else needed...
Hi, @busetde. I'm a little confused by what I'm seeing in terms of data. Your screenshot shows the hostnames as default-budi01 and default-budi02, but the files you sent all had the hostnames as default-vmhana01 and default-vmhana02. For now, I'm going to assume you tried a second deploy with different hostnames and sent me files from that. If that is not correct, let me know.
It looks like the stonith primitive has been deployed OK.
# This stonith resource and location will be duplicated for each node in the cluster
primitive rsc_gcp_stonith_HDB_HDB00_default-vmhana01 stonith:fence_gce \
params plug=default-vmhana01 pcmk_host_map="default-vmhana01:default-vmhana01" \
meta target-role=Started
# This stonith resource and location will be duplicated for each node in the cluster
primitive rsc_gcp_stonith_HDB_HDB00_default-vmhana02 stonith:fence_gce \
params plug=default-vmhana02 pcmk_host_map="default-vmhana02:default-vmhana02" \
meta target-role=Started
fence_gce uses googleapiclient and oauth2client to interact with the GCE API to fence nodes. As part of the automated deployment, we ask for a GCP credentials file; the account specified in this file should have permission to download the SAP media from the storage account and the ability to fence nodes. The file is specified in the terraform variable gcp_credentials_file and gets copied to /root/gcp_credentials_file.
Therefore, I'm thinking it's possible that the credentials used may have the permission to download from the storage account but not the ability to fence nodes.
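One rough way to check which account is in that file and which roles it holds (a sketch only, assuming the file is a standard service-account key; run the gcloud part from any machine with the gcloud CLI configured, and replace my-project and the email placeholder with your own values):
# Show the service-account email stored in the copied credentials file on the node
# (adjust the path if the file has a different name on your node)
python3 -c 'import json; print(json.load(open("/root/gcp_credentials_file"))["client_email"])'
# List the IAM roles bound to that account in the project
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:<service-account-email>" \
  --format="table(bindings.role)"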
We can also test this directly: fence_gce is a Python script, so we can easily call it manually. Can you try the following? Log in to node default-vmhana01 and, as root, run fence_gce -n default-vmhana02 -o status. On my system, I get the message Status: ON and rc=0.
If that doesn't work, it suggests that there is a problem with the API, the googleapiclient or permissions. If it does work, you should try to fence the second node with the command fence_gce -n default-vmhana02 -o reboot. This time you should get the message Success: Rebooted and an rc=0.
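Putting those two checks together, roughly (run as root on default-vmhana01; the agent also accepts a --zone flag if it cannot determine the zone itself):
# Check that the fence agent can query the peer node
fence_gce -n default-vmhana02 -o status
echo $?        # expect "Status: ON" above and 0 here
# If the status check works, try an actual fence (this reboots default-vmhana02!)
fence_gce -n default-vmhana02 -o reboot
echo $?        # expect "Success: Rebooted" and 0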
If the commands don't run, you should check whether /root/google_credentials.json has the same content as the file specified in gcp_credentials_file, and that the user/account specified in that file has sufficient permission to fence the nodes.
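For the content check, something like the following (a sketch; the second path is a placeholder for whatever your gcp_credentials_file variable points at, once both files are on the same machine):
# Compare the credentials on the node with the file referenced by gcp_credentials_file
diff /root/google_credentials.json /path/to/your/gcp_credentials_file && echo "files match"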
I hope this helps. Let me know how you get on.
Steve.
Hi @mr-stringer - sorry for the confusion; yes, the first deployment has already been destroyed, and I then disabled (commented out) hana_name.
Looks like there's a problem with the API.
Run fence_gce -n default-vmhana02 -o status --zone=asia-southeast2-b
There's an error: Required 'compute.instance.get' permission.
Run fence_gce -n default-vmhana02 -o status --zone=asia-southeast2-b --serviceaccount=/root/google_credentials.json
Success with Status: ON
Any idea what user / service account is used by the first fence command?
Run fence_gce -n default-vmhana02 -o reboot --zone=asia-southeast2-b --serviceaccount=/root/google_credentials.json
Got Success: Rebooted, but no rc=0 was shown.
Run crm_mon -rfn1
It shows the error below:
Failed Fencing Actions:
* reboot of default-vmhana02 failed: delegate=default-vmhana01, client=pacemaker-controld.30100, origin=default-vmhana01, last-failed='2023-03-10 14:09:06Z'
Please kindly advise on what to do next?
By any chance is there more than one set of credentials in the file?
@mr-stringer - I've run cat /root/google_credentials.json and it shows only one credential; hence running fence_gce -n default-vmhana02 -o reboot --zone=asia-southeast2-b --serviceaccount=/root/google_credentials.json got Success: Rebooted.
Any advice please?
Hi, this is odd as I haven't seen this before.
We could probably fix this by adding serviceaccount as a parameter in the stonith primitive, but then that leaks the location of the credentials to anyone who can run the crm command.
Could you try adding export GOOGLE_APPLICATION_CREDENTIALS="/root/google_credentials.json" to /root/.bashrc? You'll need to do this on both nodes and reboot them.
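For example, on each node (a sketch; assumes the credentials were copied to /root/google_credentials.json as above):
# Append the variable to root's .bashrc
echo 'export GOOGLE_APPLICATION_CREDENTIALS="/root/google_credentials.json"' >> /root/.bashrc
# Log back in (or source /root/.bashrc) and confirm
echo "$GOOGLE_APPLICATION_CREDENTIALS"
# Then reboot the node
reboot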
Let me know how you get on.
@mr-stringer - added as per the suggestion.
On default-vmhana01, ran fence_gce -n default-vmhana02 -o reboot.
On default-vmhana01, ran crm_mon -rfn1.
The screenshot still shows the fencing resources under Inactive Resources...
Please kindly advise....
To be clear, did you add GOOGLE_APPLICATION_CREDENTIALS to /root/.bashrc to both hosts and reboot both hosts?
@mr-stringer - Already added, as shown below:
for default-vmhana01
for default-vmhana02
Please kindly advise
I should have a little time today to build a system based on your terraform tfvars to attempt to replicate the issue. I'll keep you posted.
@mr-stringer - Would it be better if we checked my environment together?
Hi Budi, sorry for the delay in replying. I've been looking into this issue in some more detail.
It appears that the current automation relies not on the service account file but on the IAM configuration of the instance. It may be possible for us to change this behaviour in the future, but it is not currently planned.
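To see which identity (and scopes) the instance itself runs under, which is what the fence agent falls back to when no serviceaccount parameter is given, you can query the GCE metadata server from the node - a sketch:
# Service account attached to the instance
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
# OAuth scopes granted on this instance
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"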
I see that you have two options.
1) Apply the required permission to the deployed instances - see this link.
2) Alter the cluster configuration to pass the serviceaccount parameter to the stonith primitive with the value /root/google_credentials.json - see this link. (A rough sketch of both options follows below.)
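Rough sketches of both options (the project ID, service-account email and role are placeholders and assumptions - the role shown may be broader than strictly required; the resource names are the ones from your crm configure show output):
# Option 1: grant the instance's service account a role that allows resetting instances
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/compute.instanceAdmin.v1"
# Option 2: point the stonith primitives at the credentials file instead
crm resource param rsc_gcp_stonith_HDB_HDB00_default-vmhana01 set serviceaccount /root/google_credentials.json
crm resource param rsc_gcp_stonith_HDB_HDB00_default-vmhana02 set serviceaccount /root/google_credentials.json
crm resource cleanup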
Hi @mr-stringer
Thanks for the hint on IAM. The issue was caused by a missing IAM role in the instance configuration. It's working now...
Thanks - Budi
Used cloud platform: GCP
Used SLES4SAP version: SLES15 SP4 for SAP Applications
Used client machine OS: macOS
Expected behaviour vs observed behaviour: Deployment with HANA instances only completes successfully.
Expected behaviour: fencing is started, so there are no Inactive Resources. Observed behaviour: fencing is stopped and shown under Inactive Resources.
Check with: crm_mon -rfn1
Screenshot