Power VS: Health status is not a reliable indicator for instance being up.

mjturek commented 2 years ago

Terraform CLI and Terraform IBM Provider Version

github.com/hashicorp/terraform-exec v0.16.1
github.com/IBM-Cloud/terraform-provider-ibm v1.44.2

Affected Resource(s)

ibm_pi_instance

Debug Output

https://gist.github.com/mjturek/fe8680828ed3f1b68eb31ab671072e91

Panic Output

None.

Expected Behavior

Instances should report as successfully launched. They are ACTIVE, and our install would continue as expected if Terraform recognized the instances as launched.

Actual Behavior

Instance creation kept going, waiting for the health status to move to WARNING.

Steps to Reproduce

This is not consistently reproducible. However, I've seen it pretty much every time I launch in tor01.

Launch a pi_instance via Terraform. In some cases, a VM will never leave PENDING health status, so the VM never reports as launched.

Important Factoids

This has caused a significant number of failures in OpenShift IPI on Power VS.

We think the health status is not a reliable test of whether an instance is available, and would prefer a server status of ACTIVE to be the indicator that an instance is available.
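
The difference between the two checks can be sketched as follows. This is a minimal illustration and not the provider's code: the function names are hypothetical, and the status and health values are just the strings that appear in this issue.

```python
# Hypothetical sketch: two ways to decide that an instance is "up".
# Not the provider's implementation; names are made up for illustration.

def ready_by_health(status: str, health: str) -> bool:
    """Health-based check: ACTIVE alone is not enough; the health
    status must also have left PENDING (here: OK or WARNING)."""
    return status == "ACTIVE" and health in ("OK", "WARNING")


def ready_by_status(status: str, health: str) -> bool:
    """Check proposed in this issue: server status ACTIVE is sufficient."""
    return status == "ACTIVE"


# An instance that is ACTIVE but whose health never leaves PENDING:
stuck = ("ACTIVE", "PENDING")
print(ready_by_health(*stuck))  # False
print(ready_by_status(*stuck))  # True
```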

yussufsh commented 2 years ago

The health status should at least move out of PENDING; otherwise I think the VM is unusable. Please create a support ticket to find out why the VM is not coming out of PENDING health status.

cc @rmoralesjr

hamzy commented 2 years ago

In my experience, the VM is up and running as seen on the console. However, TF is still waiting on the VM because of this (as seen on the CLI):

hamzy:hamzy-installer$ ibmcloud pi instances --json | jq -r '.Payload.pvmInstances[] | select (.serverName|test("rdr-hamzy")) | " \(.serverName) - \(.status) - health: \(.health.reason) - \(.health.status)"'
 rdr-hamzy-test-syd04-vnj4t-master-2 - ACTIVE - health: PENDING - PENDING
 rdr-hamzy-test-syd04-vnj4t-master-1 - ACTIVE - health: PENDING - PENDING
 rdr-hamzy-test-syd04-vnj4t-master-0 - ACTIVE - health: PENDING - PENDING
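
The same filter could be written in Python for use in a script. This is a sketch that assumes the JSON layout implied by the jq expression above (Payload.pvmInstances[] entries with serverName, status, and health.status). The function name and the sample data are made up, and it matches names by substring rather than by regex as jq's test does.

```python
import json


def active_but_pending(cli_json: str, name_pattern: str = "") -> list[str]:
    """Return names of instances that are ACTIVE while health is PENDING.

    Assumes the layout used by the jq filter: Payload.pvmInstances[]
    entries with serverName, status and health.status.
    """
    instances = json.loads(cli_json)["Payload"]["pvmInstances"]
    return [
        inst["serverName"]
        for inst in instances
        if name_pattern in inst["serverName"]
        and inst["status"] == "ACTIVE"
        and inst.get("health", {}).get("status") == "PENDING"
    ]


# Made-up sample in the assumed shape, not real CLI output.
sample = json.dumps({"Payload": {"pvmInstances": [
    {"serverName": "example-master-0", "status": "ACTIVE",
     "health": {"reason": "PENDING", "status": "PENDING"}},
    {"serverName": "example-worker-0", "status": "ACTIVE",
     "health": {"reason": "", "status": "OK"}},
]}})
print(active_but_pending(sample, "example"))  # ['example-master-0']
```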

rmoralesjr commented 2 years ago

Here are the log lines with the timestamps to help the service broker developers locate the instance creation logs for the instance mentioned above. . . .

    Sep 6 08:29:04 tor-servicebroker-9f5d88686-7sn4b tor-servicebroker DEBUG 13:29:04.21Z tor-servicebroker-9f5d88686-7sn4b[ecc71] pcloud transaction_logging.go:86/Debugf ▶ (67-7d81) create pvm-instance rdr-ipi-mjturek-7qzh6-master-2 in b301a0b3bcb4417c9a969c4d53e6c8b3: (availabilityZone: s922 flavor: {"disk":120,"extra_specs":{"powervm:dedicated_proc":"false","powervm:max_mem":"262144","powervm:max_proc_units":"4","powervm:max_vcpu":"8","powervm:min_mem":"4096","powervm:min_proc_units":"0.25","powervm:min_vcpu":"1","powervm:proc_units":"0.5","powervm:shared_weight":"128","powervm:srr_capability":"true","powervm:storage_connectivity_group":"292ef54d-665b-44d5-9149-1878765832aa","powervm:uncapped":"true"},"ram":32768,"vcpus":1}, server: {"ServerCreationList":[{"Name":"rdr-ipi-mjturek-7qzh6-master-2","AvailabilityZone":"s922","CloudInit":"","CloudInitBytes":". . . "}],"BaseName":"rdr-ipi-mjturek-7qzh6-master-2","SysType":"s922","OsType":"rhel","ProcMode":"shared","Capped":false,"SkipHostValidation":false,"Cores":0.5,"VCPUs":1,"MemoryMB":32768,"ImageID":"cfde9c76-4c35-44fc-bb5b-382fe8224fc2","RootDiskSizeGB":120,"Networks":[{"networkID":"a2cab25d-da48-4e9e-85c2-e21990039e8d"}],"UserStorageSelectionValues":{},"StockImageStorageSelectionValues":null,"VolumeIDs":[],"ServerGroup":null,"AffinityPolicy":"none","SSHKeyName":"rdr-ipi-mjturek-7qzh6-key","SSHKeyID":"65b64c1f1c29460e8c2e4bbfbd893c2c_7912870e-f4eb-4967-b83a-31a716d642ff_rdr-ipi-mjturek-7qzh6-key","SoftwareLicenses":{"ibmiCSS":false,"ibmiDBQ":false,"ibmiPHA":false,"ibmiRDS":false},"UserData":". . . . ","UserDataBytes":null,"MaxHostCores":15,"MaxHostMemoryMB":789504,"PinPolicy":"none","Metadata":{},"SAPProfile":null,"StorageConnectivityGroup":"292ef54d-665b-44d5-9149-1878765832aa","VTLRepositoryCapacity":0,"PlacementGroup":"","LinuxSubscriptionType":null,"DeploymentType":"","SharedProcessorPool":"","SPPResources":null,"Migratable":false})

. . . .

    Sep 6 08:29:05 tor-servicebroker-9f5d88686-7sn4b tor-servicebroker DEBUG 13:29:05.769Z tor-servicebroker-9f5d88686-7sn4b[ecc7b] pcloud transaction_logging.go:86/Debugf ▶ (67-7d81) create pvm-instance rdr-ipi-mjturek-7qzh6-master-2 in b301a0b3bcb4417c9a969c4d53e6c8b3 successful, instance is being created with id 395c335e-00ae-4f10-b86f-c87b18732166

yussufsh commented 2 years ago

@hamzy since this is a one-off case where the VM health status is not reported properly by the service broker, can we close this? The Terraform code works as designed, and the point is that we cannot actually use a VM if its actual health status is PENDING. So there is no point in changing the target checks.

hamzy commented 2 years ago

It seems the motivation to fix the problem arises when it happens to a user who has no workaround to get the cluster up. While the problem doesn't happen often, it has happened somewhat recently. In my opinion, the TF code is using the wrong status to determine whether a VM is up, since you can access the console and interact with the running VM that way. The TF code just never progresses past this problem.

yussufsh commented 2 years ago

The problem here is that we cannot and should not check the console or interact with the user's VM. Terraform relies on the data provided in the API response. This should be reported to the PowerVS service broker team, which needs to fix the original issue; we cannot do anything more from the Terraform side. The user can also handle such cases with the state management commands terraform import or terraform taint, to include the created VM or re-create it respectively, depending on whether they are confident the VM is actually running.

christopher-horn commented 9 months ago

@yussufsh @rmoralesjr been a while since I reviewed open issues, wanted to throw my thoughts here...

I have opened tickets with support a number of times over the last few years because of this issue; we have seen it numerous times with our deployments. Support was never able to find a problem.

I could be wrong, but from what I have seen and grown to understand, I think the actual issue here may not be in the Service Broker layer at all. I think the Cloudflare caching layer is the root cause of the problem, returning stale data from the cache. When I have seen this happen, we always see the correct VM status in the UI, but the logs show that Terraform keeps seeing the PENDING status in the API responses and eventually fails because of it.

Assuming I am right, I do not know much about how Cloudflare caching works, or whether there is anything the PowerVS team can do to prevent an issue like this. It is not even clear to me where responsibility for fixing it lies, whether with the PowerVS team or Cloudflare. And because it is an intermittent issue that comes and goes, it may be difficult to track down and fix for good.

christopher-horn commented 9 months ago

One more comment on this. When we have seen this happen it is usually site related. It usually affects all deploys at the site, and sometimes goes away after a day or two, but we have had cases where it dragged out a few weeks before things just magically started working fine again.

yussufsh commented 9 months ago

@christopher-horn I suggest looking at the CLI instead of the UI for instance state and health status. Terraform will return when the health status is OK or WARNING, as per the pi_health_status argument. Even with the CLI, if the health status is not one of those, then it is a service broker or cache layer issue.

A VM will be stuck in health status WARNING if RMC is not reporting the status from within the VM. That is why we added that argument, so that the resource can complete.
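
The wait behaviour described here can be sketched as a polling loop. This is a hedged illustration and not the provider's code: wait_for_instance, the fetch callable, and the retry parameters are hypothetical, and treating OK as also satisfying a WARNING target is an assumption.

```python
import time
from typing import Callable, Tuple


def wait_for_instance(
    fetch: Callable[[], Tuple[str, str]],  # returns (status, health status)
    health_target: str = "OK",             # stands in for pi_health_status
    attempts: int = 10,
    delay_s: float = 0.0,
) -> int:
    """Poll until the instance is ACTIVE with an accepted health status.

    Returns the number of polls used; raises TimeoutError otherwise.
    """
    # Assumption: a WARNING target is also satisfied by OK.
    accepted = {"OK", "WARNING"} if health_target == "WARNING" else {"OK"}
    for attempt in range(1, attempts + 1):
        status, health = fetch()
        if status == "ACTIVE" and health in accepted:
            return attempt
        time.sleep(delay_s)
    raise TimeoutError(f"health status never reached {sorted(accepted)}")


# RMC never reports: health stays at WARNING, so only a WARNING target returns.
print(wait_for_instance(lambda: ("ACTIVE", "WARNING"), health_target="WARNING"))  # 1

# The case in this issue: health stays PENDING, so the wait times out.
try:
    wait_for_instance(lambda: ("ACTIVE", "PENDING"), health_target="WARNING", attempts=3)
except TimeoutError as err:
    print("timed out:", err)
```

With a stale PENDING response, no choice of health target lets the loop return, which matches the failures reported in this thread.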

I have not seen or heard of this issue for a long time now.

@michaelkad @ismirlia can you help here?

christopher-horn commented 9 months ago

@yussufsh when I have seen this happen, the CLI too shows PENDING status on the VM, just as Terraform sees. And since the UI shows a valid health status, that is what led me to conclude that stale cache data from Cloudflare likely explained what was happening.

I think we last saw this issue at a site sometime back in early Fall last year. I know we have not seen it in the last few months.

hamzy commented 9 months ago

Thankfully we are moving away from TF and their broken implementation in 4.16+.

rmoralesjr commented 9 months ago

Since Michael and Axel are working on Terraform and know the latest info I'll let them comment.

IBM-Cloud / terraform-provider-ibm