mjturek opened 2 years ago
At least the health status should change from PENDING; otherwise I think the VM is unusable. I would ask you to create a support ticket to understand why the VM is not coming out of the PENDING health status.
cc @rmoralesjr
In my experience, the VM is up and running as seen on the console. However, Terraform is still waiting on the VM because of this (as seen on the CLI):
```
hamzy:hamzy-installer$ ibmcloud pi instances --json | jq -r '.Payload.pvmInstances[] | select (.serverName|test("rdr-hamzy")) | " \(.serverName) - \(.status) - health: \(.health.reason) - \(.health.status)"'
rdr-hamzy-test-syd04-vnj4t-master-2 - ACTIVE - health: PENDING - PENDING
rdr-hamzy-test-syd04-vnj4t-master-1 - ACTIVE - health: PENDING - PENDING
rdr-hamzy-test-syd04-vnj4t-master-0 - ACTIVE - health: PENDING - PENDING
```
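For comparison, a variant of the same query that keys only on the server status field rather than the health status; the rdr-hamzy name filter is reused from the command above purely as an illustration:

```shell
# List instances whose server status is ACTIVE, ignoring health status.
ibmcloud pi instances --json | jq -r \
  '.Payload.pvmInstances[]
   | select(.serverName | test("rdr-hamzy"))
   | select(.status == "ACTIVE")
   | .serverName'
```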
Here are the log lines with timestamps to help the service broker developers locate the instance creation logs for the instance mentioned above. . . .
```
Sep 6 08:29:04 tor-servicebroker-9f5d88686-7sn4b tor-servicebroker DEBUG 13:29:04.21Z tor-servicebroker-9f5d88686-7sn4b[ecc71] pcloud transaction_logging.go:86/Debugf ▶ (67-7d81) create pvm-instance rdr-ipi-mjturek-7qzh6-master-2 in b301a0b3bcb4417c9a969c4d53e6c8b3: (availabilityZone: s922 flavor: {"disk":120,"extra_specs":{"powervm:dedicated_proc":"false","powervm:max_mem":"262144","powervm:max_proc_units":"4","powervm:max_vcpu":"8","powervm:min_mem":"4096","powervm:min_proc_units":"0.25","powervm:min_vcpu":"1","powervm:proc_units":"0.5","powervm:shared_weight":"128","powervm:srr_capability":"true","powervm:storage_connectivity_group":"292ef54d-665b-44d5-9149-1878765832aa","powervm:uncapped":"true"},"ram":32768,"vcpus":1}, server: {"ServerCreationList":[{"Name":"rdr-ipi-mjturek-7qzh6-master-2","AvailabilityZone":"s922","CloudInit":"","CloudInitBytes":". . . "}],"BaseName":"rdr-ipi-mjturek-7qzh6-master-2","SysType":"s922","OsType":"rhel","ProcMode":"shared","Capped":false,"SkipHostValidation":false,"Cores":0.5,"VCPUs":1,"MemoryMB":32768,"ImageID":"cfde9c76-4c35-44fc-bb5b-382fe8224fc2","RootDiskSizeGB":120,"Networks":[{"networkID":"a2cab25d-da48-4e9e-85c2-e21990039e8d"}],"UserStorageSelectionValues":{},"StockImageStorageSelectionValues":null,"VolumeIDs":[],"ServerGroup":null,"AffinityPolicy":"none","SSHKeyName":"rdr-ipi-mjturek-7qzh6-key","SSHKeyID":"65b64c1f1c29460e8c2e4bbfbd893c2c_7912870e-f4eb-4967-b83a-31a716d642ff_rdr-ipi-mjturek-7qzh6-key","SoftwareLicenses":{"ibmiCSS":false,"ibmiDBQ":false,"ibmiPHA":false,"ibmiRDS":false},"UserData":". . . . ","UserDataBytes":null,"MaxHostCores":15,"MaxHostMemoryMB":789504,"PinPolicy":"none","Metadata":{},"SAPProfile":null,"StorageConnectivityGroup":"292ef54d-665b-44d5-9149-1878765832aa","VTLRepositoryCapacity":0,"PlacementGroup":"","LinuxSubscriptionType":null,"DeploymentType":"","SharedProcessorPool":"","SPPResources":null,"Migratable":false})
. . . .
Sep 6 08:29:05 tor-servicebroker-9f5d88686-7sn4b tor-servicebroker DEBUG 13:29:05.769Z tor-servicebroker-9f5d88686-7sn4b[ecc7b] pcloud transaction_logging.go:86/Debugf ▶ (67-7d81) create pvm-instance rdr-ipi-mjturek-7qzh6-master-2 in b301a0b3bcb4417c9a969c4d53e6c8b3 successful, instance is being created with id 395c335e-00ae-4f10-b86f-c87b18732166
```
@hamzy since this is a one-off case where the VM health status is not reported properly by the service broker, can we close this? The Terraform code works as designed, and the point is we cannot actually use a VM whose actual health status is PENDING, so there is no point in changing the target checks.
The motivation to fix the problem is that it can happen to a user who has no workaround to get the cluster up. While the problem doesn't happen often, it has happened fairly recently. In my opinion, the TF code is using the wrong status to determine whether a VM is up, since you can access the console and interact with the running VM that way. The TF code just never progresses past this problem.
The problem here is that we cannot and should not check the console or interact with the user's VM. Terraform relies on the data provided in the API response. This should be reported to the PowerVS service broker team, which needs to fix the original issue; we cannot do anything more from the Terraform side.
Also, the user can handle such cases with the state management commands terraform import or terraform taint, to adopt the already-created VM or re-create it, respectively, once confident whether the VM is actually running; see the sketch below.
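A minimal sketch of both commands, assuming a hypothetical resource address ibm_pi_instance.master; the workspace and instance IDs are reused from the log above purely for illustration, and the import ID layout is an assumption based on the provider's documented cloud-instance-ID/PVM-instance-ID pattern:

```shell
# Adopt an already-created VM into Terraform state. The import ID below
# combines the cloud instance (workspace) ID and the PVM instance ID
# from the log earlier in this thread, shown only as an example.
terraform import ibm_pi_instance.master \
  b301a0b3bcb4417c9a969c4d53e6c8b3/395c335e-00ae-4f10-b86f-c87b18732166

# Or, if the VM is not actually usable, mark it for re-creation on the
# next apply:
terraform taint ibm_pi_instance.master
terraform apply
```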
@yussufsh @rmoralesjr been a while since I reviewed open issues, wanted to throw my thoughts here...
I have opened tickets with support a number of times over the last few years because of this issue; we have seen it numerous times with our deployments. Support was never able to find a problem.
I could be wrong, but from what I have seen and come to understand, I think the actual issue here may not be in the Service Broker layer at all. I think the CloudFlare caching layer is the root cause of the problem, returning stale data from the cache. When I have seen this happen, we always see the correct VM status in the UI, but the logs show that Terraform keeps seeing the PENDING status in the API responses and eventually fails because of it.
Assuming I am right, I do not know much about how CloudFlare caching works, or whether there is anything the PowerVS team can do to prevent an issue like this. It is not even clear to me where responsibility for fixing it lies, whether with the PowerVS team or CloudFlare. And because it is an intermittent issue that comes and goes, it may be difficult to track down and fix for good.
One more comment on this. When we have seen this happen, it is usually site related. It usually affects all deploys at the site and sometimes goes away after a day or two, but we have had cases where it dragged on for a few weeks before things just magically started working fine again.
@christopher-horn I suggest looking at the CLI instead of the UI for instance state and health status. Terraform will return once the health status is OK or WARNING, as per the pi_health_status argument. If the health status is not one of those even in the CLI, then it is a service broker or cache layer issue.
A VM will be stuck in WARNING health status when RMC is not reporting status from within the VM. That is why we added that argument, so the resource can complete; a sketch is below.
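A minimal sketch of setting that argument on an ibm_pi_instance resource; all names, variables, and values here are placeholders rather than anything taken from this issue:

```hcl
# Hypothetical minimal ibm_pi_instance; every value is a placeholder.
resource "ibm_pi_instance" "example" {
  pi_cloud_instance_id = var.pi_cloud_instance_id
  pi_instance_name     = "example-vm"
  pi_image_id          = var.pi_image_id
  pi_memory            = 4
  pi_processors        = 0.5
  pi_proc_type         = "shared"
  pi_sys_type          = "s922"
  pi_key_pair_name     = var.pi_key_name

  pi_network {
    network_id = var.pi_network_id
  }

  # Treat WARNING as ready (e.g. when RMC has not reported in yet)
  # instead of waiting for the health status to reach OK.
  pi_health_status = "WARNING"
}
```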
I have not seen or heard of this issue in a long time now.
@michaelkad @ismirlia can you help here?
@yussufsh when I have seen this happen, the CLI also shows PENDING status on the VM, just as Terraform sees. And since the UI shows a valid health status, that is what led me to conclude that stale cache data from CloudFlare likely explained what was happening.
I think we last saw this issue at a site sometime back in early Fall last year. I know we have not seen it in the last few months.
Thankfully we are moving away from TF and their broken implementation in 4.16+.
Since Michael and Axel are working on Terraform and know the latest info I'll let them comment.
Terraform CLI and Terraform IBM Provider Version
```
github.com/hashicorp/terraform-exec v0.16.1
github.com/IBM-Cloud/terraform-provider-ibm v1.44.2
```
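For reference, a sketch of pinning the provider version reported above; the block is standard Terraform syntax, with only the version number taken from this report:

```hcl
terraform {
  required_providers {
    ibm = {
      source  = "IBM-Cloud/ibm"
      version = "1.44.2" # version reported in this issue
    }
  }
}
```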
Affected Resource(s)
ibm_pi_instance
Terraform Configuration Files
Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.
Debug Output
https://gist.github.com/mjturek/fe8680828ed3f1b68eb31ab671072e91
Panic Output
None.
Expected Behavior
Instances should report as successfully launched. They are ACTIVE, and our install would continue as expected if Terraform recognized the instances as launched.
Actual Behavior
Instance creation continued, waiting for the health status to move to WARNING.
Steps to Reproduce
This is not consistently reproducible; however, I've seen it pretty much every time I launch in tor01.
Launch a pi_instance via Terraform. In some cases, a VM will never leave PENDING health status, causing the VM to never report as launched.
Important Factoids
This has caused a significant number of failures in OpenShift IPI on Power VS.
We think the health status is not a reliable test of an instance being available and would prefer a server status of ACTIVE to be the indicator of an instance being available.