TAS deployment failure during network churn

Stemcell: ubuntu-xenial/621.261, BOSH: 2.10.46-build.541, TAS: 2.13.8

Issue description:

While scaling the Diego cell count in our TAS environment with network/org churn running in parallel, the deployment fails with the error below. The same failure can also be observed during an NSX-T tile upgrade from Ops Manager while network/org churn is running in parallel.

Task 96 | 09:05:47 | Creating missing vms: diego_cell/a8a9036f-bc39-4412-b73a-0b920b3050de (14) (00:53:37)
Task 96 | 09:05:56 | Creating missing vms: diego_cell/cf1f94cd-790b-407b-bedb-eea685604974 (15) (00:53:46)
                  L Error: Unknown CPI error 'Unknown' with message 'The object 'vim.dvs.DistributedVirtualPortgroup:dvportgroup-22218' has already been deleted or has not been completely created' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-432182')
Task 96 | 09:05:56 | Creating missing vms: diego_cell/57211cee-682c-4023-b84e-77331e12ac5c (17) (00:53:46)
                  L Error: Unknown CPI error 'Unknown' with message 'The object 'vim.dvs.DistributedVirtualPortgroup:dvportgroup-22218' has already been deleted or has not been completely created' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-668394')
Task 96 | 11:25:18 | Creating missing vms: diego_cell/06d316fe-1269-4b41-bd2c-d48f180ed3fe (18) (03:13:08)
Task 96 | 11:28:38 | Creating missing vms: diego_cell/205ec15c-0873-4079-8d06-68df49bd8c00 (13) (03:16:28)
Task 96 | 11:30:39 | Creating missing vms: diego_cell/3e9a5f94-21be-470a-a8b8-de54b35486f8 (10) (03:18:29)
Task 96 | 11:31:57 | Creating missing vms: diego_cell/37d64fad-c101-4fc3-a2d1-bb378e7e85d4 (19) (03:19:47)
Task 96 | 11:31:57 | Error: Unknown CPI error 'Unknown' with message 'The object 'vim.dvs.DistributedVirtualPortgroup:dvportgroup-22218' has already been deleted or has not been completely created' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-432182')
Task 96 Started  Thu Sep 22 08:09:19 UTC 2022
Task 96 Finished Thu Sep 22 11:31:57 UTC 2022
Task 96 Duration 03:22:38
Task 96 error

Updating deployment:
 Expected task '96' to succeed but state is 'error'
Exit code 1
===== 2022-09-22 11:31:57 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.2.21 --deployment=cf-e46963be09f30ce93dca deploy --no-redact /var/tempest/workspaces/default/deployments/cf-e46963be09f30ce93dca.yml"; Duration: 12201s; Exit Status: 1
Exited with 1.
Exited with 1.
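
For triage, the repeated failures in the task output above can be grouped by object reference and CPI method. The following is a minimal sketch, not part of the CPI or the BOSH CLI: the function name `summarize_cpi_errors` and the regular expression are assumptions written against the exact error wording shown in this task output, and would need adjusting if the message format differs.

```python
import re
from collections import Counter

# Matches the stale-object error line exactly as it appears in the task output.
# The pattern is an assumption derived from that output, not a CPI contract.
CPI_ERROR = re.compile(
    r"message 'The object '(?P<obj>[^']+)' has already been deleted "
    r"or has not been completely created' in '(?P<method>\w+)' CPI method "
    r"\(CPI request ID: '(?P<request_id>[^']+)'\)"
)


def summarize_cpi_errors(lines):
    """Count stale-object CPI errors per (object reference, CPI method).

    Returns the counts plus the sorted list of distinct CPI request IDs,
    which can then be looked up in the attached CPI log.
    """
    counts = Counter()
    request_ids = set()
    for line in lines:
        match = CPI_ERROR.search(line)
        if match:
            counts[(match["obj"], match["method"])] += 1
            request_ids.add(match["request_id"])
    return counts, sorted(request_ids)
```

Passing the lines of the task output to `summarize_cpi_errors` shows whether every failure points at the same port group and the same CPI method, and yields the request IDs to search for in the CPI log.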

We have more than 600 orgs/logical segments created in vCenter, and deleting those logical segments (LS) during the deployment causes the error above.

This looks related to https://github.com/cloudfoundry/bosh-vsphere-cpi-release/pull/332, which was supposed to be fixed in the CPI with BOSH 2.10.46-build.541.

To Reproduce

Steps to reproduce the behavior: run org/network churn while BOSH is creating or updating VMs.
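
The churn side of the reproduction could look roughly like the sketch below. It is not the tool used in this report: the org name prefix, the count, and the helper names `churn_commands` and `run_churn` are hypothetical, and it assumes a logged-in `cf` CLI and an environment where each org is backed by its own logical segment, so that deleting the org removes the segment. It would be run in parallel with the BOSH deploy.

```python
import subprocess


def churn_commands(prefix, count):
    """Build cf CLI commands that create `count` orgs and then delete them.

    The org names are hypothetical and derived from `prefix`.
    """
    names = [f"{prefix}-{i}" for i in range(count)]
    creates = [["cf", "create-org", name] for name in names]
    # -f skips the interactive confirmation prompt of `cf delete-org`.
    deletes = [["cf", "delete-org", name, "-f"] for name in names]
    return creates + deletes


def run_churn(prefix="churn-org", count=5, runner=subprocess.run):
    """Run one create/delete cycle; `runner` is injectable for dry runs."""
    for cmd in churn_commands(prefix, count):
        # check=False: keep churning even if a single command fails.
        runner(cmd, check=False)
```

Calling `run_churn()` in a loop while the deployment is creating VMs approximates the churn described here. Passing a custom `runner` that only records the commands allows a dry run without touching the environment.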

CPI Error Log

Attached CPI error logs: task_96_cpi.txt
Attached BOSH director logs: bosh_logs.tgz
Attached debug logs: task_96_debug.txt

Expected behavior

Any deployment, whether a TAS upgrade, an NCP upgrade, or Diego cell scaling, should succeed while org/network churn is running.

Screenshots

Attached screenshots: OPSMAN.png, ERROR_LOG.png

Release Version & Related Info (please complete the following information):

CPI Version:
BOSH Director Version: 2.10.46-build.541
Stemcell Name & Version: ubuntu-xenial/621.261
vCenter Version: 7.0.3.18700403

Additional context

Because of the issue above, the deployment ran longer than usual (Task 96 took 03:22:38) and eventually failed with the error shown above.

Attached files:

task_96_cpi.txt

bosh_logs.tgz

task_96_debug.txt

cloudfoundry / bosh-vsphere-cpi-release

TAS deployment failure during network churn #336