Closed ae-govau closed 1 year ago
@ragaskar Could you take a look at this?
Hi:
Thanks for letting us know about this and apologies it has impacted you without warning. There's been a number of changes to group fetching through the Policy API this year. Originally, the prior code was using "group_id" to retrieve Groups. This was different from how the Managment API retrieved NS Groups (on display_name), and also would cause issues when the display name did not match the ID so we updated them to be in sync.
Later this year, we became concerned this would break users who previously were able to use a group_id in order to fetch groups, so we recently made the commit that is causing you problems. That said, this fetch used "id" vs. "group_id" and I suspect the other parameter may have more flexibility in what it accepts (the API documentation here is not entirely clear). Clearly, given your experience it is not working well. I'll take a look immediately to see if we can fix this.
FWIW, BOSH itself does not create Groups itself (although it is possible that other software using BOSH may be creating Groups separately via the API).
For now I've reverted this change and will add a note to the release alerting users.
I am still concerned that the changes we made in 2d28f5d236d8e24432bd190367447d1074a9be81 may affect some users and so we will investigate ways to ensure that users who happen to be passing IDs will still be supported (whether it be via fallback or some other means). I wonder if you would be willing to help us test these changes once we have a candidate build?
One last thing: do you mind sharing the display_names/IDs of groups you've experienced issues with? I'd like to add these as examples to our test suite.
Thanks for letting us know about this and apologies it has impacted you without warning. There's been a number of changes to group fetching through the Policy API this year. Originally, the prior code was using "group_id" to retrieve Groups. This was different from how the Managment API retrieved NS Groups (on display_name), and also would cause issues when the display name did not match the ID so we updated them to be in sync.
Thanks for getting back to us. We may have dodged that previous code, as we may have still been on NSX-V at that time.
Later this year, we became concerned this would break users who previously were able to use a group_id in order to fetch groups, so we recently made the commit that is causing you problems. That said, this fetch used "id" vs. "group_id" and I suspect the other parameter may have more flexibility in what it accepts (the API documentation here is not entirely clear). Clearly, given your experience it is not working well. I'll take a look immediately to see if we can fix this.
Thanks. The names of the NSX security groups applied to VMs by BOSH is big part of our system boundary in terms of how we interact with other teams, as that gets put into network firewall rules etc. That's why a change like this is somewhat significant to us.
FWIW, BOSH itself does not create Groups itself (although it is possible that other software using BOSH may be creating Groups separately via the API).
The BOSH vSphere CPI most certainly does create NSX groups, at least within NSX-V environments. I don't know whether that's still true in NSX-T. See: https://github.com/cloudfoundry/bosh-vsphere-cpi-release/blob/003c508d70001a44ab3c5179159ff72cb43da6c5/src/vsphere_cpi/lib/cloud/vsphere/nsx.rb#L27
I am still concerned that the changes we made in 2d28f5d may affect some users and so we will investigate ways to ensure that users who happen to be passing IDs will still be supported (whether it be via fallback or some other means). I wonder if you would be willing to help us test these changes once we have a candidate build?
Yes, we'd be willing to help test. We can do that easily if there's a test BOSH release (tarball) somewhere we can point to (ideally from your CI system) - if it's source only, that's harder for us.
One last thing: do you mind sharing the display_names/IDs of groups you've experienced issues with? I'd like to add these as examples to our test suite.
So one thing I'll mention, we deploy CF to a number of different environments, and we use the isolation segment feature in CF to further segment. We generate our Cloud Config by a script, including the env name and other things in that name, so that's another reason why we want to refer to them by a name, and not an ID that isn't under our control. In this particular case I suspect the groups may have been created for us ahead of time (when firewall rules were being created that utilise them).
e.g., the problematic one looked like this:
Name: cf-test-application-foo-is
ID: security-group-1234
Thanks for the update!
I have a new release for you with the problematic commit reverted: https://github.com/cloudfoundry/bosh-vsphere-cpi-release/releases/tag/v84
Please let me know if you still have trouble.
The BOSH vSphere CPI most certainly does create NSX groups, at least within NSX-V environments. I don't know whether that's still true in NSX-T. See:
Ah! I admit I am relatively new to the vSphere CPI, although I have looked through quite a bit of the NSX-T CPI code. Again, I do not think we actively create Groups when using the Policy API with NSX-T, but I will definitely double-check that assumption after seeing the code snippet you shared with me.
I'm a little taken aback that the NSX-V behavior has wound up having an effect/reverberations on the NSX-T behavior. Am I understanding correctly that you have migrated from NSX-V to NSX-T, all under BOSH management? Just want to make sure I understand the possible use case so I can consider it when we are making any future changes.
Yes, we'd be willing to help test. We can do that easily if there's a test BOSH release (tarball) somewhere we can point to (ideally from your CI system) - if it's source only, that's harder for us.
Thank you so much! Yes, I can cut you a tarball. I have some changes that I think will address the concern I have and work for your use case. Once I ensure they pass our tests I'd love to have you try them out.
Thanks for sharing the example Group name/id: that helps us understand how the CPI is used in the wild.
And again, thanks so much for reporting this! I think this has helped us avoid breaking other folks. Again, apologies we broke you.
I have a new release for you with the problematic commit reverted: https://github.com/cloudfoundry/bosh-vsphere-cpi-release/releases/tag/v84
Thanks, we'll aim to bump to that next week - for now we pinned back to v81 and that has fixed our environments.
I'm a little taken aback that the NSX-V behavior has wound up having an effect/reverberations on the NSX-T behavior. Am I understanding correctly that you have migrated from NSX-V to NSX-T, all under BOSH management?
Yes - we have a number of BOSH environments, and some of them are on systems that started as NSX-V and were then updated to NSX-T - we had to make corresponding cloud-config and other changes at the time.
I do suspect though that this particularly environment is one where we may have supplied the required firewall rules (which referenced NSX security group names) ahead of time, before we deployed Cloudfoundry there.
As such the team that created the firewall rules likely did so via the UI, and perhaps that explains why the name (which we supplied) doesn't match the ID (which perhaps NSX-V generated).
In some of our other environments, perhaps the BOSH generated IDs are the same as names, in cases where BOSH created the groups. I'm not able to explore those right now.
And again, thanks so much for reporting this! I think this has helped us avoid breaking other folks. Again, apologies we broke you.
You're most welcome - we are always happy to report issues to teams that are receptive to helping us work through them.
Newer versions resolved this for us. Happy to close out from our end.
Describe the bug
https://github.com/cloudfoundry/bosh-vsphere-cpi-release/pull/341 was recently merged, and we are no longer able to deploy VMs.
To Reproduce
Steps to reproduce the behavior:
I'm pretty sure that all of our NSX security groups were created by BOSH itself against NSX (probably NSXV, before we updated to T), but it's possibly they were created manually through the UI.
Either way, this is a very disruptive breaking change - and really should be called as such.
Can we consider falling back to searching by name if none match by ID?
CPI Error Log
Expected behavior
It to work like it used to.
Release Version & Related Info (please complete the following information):