Open pakrym opened 3 years ago
cc @benbp @heaths
There is a retry already, and the Az cmdlets have a back off period (doubles).
It's not RBAC for the newly-created resource group it's for the resources created as part of template deployment.
I see. But why make everyone wait if not necessary? That was the whole reason of test-resources-post.ps1 - to allow service-specific actions post deployment. For Key Vault, for example, I have to activate Managed HSM - something that can't be done via ARM templates by design. Most services don't have this problem. In fact, RBAC assignments for Managed HSM don't take this long (they are data-plane RBAC that mimic control-plane RBAC, so perhaps that's why).
The purpose of the scripts is you want is still met. Users don't run test-resources-(pre|post).ps1 themselves. The scripts do that, so users just run the scripts for a service and it all works out in the end.
I'm saying that unconditionally waiting is a bad solution and we need something better to recommend to partners.
The test-resources-post.ps1 workaround seems good. That's why those scripts are used if present. I could update documentation to detail some example uses if you think that would help.
The test-resources-post.ps1 workaround seems good.
I disagree, it's slow and unreliable. It doesn't guarantee that RBAC propagates.
I agree we should at least come up with a pattern for this RBAC propagation issue so that folks can use that in their post scripts if needed. @pakrym do you know if there is any way to query the status?
This is probably a better question for the ARM team. I'm not aware of any way to make sure RBAC has propagated, and it may well depend on some RPs' implementation. Still, I doubt this is something we could put in the New-TestResources.ps1 script unless there's some obvious way of detecting that RBAC permissions were even assigned - which can be done in test-resources-post.ps1, which Key Vault is doing for Managed HSM test resources.
There is Get-AzRoleAssignment
. A pattern in docmentation or even inheritted (via scope) function could maybe loop on that. Initially, I'm in favor of documenting a pattern since providing all the data necessary may be more verbose and cause compat issues later if we have to change it.
. @pakrym do you know if there is any way to query the status?
It seems that you have to actually call the service to know. @kasobol-msft is working on a prototype that allows partners to define a way to check if the environment is ready as part of the .net testing framework. We would make a simple call and see if it fails with auth error and keep retrying.
@heaths Problem is that Get-AzRoleAssignment
might give you the assignments but they might still in flight to arrive at storage service side. If storage backend hasn't yet consumed the change then calls to storage fail with 403.
I don't like idea of waiting for RBAC at resource provisioning step. That time can be used to advance pipeline (especially that pessimistic case is 5 minutes). Other thing is that I'd need to build something that probes some API and checks for 403s - this is easier with SDK in hand.
We have multiple services that hit a problem where their live tests are failing because RBAC permissions were not propagated in time. We also had many ways each of them tries to solve the problem:
By waiting pre-defined amount of time: https://github.com/Azure/azure-sdk-for-net/blob/master/sdk/eventhub/test-resources-post.ps1?rgh-link-date=2021-04-21T00%3A02%3A41Z#L15-L16
Or having retry on tests: https://github.com/Azure/azure-sdk-for-net/pull/20559
To me, the goal of
New-TestResources
is to leave the test environment fully ready for the test run and non-propagated RBAC is something we should solve in a centralized way.