Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.
MIT License
114 stars 177 forks source link

Provide a pattern to wait for RBAC propagation #1567

Open pakrym opened 3 years ago

pakrym commented 3 years ago

We have multiple services that hit a problem where their live tests are failing because RBAC permissions were not propagated in time. We also had many ways each of them tries to solve the problem:

By waiting pre-defined amount of time: https://github.com/Azure/azure-sdk-for-net/blob/master/sdk/eventhub/test-resources-post.ps1?rgh-link-date=2021-04-21T00%3A02%3A41Z#L15-L16

Or having retry on tests: https://github.com/Azure/azure-sdk-for-net/pull/20559

To me, the goal of New-TestResources is to leave the test environment fully ready for the test run and non-propagated RBAC is something we should solve in a centralized way.

pakrym commented 3 years ago

cc @benbp @heaths

kasobol-msft commented 3 years ago

https://github.com/Azure/azure-sdk-for-net/issues/17384

heaths commented 3 years ago

There is a retry already, and the Az cmdlets have a back off period (doubles).

pakrym commented 3 years ago

It's not RBAC for the newly-created resource group it's for the resources created as part of template deployment.

heaths commented 3 years ago

I see. But why make everyone wait if not necessary? That was the whole reason of test-resources-post.ps1 - to allow service-specific actions post deployment. For Key Vault, for example, I have to activate Managed HSM - something that can't be done via ARM templates by design. Most services don't have this problem. In fact, RBAC assignments for Managed HSM don't take this long (they are data-plane RBAC that mimic control-plane RBAC, so perhaps that's why).

The purpose of the scripts is you want is still met. Users don't run test-resources-(pre|post).ps1 themselves. The scripts do that, so users just run the scripts for a service and it all works out in the end.

pakrym commented 3 years ago

I'm saying that unconditionally waiting is a bad solution and we need something better to recommend to partners.

heaths commented 3 years ago

The test-resources-post.ps1 workaround seems good. That's why those scripts are used if present. I could update documentation to detail some example uses if you think that would help.

pakrym commented 3 years ago

The test-resources-post.ps1 workaround seems good.

I disagree, it's slow and unreliable. It doesn't guarantee that RBAC propagates.

weshaggard commented 3 years ago

I agree we should at least come up with a pattern for this RBAC propagation issue so that folks can use that in their post scripts if needed. @pakrym do you know if there is any way to query the status?

heaths commented 3 years ago

This is probably a better question for the ARM team. I'm not aware of any way to make sure RBAC has propagated, and it may well depend on some RPs' implementation. Still, I doubt this is something we could put in the New-TestResources.ps1 script unless there's some obvious way of detecting that RBAC permissions were even assigned - which can be done in test-resources-post.ps1, which Key Vault is doing for Managed HSM test resources.

heaths commented 3 years ago

There is Get-AzRoleAssignment. A pattern in docmentation or even inheritted (via scope) function could maybe loop on that. Initially, I'm in favor of documenting a pattern since providing all the data necessary may be more verbose and cause compat issues later if we have to change it.

pakrym commented 3 years ago

. @pakrym do you know if there is any way to query the status?

It seems that you have to actually call the service to know. @kasobol-msft is working on a prototype that allows partners to define a way to check if the environment is ready as part of the .net testing framework. We would make a simple call and see if it fails with auth error and keep retrying.

kasobol-msft commented 3 years ago

@heaths Problem is that Get-AzRoleAssignment might give you the assignments but they might still in flight to arrive at storage service side. If storage backend hasn't yet consumed the change then calls to storage fail with 403.

I don't like idea of waiting for RBAC at resource provisioning step. That time can be used to advance pipeline (especially that pessimistic case is 5 minutes). Other thing is that I'd need to build something that probes some API and checks for 403s - this is easier with SDK in hand.