Bug Report - Enable DDoS Std policy requires permissions on the central plan for workload owners

Azure / Enterprise-Scale

The Azure Landing Zones (Enterprise-Scale) architecture provides prescriptive guidance coupled with Azure best practices, and it follows design principles across the critical design areas for organizations to define their Azure architecture

https://aka.ms/alz

MIT License

1.72k stars 975 forks source link

Bug Report - Enable DDoS Std policy requires permissions on the central plan for workload owners #1053

Closed pazdedav closed 1 year ago

pazdedav commented 2 years ago

Describe the bug When a workload owner of an Online LZ subscription wants to create a new Virtual Network, they get an error similar to this:

(LinkedAuthorizationFailed) The client 'XX' with object id 'c0602d84-4705-405c-a897-0c4718407f9b' has permission to perform action 'Microsoft.Network/virtualNetworks/write' on scope '/subscriptions/{guid}/resourceGroups/bug25903-test-rg/providers/Microsoft.Network/virtualNetworks/dapazd-vnet'; however, it does not have permission to perform action 'Microsoft.Network/ddosProtectionPlans/join/action' on the linked scope(s) '/subscriptions/{guid}/resourceGroups/eacp-ddos/providers/Microsoft.Network/ddosProtectionPlans/eacp-ddos-norwayeast' or the linked scope(s) are invalid. Code: LinkedAuthorizationFailed Message: The client 'X' with object id 'c0602d84-4705-405c-a897-0c4718407f9b' has permission to perform action 'Microsoft.Network/virtualNetworks/write' on scope '/subscriptions/{guid}/resourceGroups/bug25903-test-rg/providers/Microsoft.Network/virtualNetworks/dapazd-vnet'; however, it does not have permission to perform action 'Microsoft.Network/ddosProtectionPlans/join/action' on the linked scope(s) '/subscriptions/{guid}/resourceGroups/eacp-ddos/providers/Microsoft.Network/ddosProtectionPlans/eacp-ddos-norwayeast' or the linked scope(s) are invalid.

I assume this error is caused by the fact that ALZ reference implementation (accelerators) is using a centralized DDoS protection plan deployed to the Connectivity subscription, where a regular workload owner doesn't have any access. The requirement for such permission is documented here.

How can this issue be addressed? The obvious option is to grant Microsoft.Network/ddosProtectionPlans/join/action permission (e.g., by creating a custom role) for the central DDoS Plan to all workload owners / users but this might not scale well, since people will come and go.

Screenshots

jtracey93 commented 2 years ago

Hey @pazdedav,

Thanks for raising.

Couple of questions if you don't mind

Can you confirm the object ID of: c0602d84-4705-405c-a897-0c4718407f9b is the user trying to create the VNET and would be the app/workload owner for example?
Do you think creating a custom role definition with the below actions and assigning it over the DDoS Plan in the Connectivity Subscription is something customers would be okay with as the fix from a security standpoint?
- Microsoft.Network/ddosProtectionPlans/read - so they can see it at least and create support tickets against it if required?
- Microsoft.Network/ddosProtectionPlans/join/action - so they can perform the join action for their VNETs?

Alternatives

Discuss with the DDoS PG (who own the policy) about making this a DINE policy instead so its always triggered after the VNET is created and therefore the Managed Identity assigned to the DINE policy assignment can have the required permissions only and not giving RBAC to other users as per suggestions above
- We could also make it an ALZ custom DINE policy in the gap if PG conversation takes too long

Let me know your thoughts

Cheers

Jack

pazdedav commented 2 years ago

Hi @jtracey93:

I can confirm that the objectID from the screenshot/error message was the user owning the LZ with standard Contributor role in that subscription (but no permissions on the Connectivity sub).
Custom roles are tricky (harder to use), and we tried to stay away from those as long as we can. But even if we go that path, this will still require a complex management of role assignment, since all current and new LZ owners/users would need this role assignment.
I am not sure if changing this policy to DINE is going to fix it, since the user needs to have permissions to execute the "original new VNet request" that is blocked due to missing perms. on the DDoS Plan, so a DINE policy wouldn't even kick in, IMHO.

jtracey93 commented 2 years ago

Thanks @pazdedav

On the DINE side as it's a post deployment operation rather than part of the initial create operations (which happens as the user making the request) then the managed identity associated to the DINE policy would be the only identity that needs permissions.

You could also do the same for modify as a remediation task as they work the same as DINE. But the issue here is that modify is trying to modify the PUT request as the user makes it which means it's seen as the user making the PUT rather than a managed identity (as would be used if it was a remediation task).

Hence my suggestion around DINE as would always happen post VNET creation as the managed identity which could be granted permissions.

On the custom RBAC though I think this could be easily done with groups to assign all LZ owners too. One would likely already exist for potential communications around the companies azure platform for example.

Thoughts with these pieces of additional information?

pazdedav commented 2 years ago

Thanks for additional input.

I misunderstood you first, but I got it now. You would like the PG to create a new built-in DINE policy that would enable a centralized DDoS plan, so we could remove the assignment of current Modify policy that is blocking users from even creating VNets. I wonder if we are dependent on the PG here or if this could be done by the ALZ team in the same way that all other ALZ policies were created ;)

Our subscription vending process is creating new AAD groups and assigns the requestor as owner (to manage group membership), so we would either need to do an extra role assignment of that custom role for the plan as part of the workflow or we would need to add this group to some top-level group that would have that role assignment done in advance. We (the platform team) don't know our customers/users in advance, unless we want to create some dynamic AAD group for essentially all (Azure) users in the company.

I thought for a while that alternatively, we could have an event-driven automation that would add each new VNet to the plan asynchronously (not by using a policy but with Event Grid + Functions), but this wouldn't work, since every change of VNet properties done by the user would still require the join permission on that plan. Just thinking about it, wouldn't this requirement mean that the DINE policy alone won't work? Even if it adds the DDoS plan property after the initial request passes, any subsequent VNet modification by a user would fail due to the lack of that join permission?

jtracey93 commented 2 years ago

The DINE policy would run all its activities as its own identity so as long as that identity alone is granted the join action on the DDoS plan and has permissions on the VNETs we are all good. Nothing would run as the user with DINE.

And you are right, we could investigate doing it ourselves in ALZ, but we try to avoid where possible as built-in makes life easier for all involved (us, customers, etc.) - (I hinted towards this in my first reply 😁)

@pazdedav let me know your thoughts and if we should look into the DINE route in your opinion?

pazdedav commented 2 years ago

Thank you @jtracey93. I am however still struggling to see, how the "DINE route" alone would work without assigning a custom role with join permissions to users.

This is how I envision this workflow would work (or not):

A user in a landing zone (without any permissions on the DDoS plan) would create a new VNet. Let's say in the Portal. Since we would remove the current Modify policy for DDoS, this request would be accepted and processed.
A DINE policy would kick in with a Managed Identity (and permissions to both the landing zone and the DDoS plan) and update the VNet with DDoS plan property (join operation). So far so good.
When a user wants to modify properties of that VNet later, this operation/request would, IMO, fail, because Once a DDoS Protection Plan has been enabled on a Virtual Network, subsequent operations on that Virtual Network still require the Microsoft.Network/ddosProtectionPlans/join/action action permission. as documented here. The user trying to make that change does not have that permission and the DINE policy would not be applied (since the DDoS property already exist on the VNet resource).

Perhaps I misunderstood that requirement from the DDoS documentation, but I guess we would know for sure once we have that DINE policy in a test environment :)

Another interesting question is, what would that DINE policy deploy? Both the DDoS plan and the VNet would already exist.

jtracey93 commented 2 years ago

Ah that is a very important point i glazed over @pazdedav. You are indeed correct about the note of:

Therefore, DINE would work initially but subsequent updates to the VNET would be an issue I suspect from that note (something to test though)

It'll all depend on how the VNET RP handles a partial put of just the DDoS plan property.

As you say one to test. Do you fancy testing this out and reporting back? (fine if not, but thought you may be interested?)

pazdedav commented 2 years ago

I could try to test a scenario (without having a DINE policy in place), where I would have two users:

One with rights on both the LZ and DDoS plan to create the VNet and "join" it to the plan.
Another user with e.g. Contributor rights in the LZ but not on the DDoS plan. I would modify some properties of the VNet to see what happens.

What do you think about my last comment, @jtracey93 ?

Another interesting question is, what would that DINE policy deploy? Both the DDoS plan and the VNet would already exist.

Even if we decide to create a new DINE policy, it is not clear to me what resource it would deploy, since both components would be in place already.

jtracey93 commented 2 years ago

@pazdedav If you can test my suggestion 2 (that I shared above here: https://github.com/Azure/Enterprise-Scale/issues/1053#issuecomment-1254131060) around RBAC that would be great.

At least thats a workaround validated 👍

As for the DINE stuff it will be the VNET that we would do a PUT on to add the DDoS plan to the VNET. But this would also need testing. Hence my comment around the VNET RP and wondering how it would behave. We can test that next 👍

JefferyMitchell commented 2 years ago

@pazdedav How did the testing go?

aparnabhat-gh commented 1 year ago

Hi @jtracey93

I was able to test this policy today and here are my findings:

As described in the above comments, assigning the policy in Modify mode only works if the user creating the VNet has atleast Microsoft.Network/ddosProtectionPlans/join/action permission on the DDOS plan.
For existing VNets, one needs to create the remediation task to add DDOS Plan to existing VNets. This works as the Managed Identity created during the policy assignment is added with Network Contributor role on the scope.
I was able to create a custom role with just Microsoft.Network/ddosProtectionPlans/join/action permission with Modify policy enabled and Vnet creation by the user (network contributor role on just the test resource group)succeeded. Although I did not try the cli/powershell, however via portal disabling the DDOS plan from Vnet errored out with the message that user doesnot have join permission on DDOS Plan ( I guess the error message needs to be updated with correct permission )
After adding the VNet to the DDOS Plan either manually or using the remediation task, the user (with network contributor role to the ResourceGroup/vnet & no access to DDOS Plan) was unable to add a new address space to the existing VNET VNET, however adding a new subnet/modify subnet/ adding bastion host (public IP for the bastion was added to the DDOS Plan) succeeded.

So, we can:

Enable DINE (DeployifnotExists) mode for the policy definition with option of automatically creating the remediation task for non-complaint resources to add existing VNETs to the DDOS Plan, however, this would create a limitation for users with no access to the plan to not able to do certain tasks on the VNET ( e.g adding new address space). I am not sure if there are other tasks that would fail. So we will need to add in the documentation that users managing VNETs will need the join permission on the DDOS Plan unless there is a way to work around it.

For customers, that have a central team managing the VNETs adding DDOS Plan might not be an issue, however for customer going with subscription democratisation, they would not want to give this access to application users.

jtracey93 commented 1 year ago

Thanks @aparnabhat-gh, for the work and detail here, much appreciated.

We discussed as a team on our weekly policy scrum and have some things we'd like to investigate before committing to a direction forward here and/or reach out to PGs for DDoS. The action plan is listed below:

We will update this issue with our findings and next steps

Action Plan

We will create a custom DINE policy to apply a DDoS Network Protection Plan to vNets that do not have it set and then test this policy
- The testing is the most important part here due to the vNet resources behaviour from an ARM/API perspective. We need to test the following scenarios:
  1. When a blank vNet (no subnets) has no DDoS Network Protection Plan associated, does the DINE policy remediate it correctly?
  2. When a blank vNet (no subnets), but has some other vNet properties set (like DNS Servers, Additional Address Space, flowTimeoutInMinutes, bgpCommunities) has no DDoS Network Protection Plan associated, does the DINE policy remediate it correctly and not amend the other set vNet properties?
  3. When a vNet with some subnets that have resources in them like a VM and with some subnets that are empty, has no DDoS Network Protection Plan associated, does the DINE policy remediate it correctly and not amend the subnets?
  4. When a vNet with some subnets that have resources in them like a VM and with some subnets that are empty, but has some other vNet properties set (like DNS Servers, Additional Address Space, flowTimeoutInMinutes, bgpCommunities) has no DDoS Network Protection Plan associated, does the DINE policy remediate it correctly and not amend the other set vNet properties?

Why do we need to test all this?

Mainly due to this known issue/behaviour on the vNet resource:

I personally suspect a DINE policy may not be possible here due to it also amending some additional properties, unless we create a complex policy to union a lot of the properties to ensure they aren't overridden. Hence the need for testing heavily.

cc: @paulgrimley & @Springstone for as we discussed for creation of a user story and assignment for testing etc.

paulgrimley commented 1 year ago

Story created for policy refresh planning AB#30269

Springstone commented 1 year ago

We have confirmed that using DINE is destructive and potentially dangerous to use to apply changes to a virtual network, (e.g. enabling only a DDOS plan on a virtual network, will reset all other parameters incl. address space, subnets, etc). We have escalated the issue to PG, but have no updates as yet.

Springstone commented 1 year ago

We've tried multiple paths to ask engineering to improve this experience and provide additional features/mechanisms to enable the scenario you've highlighted, but unfortunately our feedback is not being prioritized. We'll keep up the pressure, but in the short term we are dependent on engineering enhancements. You can help by raising support tickets for the same issue, having customer evidence helps move things along.

As ALZ team cannot improve the experience directly, will be closing the issue. Feel free to reopen or open a new one should you have further input.