We're in agreement here on the general approach for carving up the subnets.
I'll work up some detail as part of #589 and then loop back here.
I've updated the main comment to make the subnets more specific to a particular type of service/infrastructure.
Attaching an NSG to each subnet will make it easy to manage inter-subnet and inter-VPN security, as the rules will be written for the subnet's address prefix.
Likewise, this approach makes the perimeter firewall rules relatively easy to write, because everything is at the subnet level.
From a security viewpoint, isolating ETL infrastructure for production (MS SQL) databases is a good step. ETL can talk to wherever the source data comes from and to the prod database subnet, but the prod database subnet can't talk directly to the infra where the source data is sitting.
The artefacts I've committed to the uhb-deployment branch support rolling up a subnet and an NSG into a single composition. ARM incremental updates can then be leveraged to change the NSG rules.
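The ARM artefacts themselves aren't reproduced here, but the same roll-up is easy to sketch in Pulumi (the direction this issue eventually took, per the closing comment). A minimal sketch, assuming `pulumi_azure_native`; all resource names and CIDRs are illustrative, and the example Deny rule mirrors the ETL/prod-database restriction above:

```python
# Minimal sketch, assuming pulumi_azure_native; names, CIDRs and the example
# rule are illustrative, not taken from the uhb-deployment artefacts.
import pulumi
from pulumi_azure_native import network


class SubnetWithNsg(pulumi.ComponentResource):
    """Roll a subnet and its NSG into a single composition, so that every
    rule is written against the subnet's own address prefix."""

    def __init__(self, name, resource_group, vnet, prefix, rules, opts=None):
        super().__init__("shm:network:SubnetWithNsg", name, None, opts)
        nsg = network.NetworkSecurityGroup(
            f"{name}-nsg",
            resource_group_name=resource_group,
            security_rules=rules,
            opts=pulumi.ResourceOptions(parent=self),
        )
        self.subnet = network.Subnet(
            name,
            resource_group_name=resource_group,
            virtual_network_name=vnet,
            address_prefix=prefix,
            network_security_group=network.NetworkSecurityGroupArgs(id=nsg.id),
            opts=pulumi.ResourceOptions(parent=self),
        )


# Example rule in the spirit of the ETL comment: the (hypothetical) prod
# database subnet may not open connections to the (hypothetical) source-data
# subnet; both prefixes below are made up for illustration.
deny_prod_db_to_source = network.SecurityRuleArgs(
    name="DenyProdDbToSourceData",
    priority=1000,
    direction="Inbound",
    access="Deny",
    protocol="*",
    source_address_prefix="10.0.4.0/24",       # hypothetical prod DB subnet
    source_port_range="*",
    destination_address_prefix="10.0.5.0/24",  # hypothetical source-data subnet
    destination_port_range="*",
)
```

Because rules reference whole subnet prefixes rather than individual VM IPs, redeploying or adding a VM inside a subnet needs no NSG change.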
Please avoid using a subnet named "GatewaySubnet" for anything other than VPN or ER Gateway use (it is reserved by Azure for this purpose).
@martintoreilly: see my updated proposal in the top-level comment.
Looks good in general. I'd tend to make the subnets smaller where we know there will only be a handful of VMs (e.g. /27s rather than /25s for things like the RDS gateway, SHM DCs, NPS etc.).
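For reference, Azure reserves 5 IP addresses in every subnet, so the sizes under discussion work out as below (a quick check with Python's ipaddress module):

```python
from ipaddress import ip_network

AZURE_RESERVED = 5  # network + broadcast + 3 Azure-reserved addresses per subnet

def usable(cidr: str) -> int:
    """Usable IPs in an Azure subnet of the given size."""
    return ip_network(cidr).num_addresses - AZURE_RESERVED

print(usable("10.0.0.0/27"))  # 27  -- plenty for a handful of VMs
print(usable("10.0.0.0/26"))  # 59
print(usable("10.0.0.0/25"))  # 123
print(usable("10.0.0.0/24"))  # 251
```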
I'd also suggest we separate the web app servers into their own subnet so we can restrict access to HTTP/S from everywhere except the admin VPN subnet. Thinking about locking down the Gitlab review and internal servers, I'm wondering if the HackMD server should be on a different subnet from the internal Gitlab servers. I suspect we may also find that we want to separate the next VM we put in an airlock from the Gitlab review VM.
Edit: though I guess we could have a basic set of rules that apply to all VMs in the web app subnet (e.g. no outbound, HTTP/S only from the user subnet, SSH + RDS from the admin VPN subnet) then specific exemptions for each VM that needs them (e.g. HTTP/S and SSH access from Gitlab review for Gitlab internal; SSH from the user subnet for Gitlab internal; RDS from the researcher app RDS SH for Gitlab internal and HackMD; RDS from the reviewer app RDS SH for Gitlab review, etc.).
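One way to keep that manageable is to treat the common rules and the per-VM exemptions as data and merge them by priority. A hypothetical sketch; every name, port and priority below is illustrative rather than taken from the repo:

```python
# Base rules applied to every VM in the web app subnet, plus per-VM
# exemptions layered on top at higher (numerically lower) priority.
# All names, subnets and ports here are illustrative placeholders.
BASE_WEBAPP_RULES = [
    {"name": "DenyAllOutbound", "direction": "Outbound", "access": "Deny",
     "priority": 4000, "source": "*", "ports": "*"},
    {"name": "AllowHttpsFromUsers", "direction": "Inbound", "access": "Allow",
     "priority": 1000, "source": "<user subnet>", "ports": "443"},
    {"name": "AllowSshRdsFromAdminVpn", "direction": "Inbound", "access": "Allow",
     "priority": 1100, "source": "<admin VPN subnet>", "ports": "22,3389"},
]

EXEMPTIONS = {
    "gitlab-internal": [
        {"name": "AllowHttpsSshFromGitlabReview", "direction": "Inbound",
         "access": "Allow", "priority": 900,
         "source": "<Gitlab review VM>", "ports": "22,443"},
    ],
}

def rules_for(vm: str) -> list:
    """Effective rule set for one VM: exemptions first, then the base rules."""
    return sorted(EXEMPTIONS.get(vm, []) + BASE_WEBAPP_RULES,
                  key=lambda r: r["priority"])
```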
One thing that we'll have to consider is how to deal with deployment into a VNet that already has an NSG attached. Possibilities are:
- open up a temporary hole in the NSG allowing outbound access from and disabling VNet access from (possibly with another explicit exemption for the SHM DC or NPS server as appropriate)
- deploy into a separate VNet with a more permissive NSG. However, this VNet will still need to be peered to the SHM VNet for domain joining, and changing the IP address post-deployment will break the domain join (plus potentially some other things) anyway
- something else?
I think that (1) is the best/easiest thing to do, but would like to know what you think.
(1) sounds ok to me as long as we get the sequencing right. Something like?
@martintoreilly The main justifications behind my choices of CIDRs were:
If you want to suggest any specific changes to this scheme that's fine with me.
@jemrobinson @JimMadge Prompted by conversations on the "SRE index" work I did in PR #786, are we happy with limiting ourselves to a maximum of ~255 SREs per SHM by moving to an SRE virtual network CIDR as large as a /16? While we're not hitting that limit here, we might if we choose to offer a "Safe Haven as a service" (or if someone else did, or if a larger organisation wanted to use our implementation).
I do like being able to know which type of VM I'm dealing with by looking at the third octet, but 255 feels quite small for a "no-one's ever going to need more than X SREs" statement. With the current /21 we are limited to ~8,192 SREs (though, as for the 255 in the /16 case, some of these are reserved for the SHM VNets).
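The trade-off is just powers of two; a quick sketch (the "minus SHM reservations" caveat applies to both figures):

```python
# SREs per SHM if each SRE gets one VNet of the given prefix length,
# carved out of the 10.0.0.0/8 private range.
def sres_per_shm(sre_prefix: int, private_prefix: int = 8) -> int:
    return 2 ** (sre_prefix - private_prefix)

print(sres_per_shm(16))  # 256  -> ~255 once the SHM VNets are subtracted
print(sres_per_shm(21))  # 8192 -> ~8,192, again minus the SHM reservations
```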
Using e.g. 10.10.0.0/21 for an SRE would certainly be possible, but we might need to rethink the SRE subnet strategy a bit. The current one was designed so that if/when we add new subnets, they are backwards compatible because they use IP ranges that are currently unallocated. Should we try to shrink the subnet sizes to keep this feature, or should we say that it's not that important in practice?
255 SREs per SHM does sound a bit small, considering the scenario where an organisation may want to control access with a single SHM and require one SRE per work package.
It does feel more likely that you would want more than 255 SREs than that you would need more than 65,536 IPs within a single SRE.
One advantage of the proposal above is that the octets are meaningful and easy to inspect; 10.1.x.x is part of the VPN subnet, 10.2.x.x is part of the tier2 mirror subnet and so on. I expect we would necessarily lose some of these characteristic IPs.
Alternatively, move everything to IPv6 (probably not a serious suggestion).
I've updated the top-level proposal so that the SRE is based on a 10.10.0.0/21 VNet. @JimMadge / @martintoreilly - are you happy with that? We could reduce this further (e.g. to 10.10.0.0/22, giving 252 * 64 = 16128 SREs per SHM).
Both of those sound reasonable to me. Do we foresee any risk of making the SRE subnets too small? With webapps and possibly extra DSVMs for GPU, high memory etc. how many IPs would we expect to need in total? >255?
At the moment I think we actually use <16 IP addresses per SRE. Using /21 gives us ~2000 (depending on exactly how we split these into subnets, as you lose 5 addresses each time). We'd need a lot of clusters (or something similar) to get us close to even 255, let alone 2000.
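That ~2000 figure checks out directly: a /21 holds 2048 addresses and each subnet carved from it costs 5. For example, splitting the proposed /21 into eight /24s (one possible split, not necessarily the proposed one):

```python
from ipaddress import ip_network

sre = ip_network("10.10.0.0/21")
subnets = list(sre.subnets(new_prefix=24))      # split into eight /24s
usable = sum(s.num_addresses - 5 for s in subnets)  # Azure reserves 5 per subnet
print(len(subnets), usable)                     # 8 subnets, 2008 usable (~2000)
```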
Some comments from @martintoreilly on #877:
@jemrobinson is this work (or as much as we are interested in doing now) done? Shall we close the issue?
It's not done, and I'm still interested in this for a future iteration.
This is done in Pulumi and will not be implemented in PowerShell.
:white_check_mark: Checklist
:strawberry: Suggested change
Better separation between networks in the SHM and SRE.
:steam_locomotive: How could this be done?
Draft proposal for network settings for SHM and SRE. This top-level comment should be edited to reflect any discussion in the issue below.
SHM: 10.0.0.0/24 [=> 10.0.0.0 - 10.0.0.255]

| Subnet | Available IPs | Name |
| --- | --- | --- |
| 10.0.0.0/26 | 59 | |
| 10.0.0.64/26 | 59 | |
| 10.0.0.128/27 | 27 | <SHM ID>_SUBNET_MONITORING |
| 10.0.0.160/27 | 27 | <SHM ID>_SUBNET_UPDATE_SERVERS |
| 10.0.0.192/27 | 27 | <SHM ID>_SUBNET_CONTROL |
| 10.0.0.224/27 | 27 | <SHM ID>_SUBNET_POLICY |

VPN clients: 10.0.1.0/24 [=> 10.0.1.0 - 10.0.1.255] NB. this cannot overlap with other VNets

Package repositories: 10.0.2.0/24 [=> 10.0.2.0 - 10.0.2.255] and 10.0.3.0/24 [=> 10.0.3.0 - 10.0.3.255]

| Subnet | Available IPs | Name |
| --- | --- | --- |
| 10.0.2.0/25 | 123 | <SHM ID>_INTERNAL_REPOSITORIES_TIER2 |
| 10.0.2.128/25 | 123 | <SHM ID>_EXTERNAL_REPOSITORIES_TIER2 |
| 10.0.3.0/25 | 123 | <SHM ID>_INTERNAL_REPOSITORIES_TIER3 |
| 10.0.3.128/25 | 123 | <SHM ID>_EXTERNAL_REPOSITORIES_TIER3 |

SRE: one /21 per SRE, i.e. 10.1.0.0/21 [=> 10.1.0.0 - 10.1.7.255] for the first SRE, 10.1.8.0/21 [=> 10.1.8.0 - 10.1.15.255] for the second, and so on. Within the first SRE:

| Subnet | Available IPs | Name |
| --- | --- | --- |
| 10.1.0.0/24 | 251 | <SRE ID>_DEPLOYMENT |
| 10.1.1.0/25 | 123 | <SRE ID>_REMOTE_DESKTOP_GATEWAY |
| 10.1.1.128/25 | 123 | <SRE ID>_REMOTE_DESKTOP_AUXILIARY |
| 10.1.2.0/24 | 251 | |
| 10.1.3.0/24 | 251 | <SRE ID>_DATABASES |
| 10.1.4.0/24 | 251 | <SRE ID>_USER_SERVICES |
| 10.1.5.0/24 | 251 | <SRE ID>_USER_DESKTOP |

Note that this scheme gives room to expand the SRE ranges further into 10.1.6.* -- 10.1.7.* as required. For example, an HPC cluster could be added at 10.1.6.0/24, or a review subnet could be incorporated.
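Since backwards compatibility of future subnets came up above, the proposed layout is easy to sanity-check for containment and overlaps (plain Python ipaddress, using the first SRE's figures from the table):

```python
from ipaddress import ip_network
from itertools import combinations

vnet = ip_network("10.1.0.0/21")
subnets = [ip_network(c) for c in (
    "10.1.0.0/24", "10.1.1.0/25", "10.1.1.128/25",
    "10.1.2.0/24", "10.1.3.0/24", "10.1.4.0/24", "10.1.5.0/24",
)]

# Every subnet fits inside the SRE VNet, and no two subnets overlap.
assert all(s.subnet_of(vnet) for s in subnets)
assert not any(a.overlaps(b) for a, b in combinations(subnets, 2))
print("scheme is consistent")
```

Running the same check after adding any future subnet (e.g. one in 10.1.6.* -- 10.1.7.*) would confirm it stays compatible with the existing allocations.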