alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Update network settings #586

Closed jemrobinson closed 1 year ago

jemrobinson commented 4 years ago

:white_check_mark: Checklist

:strawberry: Suggested change

Better separation between networks in the SHM and SRE.

:steam_locomotive: How could this be done?

Draft proposal for network settings for SHM and SRE. This top-level comment should be edited to reflect any discussion in the issue below.

SHM Main: 10.0.0.0/24 [=> 10.0.0.0 - 10.0.0.255]

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| GatewaySubnet | 10.0.0.0/26 (59 available IPs) | (should probably be locking down inbound access) | SHM VPN |
| AzureFirewallSubnet | 10.0.0.64/26 (59 available IPs) | "Subnet level NSGs aren't required on the AzureFirewallSubnet" | SHM Firewall (NB. must be at least /26) |
| MonitoringSubnet | 10.0.0.128/27 (27 available IPs) | NSG_SHM ID_SUBNET_MONITORING | Automation and logging |
| UpdateServersSubnet | 10.0.0.160/27 (27 available IPs) | NSG_SHM ID_SUBNET_UPDATE_SERVERS | Linux/Windows update servers |
| ControlSubnet | 10.0.0.192/27 (27 available IPs) | NSG_SHM ID_SUBNET_CONTROL | DCs |
| PolicySubnet | 10.0.0.224/27 (27 available IPs) | NSG_SHM ID_SUBNET_POLICY | NPS |

VPN clients: 10.0.1.0/24 [=> 10.0.1.0 - 10.0.1.255] NB. this cannot overlap with other VNets
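
The "available IPs" counts in these tables come from Azure reserving five addresses in every subnet (the network and broadcast addresses plus three reserved by Azure services). A quick sketch of the arithmetic:

```python
from ipaddress import ip_network

AZURE_RESERVED = 5  # network + broadcast + 3 addresses reserved by Azure


def usable_ips(cidr: str) -> int:
    """Usable addresses in an Azure subnet with the given CIDR."""
    return ip_network(cidr).num_addresses - AZURE_RESERVED


for cidr in ["10.0.0.0/26", "10.0.0.128/27", "10.0.2.0/25", "10.1.0.0/24"]:
    print(f"{cidr}: {usable_ips(cidr)} available IPs")
```

This reproduces the figures above: 59 for a /26, 27 for a /27, 123 for a /25 and 251 for a /24.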

SHM Tier-2 mirrors: 10.0.2.0/24 [=> 10.0.2.0 - 10.0.2.255]

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| InternalRepositoriesTier2Subnet | 10.0.2.0/25 (123 available IPs) | NSG_SHM ID_INTERNAL_REPOSITORIES_TIER2 | Internal user-accessible package repositories |
| ExternalRepositoriesTier2Subnet | 10.0.2.128/25 (123 available IPs) | NSG_SHM ID_EXTERNAL_REPOSITORIES_TIER2 | External non-accessible package repositories |

SHM Tier-3 mirrors: 10.0.3.0/24 [=> 10.0.3.0 - 10.0.3.255]

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| InternalRepositoriesTier3Subnet | 10.0.3.0/25 (123 available IPs) | NSG_SHM ID_INTERNAL_REPOSITORIES_TIER3 | Internal user-accessible package repositories |
| ExternalRepositoriesTier3Subnet | 10.0.3.128/25 (123 available IPs) | NSG_SHM ID_EXTERNAL_REPOSITORIES_TIER3 | External non-accessible package repositories |

SRE:

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| DeploymentSubnet | 10.1.0.0/24 (251 available IPs) | NSG_SRE ID_DEPLOYMENT | VM deployment |
| RemoteDesktopGatewaySubnet | 10.1.1.0/25 (123 available IPs) | NSG_SRE ID_REMOTE_DESKTOP_GATEWAY | Remote desktop gateway |
| RemoteDesktopAuxiliarySubnet | 10.1.1.128/25 (123 available IPs) | NSG_SRE ID_REMOTE_DESKTOP_AUXILIARY | Remote desktop auxiliary servers |
| PrivateDataSubnet | 10.1.2.0/24 (251 available IPs) | | Private data endpoints |
| DatabasesSubnet | 10.1.3.0/24 (251 available IPs) | NSG_SRE ID_DATABASES | Databases |
| UserServicesSubnet | 10.1.4.0/24 (251 available IPs) | NSG_SRE ID_USER_SERVICES | CoCalc, GitLab, HackMD |
| UserSRDSubnet | 10.1.5.0/24 (251 available IPs) | NSG_SRE ID_USER_DESKTOP | SRDs |

Note that this scheme leaves room to expand the SRE further into 10.1.6.* -- 10.1.7.* as required. For example, an HPC cluster could be added at 10.1.6.0/24, or a review subnet could be incorporated.
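
As a consistency check on the layout above, the proposed subnets can be verified to be non-overlapping and to leave two /24 blocks free for expansion. This sketch assumes the SRE subnets live inside a 10.1.0.0/21 parent VNet (the parent range is an assumption here):

```python
from ipaddress import ip_network

# Proposed SRE subnets from the table above
subnets = [
    ip_network("10.1.0.0/24"),    # DeploymentSubnet
    ip_network("10.1.1.0/25"),    # RemoteDesktopGatewaySubnet
    ip_network("10.1.1.128/25"),  # RemoteDesktopAuxiliarySubnet
    ip_network("10.1.2.0/24"),    # PrivateDataSubnet
    ip_network("10.1.3.0/24"),    # DatabasesSubnet
    ip_network("10.1.4.0/24"),    # UserServicesSubnet
    ip_network("10.1.5.0/24"),    # UserSRDSubnet
]
vnet = ip_network("10.1.0.0/21")  # assumed SRE VNet (10.1.0.0 - 10.1.7.255)

# No two subnets may overlap
assert not any(a.overlaps(b) for i, a in enumerate(subnets) for b in subnets[i + 1:])
# Every subnet must sit inside the VNet
assert all(s.subnet_of(vnet) for s in subnets)

# Unallocated /24 blocks left for future expansion (e.g. an HPC cluster)
allocated = {s.supernet(new_prefix=24) if s.prefixlen > 24 else s for s in subnets}
free = [n for n in vnet.subnets(new_prefix=24) if n not in allocated]
print([str(n) for n in free])  # 10.1.6.0/24 and 10.1.7.0/24 remain free
```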

ens-george-holmes commented 4 years ago

We're in agreement here on the general approach for carving up the subnets.

I'll work up some detail as part of #589 and then loop back here.

ens-george-holmes commented 4 years ago

I've updated the main comment to make the subnets more specific to a particular type of service/infrastructure.

Attaching an NSG to each subnet will make it easy to manage inter-subnet and inter-VPN security, as the rules will be written for the subnet's address prefix.

Likewise, this approach makes the perimeter firewall rules relatively easy to write, because everything is at the subnet level.
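
As an illustrative sketch (not the actual NSG implementation), the point about rules being written against subnet address prefixes can be shown with a toy first-match evaluator; the rules here are hypothetical examples, not rules from the repository:

```python
from ipaddress import ip_address, ip_network

# Hypothetical subnet-level rules: (source prefix, destination prefix, action)
rules = [
    (ip_network("10.1.5.0/24"), ip_network("10.1.4.0/24"), "Allow"),  # SRDs -> user services
    (ip_network("10.0.0.0/8"),  ip_network("10.1.2.0/24"), "Deny"),   # block private data by default
]


def evaluate(src: str, dst: str) -> str:
    """Return the action of the first rule matching both prefixes (deny by default)."""
    for src_net, dst_net, action in rules:
        if ip_address(src) in src_net and ip_address(dst) in dst_net:
            return action
    return "Deny"


print(evaluate("10.1.5.10", "10.1.4.20"))  # Allow
print(evaluate("10.1.3.10", "10.1.2.20"))  # Deny
```

Because every rule references a whole prefix, moving a VM within its subnet never requires a rule change.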

From a security viewpoint, isolating ETL infrastructure for production (MS SQL) databases is a good step. ETL can talk to wherever the source data comes from and to the prod database subnet, but the prod database subnet can't talk directly to the infrastructure where the source data is sitting.

The artefacts I've committed to the uhb-deployment branch support rolling up a subnet and an NSG into a single composition. ARM incremental updates can then be leveraged to change the NSG rules.

ens-brett-todd commented 4 years ago

Please avoid using a subnet named "GatewaySubnet" for anything other than VPN or ER Gateway use (it is reserved by Azure for this purpose).

jemrobinson commented 4 years ago

@martintoreilly: see my updated proposal in the top-level comment. One thing that we'll have to consider is how to deal with deployment into a VNet that already has an NSG attached. Possibilities are:

  1. open up a temporary hole in the NSG allowing outbound access from and disabling VNet access from (possibly with another explicit exemption for the SHM DC or NPS server as appropriate)
  2. deploy into a separate VNet with a more permissive NSG. However, this will still need to be peered to the SHM VNet for domain joining, and changing the IP address post-deployment will break the domain join (plus potentially some other things) anyway
  3. something else?

I think that (1) is the best/easiest thing to do, but would like to know what you think.

martintoreilly commented 4 years ago

> @martintoreilly: see my updated proposal in the top-level comment.

Looks good in general. I'd tend to make the subnets smaller where we know there will only be a handful of VMs (e.g. /27s rather than /25s for things like the RDS gateway, SHM DCs, NPS etc.).

I'd also suggest we separate the web app servers into their own subnet so we can restrict HTTP/S access from everywhere except the admin VPN subnet. Thinking about locking down the GitLab review and internal servers, I'm wondering if the HackMD server should be on a different subnet from the internal GitLab servers. I suspect we may also want to separate the next VM we put in an airlock from the GitLab review VM.

Edit: though I guess we could have a basic set of rules that apply to all VMs in the web app subnet (e.g. no outbound, HTTP/S only from the user subnet, SSH + RDS from the admin VPN subnet), then specific exemptions for each VM that needs them (e.g. HTTP/S and SSH access from GitLab review for GitLab internal; SSH from the user subnet for GitLab internal; RDS from the researcher app RDS SH for GitLab internal and HackMD; RDS from the reviewer app RDS SH for GitLab review, etc.)

> One thing that we'll have to consider is how to deal with deployment into a VNet that already has an NSG attached. Possibilities are:
>
>   1. open up a temporary hole in the NSG allowing outbound access from and disabling VNet access from (possibly with another explicit exemption for the SHM DC or NPS server as appropriate)
>   2. deploy into a separate VNet with a more permissive NSG. However, this will still need to be peered to the SHM VNet for domain joining, and changing the IP address post-deployment will break the domain join (plus potentially some other things) anyway
>   3. something else?
>
> I think that (1) is the best/easiest thing to do, but would like to know what you think.

(1) sounds ok to me as long as we get the sequencing right. Something like?

jemrobinson commented 4 years ago

@martintoreilly The main justifications behind my choices of CIDRs were:

If you want to suggest any specific changes to this scheme that's fine with me.

martintoreilly commented 4 years ago

@jemrobinson @JimMadge Prompted by conversations on the "SRE index" work I did in PR #786, are we happy with limiting ourselves to a maximum of ~255 SREs per SHM by moving to an SRE virtual network CIDR as large as a /16? While we're not hitting that limit here, we might if we choose to offer a "Safe Haven as a service" (or if someone else did, or if a larger organisation wanted to use our implementation).

I do like being able to know which type of VM I'm dealing with by looking at the third octet, but 255 feels quite small for a "no-one's ever going to need more than X SREs" statement. With the current /21 we are limited to ~8,192 SREs (though, as for the 255 in the /16 case, some of these are reserved for the SHM vnets).
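
The limits being compared here follow directly from the prefix lengths; a sketch, assuming SRE VNets are carved out of the 10.0.0.0/8 private range (an assumption based on the addresses used in this thread):

```python
from ipaddress import ip_network

private_space = ip_network("10.0.0.0/8")  # assumed overall address space


def max_sres(sre_prefix: int) -> int:
    """How many SRE VNets of the given prefix length fit inside 10.0.0.0/8."""
    return 2 ** (sre_prefix - private_space.prefixlen)


print(max_sres(16))  # 256  -> ~255 once SHM vnets are reserved
print(max_sres(21))  # 8192 -> ~8,192, again minus SHM reservations
```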

jemrobinson commented 4 years ago

Using e.g. 10.10.0.0/21 for an SRE would certainly be possible, but we might need to rethink the SRE subnet strategy a bit. The current one was designed so that if/when we add new subnets, they are backwards compatible because they use IP ranges that are currently unallocated. Should we try to shrink the subnet sizes to keep this property, or should we say that it's not that important in practice?

JimMadge commented 4 years ago

255 SREs per SHM does sound a bit small, considering the scenario where an organisation may want to control access with a single SHM and requires one SRE per work package.

It does feel more likely that you would want more than 255 SREs than you would need more than 65,536 IPs within a single SRE.

One advantage of the proposal above is that the octets are meaningful and easy to inspect; 10.1.x.x is part of the VPN subnet, 10.2.x.x is part of the tier-2 mirror subnet and so on. I expect we would necessarily lose some of these characteristic IPs.

Alternatively, move everything to IPv6 (probably not a serious suggestion).

jemrobinson commented 4 years ago

I've updated the top-level proposal so that the SRE is based on a 10.10.0.0/21 VNet. @JimMadge / @martintoreilly - are you happy with that? We could reduce this further (e.g. to 10.10.0.0/22, giving 252 * 64 = 16128 SREs per SHM).

JimMadge commented 4 years ago

Both of those sound reasonable to me. Do we foresee any risk of making the SRE subnets too small? With webapps and possibly extra DSVMs for GPU, high memory etc. how many IPs would we expect to need in total? >255?

jemrobinson commented 4 years ago

At the moment I think we actually use <16 IP addresses per SRE. Using /21 gives us ~2000 (depends on exactly how we split these into subnets as you lose 5 addresses each time). We'd need a lot of clusters (or something similar) to get us close to even 255 let alone 2000.
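
The ~2000 figure follows from the subnet arithmetic; a sketch, assuming the /21 is split entirely into /24 subnets and that Azure reserves 5 addresses per subnet:

```python
from ipaddress import ip_network

AZURE_RESERVED = 5  # addresses Azure reserves in every subnet

sre_vnet = ip_network("10.10.0.0/21")  # 2048 addresses in total

# If the /21 were split entirely into /24 subnets:
subnets = list(sre_vnet.subnets(new_prefix=24))
usable = sum(s.num_addresses - AZURE_RESERVED for s in subnets)
print(len(subnets), usable)  # 8 subnets, 2008 usable addresses (~2000)
```

A coarser split (fewer, larger subnets) loses fewer reserved addresses, but the total stays close to 2000 either way.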

jemrobinson commented 3 years ago

Some comments from @martintoreilly on #877:

Compute VMs

RDS Gateway

RDS Session hosts

General

JimMadge commented 2 years ago

@jemrobinson is this work (or as much of it as we are interested in doing now) done? Shall we close the issue?

jemrobinson commented 2 years ago

It's not done, and I'm still interested in this for a future iteration.

jemrobinson commented 1 year ago

This is done in Pulumi and will not be implemented in PowerShell.