alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Update network settings #586

Closed jemrobinson closed 1 year ago

jemrobinson commented 4 years ago

:white_check_mark: Checklist

:strawberry: Suggested change

Better separation between networks in the SHM and SRE.

:steam_locomotive: How could this be done?

Draft proposal for network settings for SHM and SRE. This top-level comment should be edited to reflect any discussion in the issue below.

SHM Main: 10.0.0.0/24 [=> 10.0.0.0 - 10.0.0.255]

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| GatewaySubnet | 10.0.0.0/26 (59 available IPs) | (should probably be locking down inbound access) | SHM VPN |
| AzureFirewallSubnet | 10.0.0.64/26 (59 available IPs) | "Subnet level NSGs aren't required on the AzureFirewallSubnet" | SHM Firewall (NB. must be at least /26) |
| MonitoringSubnet | 10.0.0.128/27 (27 available IPs) | NSG_SHM ID_SUBNET_MONITORING | Automation and logging |
| UpdateServersSubnet | 10.0.0.160/27 (27 available IPs) | NSG_SHM ID_SUBNET_UPDATE_SERVERS | Linux/Windows update servers |
| ControlSubnet | 10.0.0.192/27 (27 available IPs) | NSG_SHM ID_SUBNET_CONTROL | DCs |
| PolicySubnet | 10.0.0.224/27 (27 available IPs) | NSG_SHM ID_SUBNET_POLICY | NPS |

VPN clients: 10.0.1.0/24 [=> 10.0.1.0 - 10.0.1.255] NB. this cannot overlap with other VNets
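
The "available IPs" counts in these tables come from Azure reserving five addresses in every subnet (the network and broadcast addresses plus three reserved by Azure services). A quick sketch of the arithmetic:

```python
from ipaddress import ip_network

AZURE_RESERVED = 5  # network + broadcast + 3 addresses reserved by Azure


def usable_ips(cidr: str) -> int:
    """Usable addresses in an Azure subnet with the given CIDR."""
    return ip_network(cidr).num_addresses - AZURE_RESERVED


for cidr in ["10.0.0.0/26", "10.0.0.128/27", "10.0.2.0/25", "10.1.0.0/24"]:
    print(f"{cidr}: {usable_ips(cidr)} available IPs")
```

This reproduces the figures above: 59 for a /26, 27 for a /27, 123 for a /25 and 251 for a /24.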

SHM Tier-2 mirrors: 10.0.2.0/24 [=> 10.0.2.0 - 10.0.2.255]

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| InternalRepositoriesTier2Subnet | 10.0.2.0/25 (123 available IPs) | NSG_SHM ID_INTERNAL_REPOSITORIES_TIER2 | Internal user-accessible package repositories |
| ExternalRepositoriesTier2Subnet | 10.0.2.128/25 (123 available IPs) | NSG_SHM ID_EXTERNAL_REPOSITORIES_TIER2 | External non-accessible package repositories |

SHM Tier-3 mirrors: 10.0.3.0/24 [=> 10.0.3.0 - 10.0.3.255]

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| InternalRepositoriesTier3Subnet | 10.0.3.0/25 (123 available IPs) | NSG_SHM ID_INTERNAL_REPOSITORIES_TIER3 | Internal user-accessible package repositories |
| ExternalRepositoriesTier3Subnet | 10.0.3.128/25 (123 available IPs) | NSG_SHM ID_EXTERNAL_REPOSITORIES_TIER3 | External non-accessible package repositories |

SRE:

| Name | Example CIDR | Attached NSG | Usage |
| --- | --- | --- | --- |
| DeploymentSubnet | 10.1.0.0/24 (251 available IPs) | NSG_SRE ID_DEPLOYMENT | VM deployment |
| RemoteDesktopGatewaySubnet | 10.1.1.0/25 (123 available IPs) | NSG_SRE ID_REMOTE_DESKTOP_GATEWAY | Remote desktop gateway |
| RemoteDesktopAuxiliarySubnet | 10.1.1.128/25 (123 available IPs) | NSG_SRE ID_REMOTE_DESKTOP_AUXILIARY | Remote desktop auxiliary servers |
| PrivateDataSubnet | 10.1.2.0/24 (251 available IPs) | | Private data endpoints |
| DatabasesSubnet | 10.1.3.0/24 (251 available IPs) | NSG_SRE ID_DATABASES | Databases |
| UserServicesSubnet | 10.1.4.0/24 (251 available IPs) | NSG_SRE ID_USER_SERVICES | CoCalc, GitLab, HackMD |
| UserSRDSubnet | 10.1.5.0/24 (251 available IPs) | NSG_SRE ID_USER_DESKTOP | SRDs |

Note that this scheme leaves room to expand the SRE further into 10.1.6.* -- 10.1.7.* as required. For example, an HPC cluster could be added at 10.1.6.0/24, or a review subnet could be incorporated.
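
As a consistency check on the layout above, the proposed subnets can be verified to be non-overlapping and to leave two /24 blocks free for expansion. This sketch assumes the SRE subnets live inside a 10.1.0.0/21 parent VNet (the parent range is an assumption here):

```python
from ipaddress import ip_network

# Proposed SRE subnets from the table above
subnets = [
    ip_network("10.1.0.0/24"),    # DeploymentSubnet
    ip_network("10.1.1.0/25"),    # RemoteDesktopGatewaySubnet
    ip_network("10.1.1.128/25"),  # RemoteDesktopAuxiliarySubnet
    ip_network("10.1.2.0/24"),    # PrivateDataSubnet
    ip_network("10.1.3.0/24"),    # DatabasesSubnet
    ip_network("10.1.4.0/24"),    # UserServicesSubnet
    ip_network("10.1.5.0/24"),    # UserSRDSubnet
]
vnet = ip_network("10.1.0.0/21")  # assumed SRE VNet (10.1.0.0 - 10.1.7.255)

# No two subnets may overlap
assert not any(a.overlaps(b) for i, a in enumerate(subnets) for b in subnets[i + 1:])
# Every subnet must sit inside the VNet
assert all(s.subnet_of(vnet) for s in subnets)

# Unallocated /24 blocks left for future expansion (e.g. an HPC cluster)
allocated = {s.supernet(new_prefix=24) if s.prefixlen > 24 else s for s in subnets}
free = [n for n in vnet.subnets(new_prefix=24) if n not in allocated]
print([str(n) for n in free])  # 10.1.6.0/24 and 10.1.7.0/24 remain free
```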

ens-george-holmes commented 4 years ago

We're in agreement here on the general approach for carving up the subnets.

I'll work up some detail as part of #589 and then loop back here.

ens-george-holmes commented 4 years ago

I've updated the main comment to make the subnets more specific to a particular type of service/infrastructure.

Attaching an NSG to each subnet will make it easy to manage inter-subnet and inter-VPN security, as the rules will be written for the subnet's address prefix.

Likewise, this approach makes the perimeter firewall rules relatively easy to write, because everything is at the subnet level.
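
As an illustrative sketch (not the actual NSG implementation), the point about rules being written against subnet address prefixes can be shown with a toy first-match evaluator; the rules here are hypothetical examples, not rules from the repository:

```python
from ipaddress import ip_address, ip_network

# Hypothetical subnet-level rules: (source prefix, destination prefix, action)
rules = [
    (ip_network("10.1.5.0/24"), ip_network("10.1.4.0/24"), "Allow"),  # SRDs -> user services
    (ip_network("10.0.0.0/8"),  ip_network("10.1.2.0/24"), "Deny"),   # block private data by default
]


def evaluate(src: str, dst: str) -> str:
    """Return the action of the first rule matching both prefixes (deny by default)."""
    for src_net, dst_net, action in rules:
        if ip_address(src) in src_net and ip_address(dst) in dst_net:
            return action
    return "Deny"


print(evaluate("10.1.5.10", "10.1.4.20"))  # Allow
print(evaluate("10.1.3.10", "10.1.2.20"))  # Deny
```

Because every rule references a whole prefix, moving a VM within its subnet never requires a rule change.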

From a security viewpoint, isolating ETL infrastructure for production (MS SQL) databases is a good step. ETL can talk to wherever the source data comes from and to the prod database subnet, but the prod database subnet can't talk directly to the infrastructure where the source data is sitting.

The artefacts I've committed to the uhb-deployment branch support rolling up a subnet and an NSG into a single composition. ARM incremental updates can then be leveraged to change the NSG rules.

ens-brett-todd commented 4 years ago

Please avoid using a subnet named "GatewaySubnet" for anything other than VPN or ER Gateway use (it is reserved by Azure for this purpose).

jemrobinson commented 4 years ago

@martintoreilly: see my updated proposal in the top-level comment. One thing that we'll have to consider is how to deal with deployment into a VNet that already has an NSG attached. Possibilities are:

  1. open up a temporary hole in the NSG allowing outbound access from and disabling VNet access from (possibly with another explicit exemption for the SHM DC or NPS server as appropriate)
  2. deploy into a separate VNet with a more permissive NSG. However, this will still need to be peered to the SHM VNet for domain joining, and changing the IP address post-deployment will break the domain join (plus potentially some other things) anyway
  3. something else?

I think that (1) is the best/easiest thing to do, but would like to know what you think.

martintoreilly commented 4 years ago

> @martintoreilly: see my updated proposal in the top-level comment.

Looks good in general. I'd tend to make the subnets smaller where we know there will only be a handful of VMs (e.g. /27s rather than /25s for things like the RDS gateway, SHM DCs, NPS etc.).

I'd also suggest we separate the web app servers into their own subnet so we can restrict HTTP/S access from everywhere except the admin VPN subnet. Thinking about locking down the GitLab review and internal servers, I'm wondering if the HackMD server should be on a different subnet from the internal GitLab servers. I suspect we may also want to separate the next VM we put in an airlock from the GitLab review VM.

Edit: though I guess we could have a basic set of rules that apply to all VMs in the web app subnet (e.g. no outbound, HTTP/S only from the user subnet, SSH + RDS from the admin VPN subnet), then specific exemptions for each VM that needs them (e.g. HTTP/S and SSH access from GitLab review for GitLab internal; SSH from the user subnet for GitLab internal; RDS from the researcher app RDS SH for GitLab internal and HackMD; RDS from the reviewer app RDS SH for GitLab review, etc.)

> One thing that we'll have to consider is how to deal with deployment into a VNet that already has an NSG attached. Possibilities are:
>
>   1. open up a temporary hole in the NSG allowing outbound access from and disabling VNet access from (possibly with another explicit exemption for the SHM DC or NPS server as appropriate)
>   2. deploy into a separate VNet with a more permissive NSG. However, this will still need to be peered to the SHM VNet for domain joining, and changing the IP address post-deployment will break the domain join (plus potentially some other things) anyway
>   3. something else?
>
> I think that (1) is the best/easiest thing to do, but would like to know what you think.

(1) sounds ok to me as long as we get the sequencing right. Something like?

jemrobinson commented 4 years ago

@martintoreilly The main justifications behind my choices of CIDRs were:

If you want to suggest any specific changes to this scheme that's fine with me.

martintoreilly commented 4 years ago

@jemrobinson @JimMadge Prompted by conversations on the "SRE index" work I did in PR #786, are we happy with limiting ourselves to a maximum of ~255 SREs per SHM by moving to an SRE virtual network CIDR as large as a /16? While we're not hitting that limit here, we might if we choose to offer a "Safe Haven as a service" (or if someone else did, or if a larger organisation wanted to use our implementation).

I do like being able to know which type of VM I'm dealing with by looking at the third octet, but 255 feels quite small for a "no-one's ever going to need more than X SREs" statement. With the current /21 we are limited to ~8,192 SREs (though, as for the 255 in the /16 case, some of these are reserved for the SHM vnets).
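
The limits being compared here follow directly from the prefix lengths; a sketch, assuming SRE VNets are carved out of the 10.0.0.0/8 private range (an assumption based on the addresses used in this thread):

```python
from ipaddress import ip_network

private_space = ip_network("10.0.0.0/8")  # assumed overall address space


def max_sres(sre_prefix: int) -> int:
    """How many SRE VNets of the given prefix length fit inside 10.0.0.0/8."""
    return 2 ** (sre_prefix - private_space.prefixlen)


print(max_sres(16))  # 256  -> ~255 once SHM vnets are reserved
print(max_sres(21))  # 8192 -> ~8,192, again minus SHM reservations
```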

jemrobinson commented 4 years ago

Using e.g. 10.10.0.0/21 for an SRE would certainly be possible, but we might need to rethink the SRE subnet strategy a bit. The current one was designed so that if/when we add new subnets, they are backwards compatible because they use IP ranges that are currently unallocated. Should we try to shrink the subnet sizes to keep this property, or should we say that it's not that important in practice?

JimMadge commented 4 years ago

255 SREs per SHM does sound a bit small, considering the scenario where an organisation may want to control access with a single SHM and requires one SRE per work package.

It does feel more likely that you would want more than 255 SREs than you would need more than 65,536 IPs within a single SRE.

One advantage of the proposal above is that the octets are meaningful and easy to inspect; 10.1.x.x is part of the VPN subnet, 10.2.x.x is part of the tier-2 mirror subnet and so on. I expect we would necessarily lose some of these characteristic IPs.

Alternatively, move everything to IPv6 (probably not a serious suggestion).

jemrobinson commented 4 years ago

I've updated the top-level proposal so that the SRE is based on a 10.10.0.0/21 VNet. @JimMadge / @martintoreilly - are you happy with that? We could reduce this further (e.g. to 10.10.0.0/22, giving 252 * 64 = 16128 SREs per SHM).

JimMadge commented 4 years ago

Both of those sound reasonable to me. Do we foresee any risk of making the SRE subnets too small? With webapps and possibly extra DSVMs for GPU, high memory etc. how many IPs would we expect to need in total? >255?

jemrobinson commented 4 years ago

At the moment I think we actually use <16 IP addresses per SRE. Using /21 gives us ~2000 (depends on exactly how we split these into subnets as you lose 5 addresses each time). We'd need a lot of clusters (or something similar) to get us close to even 255 let alone 2000.
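
The ~2000 figure follows from the subnet arithmetic; a sketch, assuming the /21 is split entirely into /24 subnets and that Azure reserves 5 addresses per subnet:

```python
from ipaddress import ip_network

AZURE_RESERVED = 5  # addresses Azure reserves in every subnet

sre_vnet = ip_network("10.10.0.0/21")  # 2048 addresses in total

# If the /21 were split entirely into /24 subnets:
subnets = list(sre_vnet.subnets(new_prefix=24))
usable = sum(s.num_addresses - AZURE_RESERVED for s in subnets)
print(len(subnets), usable)  # 8 subnets, 2008 usable addresses (~2000)
```

A coarser split (fewer, larger subnets) loses fewer reserved addresses, but the total stays close to 2000 either way.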

jemrobinson commented 3 years ago

Some comments from @martintoreilly on #877:

Compute VMs

RDS Gateway

RDS Session hosts

General

JimMadge commented 2 years ago

@jemrobinson is this work (or as much of it as we are interested in doing now) done? Shall we close the issue?

jemrobinson commented 2 years ago

It's not done, and I'm still interested in this for a future iteration.

jemrobinson commented 1 year ago

This is done in Pulumi and will not be implemented in PowerShell.