james-c closed this issue 4 years ago
From the Azure CLI, the command to run remote scripts on a VM is `az vm run-command`
For VNET peering look at this ARM template
For running scripts as part of the deployment of a VM, look at custom script extensions (Windows) or cloud-init
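As a sketch of the remote-execution route, the following Python helper assembles an `az vm run-command invoke` call; the resource group and VM names are hypothetical placeholders, and the actual invocation is left commented out since it needs an authenticated `az` session:

```python
import subprocess  # used for the real invocation, commented out below

def build_run_command(resource_group, vm_name, script_lines):
    """Assemble argv for `az vm run-command invoke` (RunShellScript)."""
    return ["az", "vm", "run-command", "invoke",
            "--resource-group", resource_group,
            "--name", vm_name,
            "--command-id", "RunShellScript",
            "--scripts"] + list(script_lines)

if __name__ == "__main__":
    # Placeholder names; replace with the real DSG resource group / VM.
    argv = build_run_command("RG_DSG_TEST", "DSG-VM1", ["sudo apt-get update"])
    print(" ".join(argv))
    # Requires `az login` and a real VM, so not run here:
    # subprocess.run(argv, check=True)
```

Keeping the argument assembly in a pure function means it can be exercised without any Azure access.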
Where some steps take a long time (e.g. installing software), we should consider splitting the build and deploy stages as we do for the Linux Compute VMs, building an image and storing it in an image Gallery.
Regarding internet access during setup and locking down the environment afterwards, look at what we do when deploying the Linux compute VMs. I think we programmatically rebind the VM from one NSG to the locked-down one.
In general, I think the ideal deployment model for all VMs (including the compute ones) would be:
@james-c I'm thinking the big picture for the end goal of this automation is to make sure that everything used in the DSG deployment relies only on scripts in source control, i.e. have all the scripts that run locally on deployed VMs in source control, then push them to the VMs on deployment and run them remotely with cloud-init / custom script extensions (or SCP + `az vm run-command` if necessary).
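To illustrate the cloud-init route, a minimal cloud-config like the one below could fetch and run setup scripts from source control at first boot; the repository URL and script name are hypothetical placeholders:

```yaml
#cloud-config
# Hypothetical example: install prerequisites, then pull and run the
# VM's setup script from source control at first boot.
package_update: true
packages:
  - git
runcmd:
  - git clone https://github.com/example/safe-haven-scripts /opt/setup
  - bash /opt/setup/bootstrap.sh
```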
@RobC-CTL Is there anything sensitive in the `CreateADPDC.zip` folder in the RG_DSG_ARTIFACTS -> dsgxartifacts -> Blobs -> dsc storage container? I'd like to move it into source control in this repo (which will eventually be public).
@martintoreilly Nothing sensitive
@RobC-CTL Just checking that the DSG DC, RDS and Dataserver zip files in the `Scripts` folder of the RG_DSG_ARTIFACTS -> dsgxartifacts -> configpackages share also don't have anything sensitive and can be added to source control.
@martintoreilly they are just PS scripts, there is mention of the domain name but other than that there isn't anything too sensitive.
@martintoreilly : is there anything left here that hasn't been captured in a dedicated issue?
Happy to close this. Lots of it is done, lots is captured in other issues, some no longer relevant. If any of what's left is important enough we'll think of it again.
@jemrobinson I'm not up to speed with the new label system. Is this part of our transition to a DevOps model? Let me know what I should be updating.
Target for April DSG 2019
Do later
Still to triage
[ ] idempotent scripting
[x] Secret Generation and Preparation: secrets in management tier already exist. Create / reuse secrets when needed in KeyVault. Secrets generated in script (`[System.Web.Security.Membership]::GeneratePassword(20,0)`). Done in PR #174. `[System.Web.Security.Membership]` is not available when using PowerShell 6 with .NET Core on OSX, so we copied and modified the `[System.Web.Security.Membership]::GeneratePassword()` C# code in our own PowerShell function.
[ ] Scripts currently run on VMs can be moved to run from local machine
[ ] Installation of software packages (can click-throughs be automated?)
[ ] Fully automate certificate generation / install (e.g. DNS record response etc). (see issue #203)
[ ] Post-install automated (potentially continuous) sanity check (validation of machines running)
[ ] Start / Stop / Tear down scripting - process tbd for tear down preparation
[ ] Monitoring scripting
[ ] Split management and DSG specific secrets into separate key vaults (query if this is needed)
[ ] Add remote desktop to HackMD and Gitlab boxes?
[x] Have single config files for safe haven parameters and each DSG parameters that all scripts load parameters from
[ ] Consider separating build and deployment configuration as we do for the compute VMs. Pros: pays the build / software installation cost once rather than once per DSG. Severs our dependence on external downloads. Cons: We won't be validating that we can still build our environment from scratch on each deployment.
[ ] If we can do all our deployment via the Azure management plane (i.e. Azure CLI / SDK commands), consider not setting up the VPN gateway until the very end. It's only necessary if we need to log into any of the boxes directly, so is ideally only needed for troubleshooting.
[x] Make Create_New_DSG_User_Service_Accounts script robust to failures in user creation
[x] Standardise KeyVault name and secret names (can we drop the test environment element)?
[x] Update dsgpu user to dsvm or similar
[x] Have a single config file to set key fields that is read by all scripts (see dsg9-test.yml for starter for 10)
[ ] Give the various VMs read access to the "artifacts" storage account via Azure Active Directory authentication over SMB. Ideally give them read access to this repository and store the setup scripts here. The installers for the various apps are too large for GitHub, but we should be able to add download and installation steps to the VM setup script for the RDS server.
[ ] Automate renewal of Let's Encrypt for RDS SSL certs (and other SSL certs as required).
[ ] Consider whether we want a single pool of people with RDP access to the management gateway and the DSG gateways (probably yes - note that we are storing all DSG secrets in a single KeyVault in the management subscription, which makes this a single trust pool from a secrets management perspective)
[ ] Make scripts recoverable so that re-running after failure results in same deployment state as running without error while avoiding deleting and re-creating resources that have successfully deployed (we will probably need to explicitly track successful deployment of each resource and what "in progress" resources we need to delete on a re-run).
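On the secret-generation item above, a cross-platform replacement for `GeneratePassword(20,0)` is straightforward with a CSPRNG. The repo's actual fix is a PowerShell port of the .NET C# source; the Python sketch below just shows the idea, and the symbol set is an approximation of the .NET one:

```python
import secrets
import string

# Approximation of the punctuation set used by .NET's GeneratePassword.
PUNCTUATION = "!@#$%^&*()_-+=[{]};:<>|./?"

def generate_password(length=20):
    """Random password of printable ASCII drawn from a CSPRNG."""
    alphabet = string.ascii_letters + string.digits + PUNCTUATION
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

Unlike `GeneratePassword(20,0)`, this does not enforce a minimum count of non-alphanumeric characters; with the `0` argument the .NET call doesn't either.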
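The recoverable-scripts item could be sketched as a small state-tracking wrapper: record each successfully deployed resource, so a re-run skips completed steps instead of deleting and re-creating them. The resource names and `deploy_fn` callables below are hypothetical placeholders for the real Azure CLI / SDK calls:

```python
import json
from pathlib import Path

STATE_FILE = Path("deployment_state.json")

def load_state():
    """Read previously recorded deployment state, if any."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state):
    STATE_FILE.write_text(json.dumps(state, indent=2))

def ensure_deployed(name, deploy_fn, state):
    """Run deploy_fn only if `name` has not already succeeded.

    Resources left "in_progress" after a crash are the ones a re-run
    would need to clean up before retrying.
    """
    if state.get(name) == "succeeded":
        return
    state[name] = "in_progress"
    save_state(state)
    deploy_fn()  # placeholder for an Azure CLI / SDK call
    state[name] = "succeeded"
    save_state(state)

if __name__ == "__main__":
    state = load_state()
    ensure_deployed("vnet", lambda: None, state)
    ensure_deployed("nsg", lambda: None, state)
```

Marking a resource "in_progress" before the call gives the re-run enough information to know which partially created resources to delete.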
From Validate Feb 2019 Azure runbook #174
[x] Update the compute VM deployment scripts to use the convention `ldap-dsg<X>-<environment>-<resource-type>` (e.g. `ldap-dsg9-test-dsgpu`) for LDAP secret names (should be done as part of #176)
[ ] Add scripting for password protecting .pfx certificate when downloading (see: https://coombes.nz/blog/azure-keyvault-export-certificate/)
[x] Add a page to the runbook describing how to set up the subscription and what quotas to request for a subscription being used for test or production. See issue #125
[x] We had some issues with HackMD loading slowly due to trying to use a content distribution network (CDN) for pulling down javascript, style sheets etc. See issue #59 for fix for this. I'm not sure if this fix has been back ported to the HackMD deployment scripts / installer package. Verified in manual instructions in PR #174. Validated in automated setup in PR #239 (merged into PR #249)
[x] Ensure we install LaTeX + editor on report writing windows server (done in PR #174).
[ ] The P2S RootCert Public/Private Key pair should be different for each DSG (and the management segment). Add a section covering how to make a self-signed cert, upload this to the KeyVault and make new Client Certs.
[ ] Consider whether we should have different LDAP username and passwords for each compute VM instance (GitLab and HackMD have their own)? Pro: no shared secrets across VMs; Con: need to access management DC as admin to add new LDAP user, rather than just needing permission to deploy a new VM.
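On the password-protected `.pfx` item above, one possible approach with the `cryptography` package is to re-serialize the key and certificate as an encrypted PKCS#12 bundle. This is a sketch under assumed names: the throwaway self-signed certificate stands in for whatever is fetched from the KeyVault, and the friendly name and password are placeholders:

```python
from datetime import datetime, timedelta, timezone

from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives.serialization import (
    BestAvailableEncryption,
    pkcs12,
)

# Throwaway key + self-signed cert standing in for the KeyVault certificate.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "dsg.example.test")])
now = datetime.now(timezone.utc)
cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + timedelta(days=1))
    .sign(key, hashes.SHA256())
)

# Export as a password-protected .pfx (PKCS#12) bundle.
pfx_bytes = pkcs12.serialize_key_and_certificates(
    b"dsg-cert", key, cert, None, BestAvailableEncryption(b"S3cret!")
)
```

Writing `pfx_bytes` to disk gives a `.pfx` that cannot be imported without the password, which is the property the task above is after.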
To-dos for future versions
Ensure each of these is captured as an issue.
[ ] Make `Create_New_DSG_User_Service_Accounts` script robust to failures in user creation
[x] Standardise KeyVault name and secret names (can we drop the `test` environment element)?
[x] Update `dsgpu` user to `dsvm` or similar
[x] Have a single config file to set key fields that is read by all scripts (see `dsg9-test.yml` for starter for 10)
[ ] Give the various VMs read access to the "artifacts" storage account via Azure Active Directory authentication over SMB. Ideally give them read access to this repository and store the setup scripts here. The installers for the various apps are too large for GitHub, but we should be able to add download and installation steps to the VM setup script for the RDS server.
[ ] Automate renewal of Let's Encrypt for RDS SSL certs (and other SSL certs as required).
[ ] Consider whether we want a single pool of people with RDP access to the management gateway and the DSG gateways (probably yes - note that we are storing all DSG secrets in a single KeyVault in the management subscription, which makes this a single trust pool from a secrets management perspective)
[ ] Make scripts recoverable so that re-running after failure results in same deployment state as running without error while avoiding deleting and re-creating resources that have successfully deployed (we will probably need to explicitly track successful deployment of each resource and what "in progress" resources we need to delete on a re-run).