alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Use appropriate provider for SHM DNS record #2202

Closed: JimMadge closed this 1 month ago

JimMadge commented 1 month ago

:closed_umbrella: Related issues

Closes #2201

github-actions[bot] commented 1 month ago

Coverage report

| File | Lines missing (new statements) |
| --- | --- |
| data_safe_haven/commands/sre.py | |
| data_safe_haven/external/api/azure_sdk.py | 439-443, 783-794 |
| data_safe_haven/infrastructure/programs/declarative_sre.py | 56-58 |
| data_safe_haven/infrastructure/programs/sre/networking.py | 48-50, 1841 |

This report was generated by python-coverage-comment-action

JimMadge commented 1 month ago

I'm getting a failure to create the SSL cert.

 +  pulumi-python:dynamic:Resource sre_data_kvc_https_certificate
**creating failed** error: Exception calling application: Failed to
create SSL certificate resrudh-kernow-develop-turingsafehaven-ac-uk for
resrudh.kernow.develop.turingsafehaven.ac.uk.  Failed to create DNS TXT record _acme-challenge
in zone resrudh.kernow.develop.turingsafehaven.ac.uk.

Azure SDK says it failed to create the TXT record. Feels like an odd thing to crop up here as this is all within the SRE subscription. The DNS resource already has a set of records deployed with Pulumi.

Possibly related to #2209.

Any thoughts, @jemrobinson?

jemrobinson commented 1 month ago

Is this perhaps using the SHM provider instead of the (default) SRE one?

jemrobinson commented 1 month ago

The entire relevant section of code is:

client = ACMEClient(
    domains=[props["domain_name"]],
    email=props["admin_email_address"],
    directory="https://acme-v02.api.letsencrypt.org/directory",
    nameservers=["8.8.8.8", "1.1.1.1"],
    new_account=True,
)
private_key_bytes = client.generate_private_key(key_type="rsa2048")
client.generate_csr()
azure_sdk = AzureSdk(props["subscription_name"], disable_logging=True)
verification_tokens = client.request_verification_tokens().items()
for record_name, record_values in verification_tokens:
    record_set = azure_sdk.ensure_dns_txt_record(...)

which makes me feel like perhaps the ACMEClient is not being correctly created and/or the generate_private_key() or generate_csr() functions aren't doing what we expect. Can you retry with some manual logging interventions (e.g. set disable_logging=False in the AzureSdk call and add some logger.info lines to help diagnose)?
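
As a minimal sketch of the kind of logging intervention suggested here, the helper below logs each verification token before any DNS record is written. The function name `log_verification_tokens` and the logger name are hypothetical and not part of the data-safe-haven code; it only assumes that the tokens arrive as a mapping from record name to record values, as the `.items()` call in the excerpt above implies.

```python
import logging
from collections.abc import Mapping, Sequence

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("ssl_certificate_debug")


def log_verification_tokens(tokens: Mapping[str, Sequence[str]]) -> int:
    """Log every DNS-01 challenge record and return how many were seen."""
    count = 0
    for record_name, record_values in tokens.items():
        # One line per record makes it easy to compare against the DNS zone.
        logger.info("Challenge record %s has values %s", record_name, list(record_values))
        count += 1
    return count
```

Calling this on the result of `client.request_verification_tokens()` just before the loop would show whether the ACME client produced any challenge records at all.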

N.B. this uses the production Let's Encrypt server, so for debugging it's worth manually changing to the staging server.
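
One way to make that switch less error-prone is sketched below. The two URLs are the public Let's Encrypt ACME v2 production and staging directories; the constant names and the `acme_directory` helper are assumptions made for illustration and do not exist in the project.

```python
# Public Let's Encrypt ACME v2 directory endpoints.
LETSENCRYPT_PRODUCTION = "https://acme-v02.api.letsencrypt.org/directory"
LETSENCRYPT_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"


def acme_directory(debug: bool = False) -> str:
    """Return the staging directory while debugging, production otherwise."""
    return LETSENCRYPT_STAGING if debug else LETSENCRYPT_PRODUCTION
```

The result could then be passed as the `directory` argument in the excerpt above, e.g. `directory=acme_directory(debug=True)`. Certificates issued by the staging server are not publicly trusted, so this is only suitable for diagnosis.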

JimMadge commented 1 month ago

Yeah that's what I've been looking at. I'll do a bit more digging.

I would be surprised if it were a wrong subscription problem as the SRE subscription gets passed to the dynamic provider as an argument, then the Azure SDK client is created using that argument.

JimMadge commented 1 month ago

I can query the DNS zone using AZ CLI so the permissions should be correct.

jemrobinson commented 1 month ago

This could be the same SSL certificate problem as #2209, i.e. unrelated to these subscription changes.

JimMadge commented 1 month ago

On a positive note, creating the NS records in the SHM DNS zone works :+1:.

JimMadge commented 1 month ago

Getting somewhere now:

**creating failed** error: Exception calling application: Failed to
create SSL certificate porthperan-kernow-develop-turingsafehaven-ac-uk for
porthperan.kernow.develop.turingsafehaven.ac.uk.  Failed to create DNS TXT record
_acme-challenge in zone porthperan.kernow.develop.turingsafehaven.ac.uk.
(ResourceGroupNotFound) Resource group 'shm-kernow-sre-porthperan-rg' could not be found.

JimMadge commented 1 month ago

The AzureSdk class is using the wrong subscription 🤔.
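
A toy model of why this produces ResourceGroupNotFound rather than a permissions error: resource groups are looked up within the subscription the client is bound to, so a correct resource group name still fails under the wrong subscription. Everything below is a hypothetical stand-in (the `FakeAzureSdk` class, the subscription names, and the `shm-kernow-rg` group); only the resource group name `shm-kernow-sre-porthperan-rg` comes from the error above, and none of this is the real AzureSdk implementation.

```python
class ResourceGroupNotFoundError(Exception):
    """Stand-in for the Azure (ResourceGroupNotFound) error."""


# Hypothetical layout: which resource groups live in which subscription.
SUBSCRIPTIONS = {
    "shm-subscription": {"shm-kernow-rg"},
    "sre-subscription": {"shm-kernow-sre-porthperan-rg"},
}


class FakeAzureSdk:
    """Minimal stand-in for a client bound to one subscription."""

    def __init__(self, subscription_name: str) -> None:
        self.subscription_name = subscription_name

    def ensure_dns_txt_record(self, resource_group_name: str, record_name: str) -> str:
        # The lookup only searches the subscription this client was created with.
        if resource_group_name not in SUBSCRIPTIONS[self.subscription_name]:
            msg = f"Resource group '{resource_group_name}' could not be found."
            raise ResourceGroupNotFoundError(msg)
        return f"{record_name} created in {resource_group_name}"
```

With this model, `FakeAzureSdk("sre-subscription")` creates the record, while `FakeAzureSdk("shm-subscription")` raises for the same resource group name, which matches the symptom in the log.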

JimMadge commented 1 month ago

https://github.com/alan-turing-institute/data-safe-haven/blob/27e9dcac26c5ca2b738731d0030aa7625c150930/data_safe_haven/infrastructure/programs/declarative_sre.py#L195

There we go 😆

JimMadge commented 1 month ago

Deployment succeeds at a0ac911. SRE provisioning manager throws a similar error.