alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

SRE ssl certificate not getting created on deployment #2209

Closed mattwestby closed 3 weeks ago

mattwestby commented 1 month ago

:white_check_mark: Checklist

:computer: System information

:package: Packages

List of packages ```none Paste list of packages here ```

:no_entry_sign: Describe the problem

When attempting to deploy SRE v5.0.0, the SSL certificate used by the application gateway doesn't get created in the SRE Key Vault. When re-deploying, the logs just say `ERROR - Failed to retrieve certificate` and the deployment doesn't try to recreate it.

:deciduous_tree: Log messages

Relevant log messages

```none
ERROR - Failed to retrieve certificate testsre5-shm5-nottingham-ac-uk.
```

:recycle: To reproduce

jemrobinson commented 1 month ago

Thanks for this bug report @mattwestby. Can you add the supporting information about versions/OS etc. to the template above?

mattwestby commented 1 month ago

Hi guys, any update on this one? Thanks, Matt

jemrobinson commented 1 month ago

@mattwestby : I don't think we've been able to reproduce this (@craddm was going to look into deploying from an Azure Windows VM but I'm not sure how far he got with that). I might take a look at another way to upload the certificate but I haven't had time yet.

mattwestby commented 1 month ago

Thanks @jemrobinson - is there a way I could manually create the cert just to get me past this sticking point for now?

jemrobinson commented 1 month ago

> Thanks @jemrobinson - is there a way I could manually create the cert just to get me past this sticking point for now?

If you're able to run the following Python code on your deployment machine, inserting domain_name and admin_email_address as appropriate and adding a DNS TXT record when indicated, this should generate a certificate called <certificate name>.cert which you can upload to the SRE keyvault as a certificate called <certificate name>.

```python
import time

from cryptography.hazmat.primitives.asymmetric.rsa import RSAPrivateKey
from cryptography.hazmat.primitives.serialization import NoEncryption, load_pem_private_key, pkcs12
from cryptography.x509 import load_pem_x509_certificate
from simple_acme_dns import ACMEClient

domain_name = "whatever your domain name is"
admin_email_address = "whatever email address you're using"

# ACMEClient expects a list of domains, even when there is only one
client = ACMEClient(
    domains=[domain_name],
    email=admin_email_address,
    directory="https://acme-v02.api.letsencrypt.org/directory",
    nameservers=["8.8.8.8", "1.1.1.1"],
    new_account=True,
)

# Generate private key and CSR
# Note that we must set the key to RSA-2048 before generating the CSR
# The default is ecdsa-with-SHA256, which Azure Key Vault cannot read
private_key_bytes = client.generate_private_key(key_type="rsa2048")
client.generate_csr()
verification_tokens = client.request_verification_tokens().items()
print("At this point you will need to manually add a TXT record to the DNS zone for your SRE")
for record_name, record_values in verification_tokens:
    print(f"record_name {record_name.replace(f'.{domain_name}', '')}; record_value {record_values[0]}")

# Wait for DNS propagation to complete
while not client.check_dns_propagation(authoritative=False, round_robin=True, verbose=False):
    print("DNS propagation is ongoing")
    time.sleep(30)

# Request a signed certificate
certificate_bytes = client.request_certificate()
private_key = load_pem_private_key(private_key_bytes, None)
if not isinstance(private_key, RSAPrivateKey):
    msg = f"Private key is of type {type(private_key)} not RSAPrivateKey."
    raise TypeError(msg)

# The response is a PEM chain: the certificate for our domain plus the CA certificates
all_certs = [
    load_pem_x509_certificate(data)
    for data in certificate_bytes.split(b"\n\n")
]
certificate = next(cert for cert in all_certs if domain_name in str(cert.subject))
ca_certs = [cert for cert in all_certs if cert != certificate]

# Bundle the key, certificate and CA chain into an unencrypted PFX file
certificate_secret_name = domain_name.replace(".", "-")
pfx_bytes = pkcs12.serialize_key_and_certificates(
    certificate_secret_name.encode("utf-8"),
    private_key,
    certificate,
    ca_certs,
    NoEncryption(),
)
with open(f"{certificate_secret_name}.cert", "wb") as f_cert:
    f_cert.write(pfx_bytes)
```

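
Editorial note: the script above derives the certificate name from the domain name and splits the returned PEM chain on blank lines. The sketch below shows both steps in isolation, using only the standard library. The helper names `certificate_name_for` and `split_pem_chain` are hypothetical and not part of the project, and the marker-based split is a swapped-in alternative to the blank-line split, assuming the chain is a concatenation of standard `BEGIN CERTIFICATE` / `END CERTIFICATE` blocks.

```python
import re

# Matches one PEM-encoded certificate block, including its BEGIN/END markers
PEM_CERTIFICATE = re.compile(
    rb"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def certificate_name_for(domain_name: str) -> str:
    """Replace dots with hyphens, as the script above does for the Key Vault name."""
    return domain_name.replace(".", "-")


def split_pem_chain(chain: bytes) -> list[bytes]:
    """Split a PEM chain into individual certificates, whatever separates them."""
    return [match.group(0) + b"\n" for match in PEM_CERTIFICATE.finditer(chain)]


if __name__ == "__main__":
    # "sre.example.org" is a placeholder domain, not one from this issue
    print(certificate_name_for("sre.example.org"))
```

Each element returned by `split_pem_chain` could be passed to `load_pem_x509_certificate` in place of the pieces produced by `certificate_bytes.split(b"\n\n")`.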
JimMadge commented 1 month ago

@mattwestby were you able to reproduce this in a new deployment? Can you see if the cert does exist or not?

My best guess is that your deployment has somehow ended up in a state where Pulumi believes the cert has been created (it is in the Pulumi stack, so when you run deploy Pulumi will not try to create it) but the cert has not been put into storage.

If that is the case, I think the fix to your broken deployment is either,

and we may want to make changes to the code to make the cert generation more robust. However, I'm not certain there is a code change we would want to make which would fix your deployment as this feels like a rare occurrence which is mostly out of our control.
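
Editorial note: one way to answer the "does the cert exist or not?" question is to list the certificates in the SRE Key Vault and look for the expected name. The following is a minimal sketch, assuming the `azure-keyvault-certificates` package and a `client` that behaves like its `CertificateClient`; the helper name `certificate_exists` is hypothetical, and the account used needs permission to list certificates.

```python
def certificate_exists(client, certificate_name: str) -> bool:
    """Return True if the vault behind `client` lists a certificate with this name.

    `client` is assumed to expose `list_properties_of_certificates()`, as
    azure.keyvault.certificates.CertificateClient does.
    """
    return any(
        properties.name == certificate_name
        for properties in client.list_properties_of_certificates()
    )


# Example usage (placeholders in angle brackets; requires the Azure SDK and `az login`):
#
#   from azure.identity import AzureCliCredential
#   from azure.keyvault.certificates import CertificateClient
#
#   client = CertificateClient(
#       vault_url="https://<vault name>.vault.azure.net",
#       credential=AzureCliCredential(),
#   )
#   print(certificate_exists(client, "<certificate name>"))
```

If this returns `False` while Pulumi still reports the certificate as created, that would be consistent with the state mismatch described above.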

JimMadge commented 1 month ago

@mattwestby can you reproduce this? We haven't been able to.

jemrobinson commented 3 weeks ago

For future reference @JimMadge, my best guess as to why this happened is that the certificate was created on the deployment machine and that machine then tried to upload it to the Key Vault using its Managed Identity, rather than the appropriate Azure CLI credentials.
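
Editorial note: if the guess above is right, a manual upload avoids the problem by naming the credential explicitly instead of letting the SDK choose one. The following is a minimal sketch, assuming the `azure-identity` and `azure-keyvault-certificates` packages and a vault in the public Azure cloud; the helper names `build_cli_client` and `upload_pfx` are hypothetical, and this is not how the project itself uploads the certificate.

```python
from pathlib import Path


def build_cli_client(vault_name: str):
    """Create a Key Vault certificate client that only uses Azure CLI credentials."""
    # Imported here so that upload_pfx can be used with any compatible client
    from azure.identity import AzureCliCredential
    from azure.keyvault.certificates import CertificateClient

    return CertificateClient(
        # Assumes the public-cloud Key Vault URL format
        vault_url=f"https://{vault_name}.vault.azure.net",
        credential=AzureCliCredential(),
    )


def upload_pfx(client, certificate_name: str, pfx_path: str):
    """Import an unencrypted PFX file into the vault under `certificate_name`."""
    pfx_bytes = Path(pfx_path).read_bytes()
    return client.import_certificate(
        certificate_name=certificate_name,
        certificate_bytes=pfx_bytes,
    )


# Example usage (placeholders in angle brackets; run `az login` first):
#
#   client = build_cli_client("<vault name>")
#   upload_pfx(client, "<certificate name>", "<certificate name>.cert")
```

`AzureCliCredential` authenticates as whoever is logged in through `az login`, so a Managed Identity attached to the deployment VM is never tried.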