apache / trafficcontrol

Apache Traffic Control is an Open Source implementation of a Content Delivery Network
https://trafficcontrol.apache.org/
Apache License 2.0
1.08k stars 344 forks source link

Let's Encrypt certificates sometimes fails to write to Traffic Vault - postgresql #7129

Open smalenfant opened 2 years ago

smalenfant commented 2 years ago

Notices that this happened after a 6.1.0 upgrade from 5.1.2 and also after a Traffic Vault Riak -> Postgresql migration.

This Bug Report affects these Traffic Control components:

Current behavior:

When requesting certificate renewal, the ACME/Let's Encrypt process goes through but fails to write the new certificate to the database.

/var/log/message:

Oct 12 13:52:13 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:13 [INFO] acme: Querying account for https://acme-v02.api.letsencrypt.org/acme/acct/102607427
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [] acme: Trying renewal with 543 hours remaining
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Obtaining bundled SAN certificate
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/163738563286
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Could not find solver for: tls-alpn-01
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Could not find solver for: http-01
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: use dns-01 solver
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Preparing to solve DNS-01
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Trying to solve DNS-01
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Checking DNS record propagation using [x.x.x.x:53]
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] Wait for propagation [timeout: 20m0s, interval: 30s]
Oct 12 13:52:14 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:14 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Waiting for DNS record propagation.
Oct 12 13:52:44 cdn1cdcms0001 traffic_ops: 2022/10/12 13:52:44 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Waiting for DNS record propagation.
Oct 12 13:53:01 cdn1cdcms0001 systemd: Started Session 56323 of user root.
Oct 12 13:53:20 cdn1cdcms0001 traffic_ops: 2022/10/12 13:53:20 [INFO] [crs.national-linear-red.cdn1.coxlab.net] The server validated our request
Oct 12 13:53:20 cdn1cdcms0001 traffic_ops: 2022/10/12 13:53:20 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Cleaning DNS-01 challenge
Oct 12 13:53:20 cdn1cdcms0001 traffic_ops: 2022/10/12 13:53:20 [INFO] [crs.national-linear-red.cdn1.coxlab.net] acme: Validations succeeded; requesting certificates
Oct 12 13:53:20 cdn1cdcms0001 traffic_ops: 2022/10/12 13:53:20 [INFO] [crs.national-linear-red.cdn1.coxlab.net] Server responded with a certificate.

Traffic Ops Log:

x.x.x.x:55562 national-linear-red: putting keys in Traffic Vault: could not begin Traffic Vault PostgreSQL transaction: context deadline exceeded: context deadline exceeded
Error posting acme certificate to Traffic Vault: could not begin Traffic Vault PostgreSQL transaction: context deadline exceeded: context deadline exceeded
failed to write response (method = POST, URL = /api/4.0/deliveryservices/xmlId/national-linear-red/sslkeys/renew, request ID = 711, remote addr = x.x.x.x:55562, bytes written = 0): http: Handler timeout

Expected behavior:

Certificate to be written to DB.

Steps to reproduce:

See above.

ocket8888 commented 2 years ago

Notices that this happened after a 6.1.0 upgrade from 5.1.2 and also after a Traffic Vault Riak -> Postgresql migration.

Does that mean after doing both of those things, or after doing either of them?

smalenfant commented 2 years ago

Seems related to #7128. Looks like there is issues with Traffic Vault - postgresql backend.

mitchell852 commented 2 years ago

@smalenfant - can you add to the description which version(s) of TC this bug exists in.

smalenfant commented 2 years ago

@mitchell852 I added. I'm pretty this should also affect master/7.1.x as I don't see changes regarding the backend.

smalenfant commented 1 year ago

Just an update here. When I issue a renew, this will always happen on the first try (500 error on the UI). I just go back and renew again and it "renews successfully".

First "renew": 2023/01/23 14:01:45 [INFO] [*.ds.cdn1.coxlab.net] acme: Validations succeeded; requesting certificates Error posting acme certificate to Traffic Vault: could not begin Traffic Vault PostgreSQL transaction: context deadline exceeded: context deadline exceeded

Second "renew": 2023/01/23 14:03:42 [INFO] [*.ds.cdn1.coxlab.net] Server responded with a certificate.

smalenfant commented 1 year ago

I fixed/workaround this problem by increasing the timeouts in cdn.conf from 60 seconds to 120 seconds. The renew API is not asynchronous and prone to timeout depending how long it takes to complete the renewal.

I'm not exactly sure which one of the 8-10 timeout defined fixed it. Possibly the idle_timeout.