ACME certificate obtain failure due to completed order

krezovic commented 3 years ago

I've configured ACME in caddy2 running inside kubernetes and exposed both ports 80 and 443 on external load balancer. Hostname is configured to the load balancer IP, and we use in-house acme server (links are redacted) which seems to complete the order before caddy calls "complete order" endpoint.

Could it be a problem that I started two caddy2 replicas when the order was initially created?

The following is seen in the log:

{"level":"info","ts":1604831334.2792566,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["hostname.redacted.net"]}
{"level":"info","ts":1604831334.2795486,"msg":"autosaved config","file":"/home/caddy/.config/caddy/autosave.json"}
{"level":"info","ts":1604831334.2796035,"msg":"serving initial configuration"}
{"level":"info","ts":1604831334.2796938,"logger":"watcher","msg":"watching config file for changes","config_file":"/etc/caddy/Caddyfile"}
{"level":"info","ts":1604831334.2797627,"logger":"tls","msg":"cleaned up storage units"}
{"level":"info","ts":1604831334.2801478,"logger":"tls.obtain","msg":"acquiring lock","identifier":"hostname.redacted.net"}
{"level":"info","ts":1604831334.280494,"logger":"tls.obtain","msg":"lock acquired","identifier":"hostname.redacted.net"}
{"level":"info","ts":1604831334.8188531,"logger":"tls.issuance.acme","msg":"waiting on internal rate limiter","identifiers":["hostname.redacted.net"]}
{"level":"info","ts":1604831334.8191102,"logger":"tls.issuance.acme","msg":"done waiting on internal rate limiter","identifiers":["hostname.redacted.net"]}
{"level":"info","ts":1604831335.1436524,"logger":"tls.issuance.acme.acme_client","msg":"trying to solve challenge","identifier":"hostname.redacted.net","challenge_type":"http-01","ca":"https://valid.acme.com/directory"}
{"level":"info","ts":1604831347.3850152,"logger":"tls.issuance.acme","msg":"served key authentication","identifier":"hostname.redacted.net","challenge":"http-01","remote":"10.6.0.21:64959"}
{"level":"info","ts":1604831349.7754695,"logger":"tls.issuance.acme.acme_client","msg":"validations succeeded; finalizing order","order":"https://valid.acme.com/order/dvWgUhR4QzYfzcq"}
{"level":"error","ts":1604831349.995524,"logger":"tls.obtain","msg":"will retry","error":"[hostname.redacted.net] Obtain: [hostname.redacted.net] finalizing order https://valid.acme.com/order/dvWgUhR4QzYfzcq: unexpected state: ready - order already finalized (ca=https://valid.acme.com/directory)","attempt":1,"retrying_in":60,"elapsed":15.714963749,"max_duration":2592000}
{"level":"info","ts":1604831410.460576,"logger":"tls.issuance.acme.acme_client","msg":"validations succeeded; finalizing order","order":"https://valid.acme.com/order/6vAfoFweFcmISpA"}
{"level":"error","ts":1604831410.6749024,"logger":"tls.obtain","msg":"will retry","error":"[hostname.redacted.net] Obtain: [hostname.redacted.net] finalizing order https://valid.acme.com/order/6vAfoFweFcmISpA: unexpected state: ready - order already finalized (ca=https://valid.acme.com/directory)","attempt":2,"retrying_in":120,"elapsed":76.39434291,"max_duration":2592000}

I also found a similar bug in cert-manager, so not sure if they use same library:

https://github.com/jetstack/cert-manager/issues/2765

We're using caddy2 2.2.1.

francislavoie commented 3 years ago

Could it be a problem that I started two caddy2 replicas when the order was initially created?

Yes, that could be the problem, if they don't share the same storage. Caddy requires all instances to use the same storage backend to properly coordinate.

https://caddyserver.com/docs/automatic-https#storage

When using the Docker image, the default storage is the /data directory.

krezovic commented 3 years ago

I've scaled down replicas to 1 after I got the message the first time, but it still won't go away ... Any ideas how to proceed?

francislavoie commented 3 years ago

Well, the error is coming from your acme server, no? It doesn't actually sound like a problem with Caddy to me.

Caddy's ACME implementation is https://github.com/mholt/acmez if you wanted to dig around. @mholt may have more insight when he has a minute.

mholt commented 3 years ago

It's 99% unlikely to be a bug in Caddy or CertMagic or acmez. There's not nearly enough information here to reproduce the bug, unfortunately, and I'm almost certain it's a bug in some home-brewed ACME server implementation.

krezovic commented 3 years ago

Thank you both for your input. We will contact our ACME team. I was just surprised that it works just fine with cert-manager, but not with caddy2 so I thought it to be a bug.

Best regards.

mholt commented 3 years ago

Keep us posted. Happy to reopen if it can be narrowed down as a bug in our stack somewhere. But with the limited information here so far, the only thing I can tell is different is the homebrewed acme server, so that's my current suspicion.

caddyserver / caddy

ACME certificate obtain failure due to completed order #3857