caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Exposing TLS certificate metrics #1683

Open miekg opened 7 years ago

miekg commented 7 years ago

This is a question/feature request.

For caddy-prometheus (github.com/miekg/caddy-prometheus) I would like to expose metrics from the certs caddy has configured. Like time left per cert, failures to update, etc. etc.

Is there an API for this, should Caddy expose some of these bits, or should I hack it into ACME (I believe that is used for handling LE comms)?

elcore commented 7 years ago

Hello @miekg,

this sounds like a great idea!

I believe the best approach would be to create an API in github.com/xenolf/lego/acme -- It might be useful for other projects using lego/acme

Basically you should "hack it into ACME" 😄


On the other hand, I believe it would be easier to create those metrics directly in Caddy and expose them to caddy-prometheus

miekg commented 7 years ago


> Hello @miekg,
>
> this sounds like a great idea!
>
> I believe the best approach would be to expose those metrics in github.com/xenolf/lego/acme -- It might be useful for other projects using lego/acme
>
> Basically you should "hack it into ACME" 😄

Interesting thought. I'll look into that, there could either be an API in ACME, or directly expose prometheus metrics. I'll open an issue there.

> On the other hand, I believe it would be easier to create those metrics directly in Caddy and expose them to caddy-prometheus

But how do I get access to the information that those metrics will expose?

/Miek

-- Miek Gieben

elcore commented 7 years ago

> But how do I get access to the information that those metrics will expose?

You could import caddy/caddytls in caddy-prometheus and use those metrics. I have not really thought deeply about this...

I just posted some ideas I had 😄, my raw thoughts.

elcore commented 7 years ago

> Interesting thought. I'll look into that, there could either be an API in ACME, or directly expose prometheus metrics. I'll open an issue there.

Awesome 😄

mholt commented 7 years ago

I think we can do something like this; Caddy can emit events but right now it only emits one (startup) because we're adding events based on need. I don't have a page on the wiki yet explaining how to hook into events (I'll do that soon) but it's very easy; you can look at the caddy-service plugin and see how it does it.

Although I think we'll need to change the signature of event hooks to pass in some information. Shouldn't be too hard since only that plugin uses event hooks; I'll talk to the authors about getting it updated.

I think this is better done in Caddy than lego.

miekg commented 7 years ago

Interesting. What new events would we need? As I alluded to above: QPS to Let's Encrypt, possible errors returned from Let's Encrypt, and TLS cert properties (I don't know what is interesting there).

mholt commented 7 years ago

We could add a lot of events around TLS management. Caddy scans loaded certs every 12 hours for renewal, or every hour for OCSP stapling, those are events (although maybe uninteresting, and not needed to emit). When Caddy starts renewing a certificate, or finishes successfully or has an error, those are events. When it updates OCSP staples, that's an event. When it obtains a new certificate with on-demand TLS, that's an event. When it loads a certificate from disk, that's an event. By emitting these with some information attached (currently infeasible but we can change that) you should be able to get what you'd need, I think?

QPS to Let's Encrypt specifically doesn't make much sense since requests to LE are few and far between.

Edit: If you want really fine-grained information, such as every low-level network interaction that lego makes in order to validate a challenge, that'll have to be built into lego, of course, and I dunno how that would go. But for everything at or above "Obtain a certificate" or "renew a certificate", Caddy can emit those.
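To make the idea of events "with some information attached" concrete, here is a minimal, self-contained sketch of what a hook with a payload could look like. Every name in it (`EventName`, `CertEvent`, `Bus`, `Counter`, and the event constants) is hypothetical and chosen for illustration; none of it is Caddy's actual event API, whose hook signature was still under discussion at this point.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// EventName and the constants below are hypothetical, not Caddy's event names.
type EventName string

const (
	CertRenewStarted EventName = "cert_renew_started"
	CertRenewed      EventName = "cert_renewed"
	CertRenewFailed  EventName = "cert_renew_failed"
)

// CertEvent is a hypothetical payload passed along with each event.
type CertEvent struct {
	Name     string    // subject name on the certificate
	NotAfter time.Time // expiry, if known
	Err      error     // set only for failure events
}

// Hook receives the event name together with its payload.
type Hook func(name EventName, info CertEvent)

// Bus fans events out to every registered hook.
type Bus struct {
	mu    sync.RWMutex
	hooks []Hook
}

// Register adds a hook that will be called for every emitted event.
func (b *Bus) Register(h Hook) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.hooks = append(b.hooks, h)
}

// Emit calls each hook synchronously, in registration order.
func (b *Bus) Emit(name EventName, info CertEvent) {
	b.mu.RLock()
	hooks := append([]Hook(nil), b.hooks...) // copy so hooks run without the lock held
	b.mu.RUnlock()
	for _, h := range hooks {
		h(name, info)
	}
}

// Counter tallies events by name, the way a metrics plugin might.
type Counter struct {
	mu     sync.Mutex
	counts map[EventName]int
}

func NewCounter() *Counter { return &Counter{counts: make(map[EventName]int)} }

// Hook satisfies the Hook type and increments the tally for the event.
func (c *Counter) Hook(name EventName, _ CertEvent) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[name]++
}

// Count reports how many times the named event has been seen.
func (c *Counter) Count(name EventName) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[name]
}

func main() {
	bus := &Bus{}
	c := NewCounter()
	bus.Register(c.Hook)

	bus.Emit(CertRenewStarted, CertEvent{Name: "example.com"})
	bus.Emit(CertRenewFailed, CertEvent{Name: "example.com", Err: errors.New("validation failed")})

	fmt.Println("started:", c.Count(CertRenewStarted), "failed:", c.Count(CertRenewFailed))
}
```

A Prometheus plugin would register one such hook and translate each event into a counter increment or gauge update, without Caddy needing to know about Prometheus at all.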

mholt commented 6 years ago

@miekg If this is still something you're interested in, I think you should join our discussion in the #telemetry channel in Slack. I think that will be the best fit for this feature, since we expect telemetry to be exportable in various formats or directly to external monitoring services.

Either that or we overhaul logging to be able to tee to different outputs.

mholt commented 4 years ago

/cc @hairyhenderson

loss commented 3 months ago

@mholt Hello! 😃 Any recent update on this one? It would be awesome to have the aforementioned metrics integrated!

alistairjevans commented 2 months ago

Just adding a top-up request for TLS certificate metrics; specifically, I'm interested in metrics on ACME renewals and particularly renewal failures. The goal here is to get metrics on how many times we are renewing/issuing within a given window, to see where we are in relation to Let's Encrypt rate limits. Any metrics on failed renewals would also be useful, since failed validations count towards the new order limits.

mohammed90 commented 2 months ago

It'd be great if we can collaboratively create the list of metrics labels and descriptions based on what you, as users, need to track. Based on that, we can assess wiring up the collectors.

alistairjevans commented 2 months ago

Sure, I'm happy to kick us off. Here are some examples of things I think I'd find useful.

I'd group it all under 'certs' and try to correlate with the certmagic events?

# HELP caddy_certs_obtain_total Total certificates obtained.
# TYPE caddy_certs_obtain_total counter
caddy_certs_obtain_total{issuer="acme",server="srv3",action="new",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
caddy_certs_obtain_total{issuer="acme",server="srv3",action="renewal",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1

# HELP caddy_certs_obtain_error_total Total certificate errors.
# TYPE caddy_certs_obtain_error_total counter
caddy_certs_obtain_error_total{issuer="acme",server="srv3",action="new",reason="validation-failed",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
caddy_certs_obtain_error_total{issuer="acme",server="srv3",action="renewal",reason="rate-limit",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1

# HELP caddy_certs_obtain_inflight In-flight certificates being obtained.
# TYPE caddy_certs_obtain_inflight gauge
caddy_certs_obtain_inflight{issuer="acme",server="srv3",action="new",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
caddy_certs_obtain_inflight{issuer="internal",server="srv3","action":"renewal"} 1

# HELP caddy_certs_remaining_seconds_at_renewal The remaining time on certificates at successful renewal (if this is skewing to quite late it means we're not hitting that 1/3 target)
# TYPE caddy_certs_remaining_seconds_at_renewal histogram
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.005"} 5674
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.01"} 5849
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.025"} 13168
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.05"} 17458
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.1"} 21198
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.25"} 23510
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.5"} 23721
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="1"} 23753
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="2.5"} 23793
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="5"} 23850
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="10"} 23923
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="+Inf"} 23998
caddy_certs_remaining_seconds_at_renewal_sum{issuer="acme"} 3877.6205545949847
caddy_certs_remaining_seconds_at_renewal_count{issuer="acme"} 23998
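
Prometheus' text exposition format writes labels as comma-separated `name="value"` pairs, with backslashes, double quotes, and newlines escaped inside the value. As a minimal sketch of that syntax using only the Go standard library, the hypothetical helper `FormatSample` below renders one sample line; a real collector would more likely use the official Prometheus Go client instead of formatting lines by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper escapes the three characters that the text exposition
// format requires to be escaped inside label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// FormatSample renders a single sample line such as
//   metric_name{a="x",b="y"} 1
// Labels are sorted by name so the output is deterministic.
// (Hypothetical helper for illustration only.)
func FormatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, labelEscaper.Replace(labels[k])))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(FormatSample("caddy_certs_obtain_total", map[string]string{
		"issuer": "acme",
		"server": "srv3",
		"action": "renewal",
	}, 1))
}
```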