Open miekg opened 7 years ago
Hello @miekg,
this sounds like a great idea!
I believe the best approach would be to create an API in github.com/xenolf/lego/acme
-- It might be useful for other projects using lego/acme
Basically you should "hack in into ACME" 😄
On the other hand, I believe it would be easier to create those metrics directly in Caddy and expose them to caddy-prometheus
[ Quoting notifications@github.com in "Re: [mholt/caddy] Exposing TLS cert..." ]
Hello @miekg,
this sounds like a great idea!
I believe the best approach would be to expose those metrics in github.com/xenolf/lego/acme
-- It might be useful for other projects using lego/acme
Basically you should "hack in into ACME" 😄
Interesting thought. I'll look into that, there could either be an API in ACME, or directly expose prometheus metrics. I'll open an issue there.
On the other hand, I believe it would be easier to create those metrics directly in Caddy and expose them to caddy-prometheus
But how do I get access to the information that those metrics will expose?
/Miek
-- Miek Gieben
But how do I get access to the information that those metrics will expose?
You could import caddy/caddytls in caddy-prometheus and use those metrics -- I have not really thought deeply about this...
I just posted some ideas I had :smile:, my raw thoughts.
Interesting thought. I'll look into that, there could either be an API in ACME, or directly expose prometheus metrics. I'll open an issue there.
Awesome :smile:
I think we can do something like this; Caddy can emit events but right now it only emits one (startup) because we're adding events based on need. I don't have a page on the wiki yet explaining how to hook into events (I'll do that soon) but it's very easy; you can look at the caddy-service plugin and see how it does it.
Although I think we'll need to change the signature of event hooks to pass in some information. Shouldn't be too hard since only that plugin uses event hooks; I'll talk to the authors about getting it updated.
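To make the idea of event hooks that receive information concrete, here is a minimal sketch in Go. It does not use Caddy's actual API: `EventName`, `EventHook`, `Hooks`, and `CertEventInfo` are hypothetical names invented for illustration. It only shows the shape of a hook signature that takes a payload, so that a metrics plugin could count events.

```go
package main

import (
	"fmt"
	"sync"
)

// EventName and EventHook are hypothetical stand-ins, not Caddy's real types.
type EventName string

// EventHook receives the event name plus an arbitrary payload.
type EventHook func(event EventName, info interface{}) error

// Hooks is a small registry of named hooks.
type Hooks struct {
	mu    sync.RWMutex
	hooks map[string]EventHook
}

// NewHooks returns an empty registry.
func NewHooks() *Hooks { return &Hooks{hooks: make(map[string]EventHook)} }

// Register adds a hook under a unique name.
func (h *Hooks) Register(name string, hook EventHook) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, exists := h.hooks[name]; exists {
		return fmt.Errorf("hook %q already registered", name)
	}
	h.hooks[name] = hook
	return nil
}

// Emit calls every registered hook and returns the first error, if any.
func (h *Hooks) Emit(event EventName, info interface{}) error {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, hook := range h.hooks {
		if err := hook(event, info); err != nil {
			return err
		}
	}
	return nil
}

// CertEventInfo is a hypothetical payload for certificate events.
type CertEventInfo struct {
	Domain string
	Err    error
}

func main() {
	hooks := NewHooks()
	counts := map[EventName]int{}

	// A metrics plugin would register a hook like this one.
	_ = hooks.Register("metrics", func(ev EventName, info interface{}) error {
		counts[ev]++
		return nil
	})

	_ = hooks.Emit("cert_renewed", CertEventInfo{Domain: "example.com"})
	fmt.Println(counts["cert_renewed"])
}
```

Passing the payload as `interface{}` keeps the hook signature stable as new event types are added, at the cost of a type assertion in each hook.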
I think this is better done in Caddy than lego.
Interesting. What new events would we need? As I alluded to above: QPS to Let's Encrypt, possible errors returned from Let's Encrypt, and TLS cert properties (I don't know what is interesting there).
We could add a lot of events around TLS management. Caddy scans loaded certs every 12 hours for renewal, or every hour for OCSP stapling, those are events (although maybe uninteresting, and not needed to emit). When Caddy starts renewing a certificate, or finishes successfully or has an error, those are events. When it updates OCSP staples, that's an event. When it obtains a new certificate with on-demand TLS, that's an event. When it loads a certificate from disk, that's an event. By emitting these with some information attached (currently infeasible but we can change that) you should be able to get what you'd need, I think?
QPS to Let's Encrypt specifically doesn't make much sense since requests to LE are few and far between.
Edit: If you want really fine-grained information, such as every low-level network interaction that lego makes in order to validate a challenge, that'll have to be built into lego, of course, and I dunno how that would go. But for everything at or above "Obtain a certificate" or "renew a certificate", Caddy can emit those.
@miekg If this is still something you're interested in, I think you should join our discussion in the #telemetry channel in Slack. I think that will be the best fit for this feature, since we expect telemetry to be exportable in various formats or directly to external monitoring services.
Either that or we overhaul logging to be able to tee to different outputs.
/cc @hairyhenderson
@mholt Hello ! 😃 Any recent update on this one ? It would be awesome to have the aforementioned metrics integrated !
Just adding a top-up request for TLS certificate metrics; specifically I'm interested in metrics on ACME renewals and particularly renewal failures. The goal here is to get metrics on how many times we are renewing/issuing within a given window to see where we are in relation to Let's Encrypt rate limits. Additionally, any metrics on failed renewals would be useful, since failed validations count towards the new order limits.
It'd be great if we can collaboratively create the list of metrics labels and descriptions based on what you, as users, need to track. Based on that, we can assess wiring up the collectors.
Sure, I'm happy to kick us off; here are some examples of things I think I'd find useful.
I'd group it all under 'certs' and try to correlate with the certmagic events?
# HELP caddy_certs_obtain_total Total certificates obtained.
# TYPE caddy_certs_obtain_total counter
caddy_certs_obtain_total{issuer="acme",server="srv3",action="new",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
caddy_certs_obtain_total{issuer="acme",server="srv3",action="renewal",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
# HELP caddy_certs_obtain_error_total Total certificate errors.
# TYPE caddy_certs_obtain_error_total counter
caddy_certs_obtain_error_total{issuer="acme",server="srv3",action="new",reason="validation-failed",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
caddy_certs_obtain_error_total{issuer="acme",server="srv3",action="renewal",reason="rate-limit",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
# HELP caddy_certs_obtain_inflight In-flight certificates being obtained.
# TYPE caddy_certs_obtain_inflight gauge
caddy_certs_obtain_inflight{issuer="acme",server="srv3",action="new",directory="https://acme-staging-v02.api.letsencrypt.org/directory"} 1
caddy_certs_obtain_inflight{issuer="internal",server="srv3",action="renewal"} 1
# HELP caddy_certs_remaining_seconds_at_renewal The remaining time on certificates at successful renewal (if this is skewing to quite late it means we're not hitting that 1/3 target)
# TYPE caddy_certs_remaining_seconds_at_renewal histogram
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.005"} 5674
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.01"} 5849
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.025"} 13168
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.05"} 17458
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.1"} 21198
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.25"} 23510
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="0.5"} 23721
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="1"} 23753
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="2.5"} 23793
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="5"} 23850
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="10"} 23923
caddy_certs_remaining_seconds_at_renewal_bucket{issuer="acme",le="+Inf"} 23998
caddy_certs_remaining_seconds_at_renewal_sum{issuer="acme"} 3877.6205545949847
caddy_certs_remaining_seconds_at_renewal_count{issuer="acme"} 23998
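For reference, label pairs in the Prometheus text exposition format are written as `name="value"` and separated by commas. The following standard-library-only Go sketch renders one sample line in that shape. `FormatSample` is a hypothetical helper written for this illustration; a real collector would more likely use the official Prometheus Go client than format lines by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper escapes backslash, double quote, and newline in label values,
// as the text exposition format requires.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// FormatSample renders one sample line: name{k1="v1",k2="v2"} value.
// Labels are sorted by key so the output is deterministic.
// Hypothetical helper for illustration only.
func FormatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, labelEscaper.Replace(labels[k])))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	line := FormatSample("caddy_certs_obtain_total", map[string]string{
		"issuer": "acme",
		"server": "srv3",
		"action": "new",
	}, 1)
	fmt.Println(line)
	// prints: caddy_certs_obtain_total{action="new",issuer="acme",server="srv3"} 1
}
```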
This is a question/feature request.
For caddy-prometheus (github.com/miekg/caddy-prometheus) I would like to expose metrics from the certs caddy has configured. Like time left per cert, failures to update, etc. etc.
Is there an API for this, should caddy expose some of these bits, or should I hack it into ACME (I believe that is used for handling LE comms)?