caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0
55.53k stars 3.91k forks source link

Add `per_host` option to `metrics` handler #6279

Open hussam-almarzoq opened 2 months ago

hussam-almarzoq commented 2 months ago

This addresses parts of https://github.com/caddyserver/caddy/issues/3784 and https://github.com/caddyserver/caddy/issues/4016. I understand the considerations mentioned by @hairyhenderson regarding the cardinality issue, but I think this is more of a Prometheus issue rather than a Caddy issue. Unlike https://github.com/caddyserver/caddy/issues/4644, no change in Caddy would solve it.

I think adding a clear caution in the docs explaining the implications of enabling this option, much like enabling server metrics, should be enough for the users to understand the tradeoff.

CLAassistant commented 2 months ago

CLA assistant check
All committers have signed the CLA.

francislavoie commented 2 months ago

Cardinality is a Caddy problem, because some users serve thousands of domains via On-Demand TLS.

That said, I'll let @hairyhenderson address this, I'm not in touch with metrics stuff.

hussam-almarzoq commented 2 months ago

Cardinality is a Caddy problem, because some users serve thousands of domains via On-Demand TLS.

That said, I'll let @hairyhenderson address this, I'm not in touch with metrics stuff.

Cardinality limitation is a Prometheus limitation not a Caddy one. It is a consideration that you would have to deal with in Caddy or in any other system that uses Prometheus. I think Caddy should not make it inherently impossible to do this, but have the option there with a sane default – much like having metrics disabled by default. I would love to hear from @hairyhenderson.

hairyhenderson commented 2 months ago

I'm OK in general with this feature, as long as it defaults to false and is well-documented.

@hussam-almarzoq do you think you could issue a companion PR for caddyserver/website that we could merge soon after this?

I have a few review comments - will post those inline.

hairyhenderson commented 2 months ago

In chatting with @francislavoie we realized that only adding the host label when the config is set will break reloads. So before we do that we need to support reinitializing metrics on reload.

Until that's done, I think this change is blocked.

francislavoie commented 2 months ago

I still disagree with cardinality not being an issue. If a user has this enabled and is serving a wildcard domain where the left-most label is mapped to a username in the app (e.g. <username>.example.com) then there can be an infinite amount of domains.

This means the host label's cardinality is infinite. This means more and more RAM will be consumed until Caddy get's OOM-killed. That's not ok. It could be argued we just need to document "don't enable this if the amount of hosts you have is unbound (e.g. wildcard domains or On-Demand TLS)" but I don't think that's a good idea, some users will not properly read the disclaimer and will enable it and have a server that's vulnerable to being OOM-killed by an attacker.

Also yeah, reloading config is still a problem. Caddy needs to support graceful reloads, so if the config changes we need to update the metrics config. AFAIK this isn't possible right now because of global state, and if we changed it to be non-global state then all metrics will be reset on reload (so all counts/averages/etc are lost on reload). I don't know how we'll solve that.

hussam-almarzoq commented 2 months ago

In chatting with @francislavoie we realized that only adding the host label when the config is set will break reloads. So before we do that we need to support reinitializing metrics on reload.

Until that's done, I think this change is blocked.

I think we can avoid the reload issue by having a default value for the label when it is disabled. This way the config can be changed without changing the label count

hussam-almarzoq commented 1 month ago

@hairyhenderson any thoughts?

hairyhenderson commented 1 month ago

I think we can avoid the reload issue by having a default value for the label when it is disabled.

@hussam-almarzoq I'm not excited about that option, because it's now adding a new label to everyone's metrics whether they want it or not.

The key here is:

we need to support reinitializing metrics on reload.

This is necessary for other metrics work as well, not just for this PR.

To respond to what @francislavoie said:

Also yeah, reloading config is still a problem. Caddy needs to support graceful reloads, so if the config changes we need to update the metrics config. AFAIK this isn't possible right now because of global state, and if we changed it to be non-global state then all metrics will be reset on reload (so all counts/averages/etc are lost on reload). I don't know how we'll solve that.

Ultimately I think it's OK for everything to reset on reload. It'll appear as if the process died and was restarted, but that's fine.