Open slopezz opened 1 year ago
I have a few points for decision which will affect how these changes are done. The limitador-operator allows multiple limitador CRs in the same namespace and/or in multiple namespaces.
Hi @Boomatang ,
For simplicity, I would treat the PodMonitor for a given limitador CR, as any other usual resource attached to the limitador CR, like the Service or the Deployment.
That means each Limitador CR will have its own PodMonitor, with its own labelSelectors taken from the limitador CR, the same way it has its own Deployment or Service, keeping things simple.
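As a rough illustration of the "one PodMonitor per limitador CR" idea, here is a minimal Python sketch that builds the manifest as a plain dict. The helper name `pod_monitor_for`, the label scheme, the port name `http`, the `/metrics` path and the Limitador `apiVersion` are all assumptions made for the example, not what limitador-operator actually does.

```python
def pod_monitor_for(limitador_name, namespace, owner_uid):
    """Build a PodMonitor manifest (as a dict) owned by one Limitador CR."""
    # Assumed label scheme: the real selector labels managed by the
    # operator may differ and must match the labels on the limitador pods.
    selector_labels = {
        "app": "limitador",
        "limitador-resource": limitador_name,
    }
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {
            # The CR name is part of the resource name, so two CRs in the
            # same namespace never compete for the same PodMonitor.
            "name": f"limitador-{limitador_name}",
            "namespace": namespace,
            "labels": dict(selector_labels),
            "ownerReferences": [{
                "apiVersion": "limitador.kuadrant.io/v1alpha1",  # assumed
                "kind": "Limitador",
                "name": limitador_name,
                "uid": owner_uid,
                "controller": True,
                "blockOwnerDeletion": True,
            }],
        },
        "spec": {
            "selector": {"matchLabels": dict(selector_labels)},
            # Port name and path are placeholders for the example.
            "podMetricsEndpoints": [{"port": "http", "path": "/metrics"}],
        },
    }
```

Because both the resource name and the selector are derived from the CR name, each PodMonitor has exactly one owner, which avoids the shared-resource question that comes up with the dashboard.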
Regarding the dashboard, I didn't know limitador-operator could manage limitador CR instances in different namespaces or even multiple instances in the same namespace.
In our 3scale SaaS use case, we have a single limitador instance managing rate limits for any given namespace (since envoy uses the k8s Service name), and I guess that having a single instance would be the most usual case.
It is a bit tricky here, since with a single GrafanaDashboard you can view the metrics from any possible limitador instance (you would need to use the limitador instance name as the dashboard selector, aside from namespace maybe).
If you have multiple limitador instances on different namespaces, by default Grafana-operator will create a GrafanaDashboard on every Namespace, since the namespace name is used as a dashboard directory name.
However, what about multiple instances in the same namespace? TBH I don't know the best way to handle this situation, since any object created by the operator gets its own ownerReference entry with the CR name, and if you have multiple CRs creating the same dashboard name (so the same resource) in the same namespace, it will only have the ownerReferences from one of them (which is not super bad, but maybe not ideal).
In our saas-operator case, we have multiple CRDs that can only be installed once in a namespace, but there is a single case where the CRD can have multiple instances in the same namespace. We finally ended up creating the same dashboard for every CR, using the CR name as a suffix for the dashboard name, so we have multiple dashboards showing the same info, which is not an ideal solution actually...
We have another scenario with prometheus-exporter-operator where we permit multiple CRs per namespace, but only a single dashboard per CR, so we end up with a single dashboard per namespace used to watch metrics from any CR. However, the ownerReferences of the dashboard resource are from a single CR. If the CR associated with the dashboard is deleted, thanks to ownerReferences the dashboard resource will be deleted too, but the operator will detect there is a missing dashboard resource for the other possible instances and will create the dashboard again, with a different dashboard ID (the same ID would be needed to keep the same dashboard URL).
So I don't see which could be the best solution here.
Do you know how other cluster operators manage the GrafanaDashboard when you can have multiple instances even in the same namespace?
In 3scale SaaS we have been successfully using limitador for a couple of years together with Redis, to protect all our public endpoints. However:
We would like to update how we manage the limitador application, and use the recommended limitador setup based on limitador-operator, at production-ready grade.
Current limitador-operator (at least the version 0.4.0 that we use):
Desired features:
3scale SaaS specific example
Example of the PodMonitor used in 3scale SaaS production to manage between 3,500 and 5,500 requests/second with 3 limitador pods (selector labels need to coincide with the labels managed right now by limitador-operator):
Possible CR config
Both PodMonitor and GrafanaDashboard should be customizable via the CR, but should use sane default values when they are enabled, so you don't need to provide all the config if you would rather rely on the defaults.
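To make the "enabled with sane defaults, overridable via CR" behaviour concrete, here is a small Python sketch of the defaulting logic. The section names (`podMonitor`, `grafanaDashboard`) and the fields (`enabled`, `interval`, `labels`) are hypothetical placeholders chosen for the example, not a proposal for the actual CR schema.

```python
import copy

# Hypothetical defaults; field names and values are illustrative only.
DEFAULTS = {
    "podMonitor": {"enabled": False, "interval": "30s", "labels": {}},
    "grafanaDashboard": {"enabled": False, "labels": {}},
}


def with_defaults(spec):
    """Return the monitoring config with user overrides applied on top of defaults."""
    merged = copy.deepcopy(DEFAULTS)  # never mutate the shared defaults
    for section, overrides in spec.items():
        merged.setdefault(section, {}).update(overrides)
    return merged
```

With this, a user who only writes `{"podMonitor": {"enabled": True}}` still ends up with a complete PodMonitor config, while any field set explicitly in the CR wins over the default.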
The initial dashboard would be provided by us (3scale SRE) and can be embedded into the operator as an asset, as is done with 3scale-operator.
Current Dashboard screenshots including limitador metrics by limitador_namespace (the app being limited), and also pods, resources cpu/mem/net metrics:
PrometheusRules (aka prometheus alerts)
Regarding PrometheusRules (prometheus alerts), my advice is to not embed them into the operator, but to provide in the repo a yaml with an example of possible alerts that can be deployed and tuned by the app administrator if needed.
Example: