@michaeljmarshall Is it possible to switch the default integration to be PodMonitor if kube-prometheus-stack is deployed? In other words, can we remove the scrape configs from the values file?
@cdbartholomew - I'll do some quick manual testing to compare metrics before and after an upgrade. I am pretty confident we can make this change and remove the old config. The one breaking change is that the current design scrapes any pod with the prometheus.io/scrape: "true" annotation, while the new pod monitor limits scraping to the pulsar pods deployed by this helm chart. I think that is a better default, so we'll just need to do a minor version bump instead of a patch, and we'll call out the change in the release notes.
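For context, the annotation-based convention looks like this on a pod template (the port value here is illustrative, each component sets its own):

```yaml
# Sketch of the annotation-driven opt-in the current scrape config keys on.
metadata:
  annotations:
    prometheus.io/scrape: "true"   # opt the pod in to scraping
    prometheus.io/port: "8080"     # illustrative port number
```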
Is there a reason why PodMonitor was chosen over ServiceMonitor?
It's worth noting in the readme what prometheusSpec settings need to be enabled to discover monitors in other namespaces.
> Is there a reason why PodMonitor was chosen over ServiceMonitor?

There is no one service that points to all of these components, and I don't think it would make sense to create a service just to create a service monitor. That being said, in my testing today, I came across an issue with the proxy pod. It can have multiple containers that need to be scraped for metrics. However, scraping based on a single port name (http, in this PR at this time) will only target a single container. The solution needs to target every container, so this PR won't work as it is currently written.
Note that the current solution actually does not scrape metrics from those pods correctly at this time. We only get metrics from the proxy container, but not burnell or the websocket proxy.
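One way to cover every container is to name each container's metrics port distinctly and list one endpoint per port. A minimal sketch, with assumed (not final) labels and port names:

```yaml
# Hypothetical PodMonitor with one endpoint per container port, so all
# containers in the proxy pod (proxy, burnell, websocket) get scraped.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pulsar-proxy
spec:
  selector:
    matchLabels:
      component: proxy       # assumed label; use the chart's actual pod labels
  podMetricsEndpoints:
    - port: http             # proxy container's metrics port
    - port: http-burnell     # assumed name for the burnell container's port
    - port: http-websocket   # assumed name for the websocket proxy's port
```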
> It's worth noting in the readme what prometheusSpec settings need to be enabled to discover monitors in other namespaces.
Definitely, I'll document the final solution before this gets merged.
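For anyone following along with kube-prometheus-stack, the relevant prometheusSpec settings are along these lines (a sketch of the values, not the final documentation):

```yaml
prometheus:
  prometheusSpec:
    # Don't restrict discovery to monitors carrying this release's Helm labels.
    podMonitorSelectorNilUsesHelmValues: false
    # Empty selector = discover PodMonitors in all namespaces.
    podMonitorNamespaceSelector: {}
```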
@michaeljmarshall you wouldn't need to have a single service; you can just match services based on a label like app=pulsar or something, as long as the ports are named consistently between the services.
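As a sketch, assuming the services shared an app=pulsar label and a consistently named http port, that would look like:

```yaml
# Hypothetical ServiceMonitor selecting many pulsar services by label.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pulsar
spec:
  selector:
    matchLabels:
      app: pulsar      # assumed label shared by the pulsar services
  endpoints:
    - port: http       # assumes the port name is consistent across services
```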
I looked at @pgier's suggestion, and it would work to select the many pulsar services using the labels, but we'd still need a PodMonitor to cover the pulsar heartbeat and the bookkeeper autorecovery pods. Given the way Prometheus discovers "endpoints" and "pods", it seems less optimal to have two scrape definitions when one is sufficient. It is also simpler to have a single resource that configures monitoring, so I am going to push forward with the PodMonitor.
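Roughly, the single resource looks like this (the labels are assumptions; the actual template lives in the chart). Because a PodMonitor selects pods directly, components without a Service, like the heartbeat and autorecovery pods, are still discovered:

```yaml
# Sketch of one PodMonitor covering all chart-deployed components.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pulsar
spec:
  selector:
    matchLabels:
      app: pulsar      # assumed label present on all chart pods
  podMetricsEndpoints:
    - port: http       # the standardized metrics port name
```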
For reference, here is the service discovery page for one ServiceMonitor and one PodMonitor:

[screenshot: Prometheus service discovery page showing one ServiceMonitor and one PodMonitor]

Here is the service discovery page for one PodMonitor:

[screenshot: Prometheus service discovery page showing one PodMonitor]
The main reason there are two service monitors and two pod monitors in the first and second screenshots, respectively, is that we have the burnell port being monitored. When that one isn't monitored, there is only one monitor.
It's also interesting to point out that the main difference between a ServiceMonitor and a PodMonitor comes in the kubernetes_sd_configs part of the scrape definition, documented here: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config. The PodMonitor uses "role: pod" and the ServiceMonitor uses "role: endpoints".
Merging now with the plan to release the helm chart in a day or so. If anyone has any concerns, please reach out.
Motivation
The current Prometheus scrape config is defined by passing it to the Prometheus sub-chart. As a result, users with their own Prometheus instance must manually configure metrics scraping. In order to accommodate these users, I propose adding a single pod monitor that targets all of our metrics endpoints. In order to make this change, I needed to update some port names to http. I don't believe that will be a problem for most use cases, but it could break any deployment that is already relying on the old port names.

Changes
- Added a PodMonitor to target 9 components. These are the components with prometheus.io/scrape: "true", so the end result will be the same components being monitored. The relabel configs are similar to, but not the same as, the scrape config currently used: https://github.com/datastax/pulsar-helm-chart/blob/a93a7cf4beb4a83290f4a19baa57b80b64d0392e/helm-chart-sources/pulsar/values.yaml#L2085-L2108
- Added an http port definition for autorecovery, bookkeeper, zookeeper, and zookeepernp.
- Replaced function and functiontls with http and https, respectively, on the function pod. These port names are not used anywhere in the chart, but they could be used by some users. This breaking change should be documented in the release notes (a sketch of the rename follows below).
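As a sketch of that rename on the function pod's container spec (the port numbers are illustrative):

```yaml
# Before (illustrative):
ports:
  - name: function
    containerPort: 6750
  - name: functiontls
    containerPort: 6751

# After:
ports:
  - name: http
    containerPort: 6750
  - name: https
    containerPort: 6751
```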