apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Servicemonitor for prometheus exporter is referring to cluster port instead of metrics pod port #483

Closed sanjay3290 closed 1 year ago

sanjay3290 commented 1 year ago

I have followed the Solr operator documentation to configure SolrPrometheusExporter; however, after creating the ServiceMonitor, the service endpoint is going inactive. After further troubleshooting, I realized Prometheus is trying to connect to port 80, whereas the metrics server is running on port 8080. Is it possible to pass the port into the ServiceMonitor?

Get "http://x.x.x.x:80/metrics": dial tcp x.x.x.x:80: connect: connection refused

HoustonPutman commented 1 year ago

Can you provide the yaml for the service monitor you created?

sanjay3290 commented 1 year ago

Hello @HoustonPutman, below is the ServiceMonitor YAML; I used the default provided in the Solr Operator documentation.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: solr-metrics
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      solr-prometheus-exporter: solr-dev-prom-exporter
  namespaceSelector:
    matchNames:

HoustonPutman commented 1 year ago

So you are using a serviceMonitor, and the Solr metrics service is listening on port 80, or at least it should be... The pod is listening on port 8080, but the service forwards that 80 -> 8080 when sending the request to the pod.
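
For illustration, the metrics Service that the operator creates maps the ports roughly like this (a sketch; the name here is a placeholder following the <name>-solr-metrics pattern):

apiVersion: v1
kind: Service
metadata:
  name: solr-dev-prom-exporter-solr-metrics   # placeholder name
  labels:
    solr-prometheus-exporter: solr-dev-prom-exporter
spec:
  ports:
    - port: 80          # the port the Service listens on (what the ServiceMonitor hits by default)
      targetPort: 8080  # the port the exporter pod actually listens on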

I have almost the exact same thing working correctly.

What version of the prometheus stack are you running? Also can you provide information on your Kube cluster? (version, vendor, etc) I have a feeling there's an issue with your networking.

sanjay3290 commented 1 year ago

You are right, that's how it's supposed to work. However, the service endpoint in the Prometheus targets is referencing http://podIP:80/metrics and, because of that, the connection is getting refused. My other default service endpoints for Prometheus are working as expected.

Prometheus: chart prometheus-15.16.1, version 2.39.1
Kubernetes: AWS EKS, version 1.22

HoustonPutman commented 1 year ago

Are you sure you don't have a podMonitor defined as well?

Looks like there might be a bug in the Prometheus operator? In the meantime, you can use targetPort instead to set 8080. Here are the available options under endpoints.
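
For example, building on the ServiceMonitor above, the endpoints section with targetPort would look roughly like this (an untested sketch):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: solr-metrics
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      solr-prometheus-exporter: solr-dev-prom-exporter
  endpoints:
    - targetPort: 8080   # scrape the exporter pod port directly, instead of the Service port (80)
      path: /metrics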

koboltmarky commented 1 year ago

We have the same problem here. We are using solr-operator 0.6 and Prometheus 2.39.1, hosted on GKE version 1.21. We aren't using the Prometheus operator. I deployed the Solr Prometheus exporter with the following snippet:

apiVersion: solr.apache.org/v1beta1
kind: SolrPrometheusExporter
metadata:
  name: solr-prom-exporter
spec:
  customKubeOptions:
    podOptions:
      resources:
        requests:
          cpu: 300m
          memory: 900Mi
  solrReference:
    basicAuthSecret: solr-cloud-k8s-oper-secret 
    cloud:
      name: "apache-solr"
  numThreads: 6

As you can see in the screenshot, Prometheus tries to connect to the pod on port 80, which is the wrong port.

Screenshot from 2022-12-01 16-17-22

Our workaround is to add Prometheus scraping annotations to the exporter pod:

spec:
  customKubeOptions:
    podOptions:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/path: /metrics
        prometheus.io/scrape: "true"
        prometheus.io/scheme: http

HoustonPutman commented 1 year ago

In that screenshot, is the 10.110.6.70 IP address the service ClusterIP or the pod IP? If it's the service's, then there is something wrong with Kubernetes. If it's the pod's, then Prometheus shouldn't be trying to contact the pod at all; it should be contacting the service IP...

sanjay3290 commented 1 year ago

We have the same problem here. [...] Our workaround is to add Prometheus scraping annotations to the exporter pod: [...]

Even after adding the pod annotations, Prometheus is still looking at port 80 on the pod IP in my case. Something is seriously wrong with this. Below is my exporter config:

apiVersion: solr.apache.org/v1beta1
kind: SolrPrometheusExporter
metadata:
  name: solr-prom-exporter
spec:
  customKubeOptions:
    podOptions:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/path: /metrics
        prometheus.io/scrape: "true"
        prometheus.io/scheme: http
      resources:
        requests:
          cpu: 300m
          memory: 900Mi
  solrReference:
    cloud:
      name: "eks"
  numThreads: 6
Screenshot 2022-12-01 at 3 54 04 PM

koboltmarky commented 1 year ago

In that screenshot, is the 10.110.6.70 IP address the service ClusterIP or the pod IP? If it's the service's, then there is something wrong with Kubernetes. If it's the pod's, then Prometheus shouldn't be trying to contact the pod at all; it should be contacting the service IP...

It is the pod IP.

koboltmarky commented 1 year ago

Even after adding the pod annotations, Prometheus is still looking at port 80 on the pod IP in my case. Something is seriously wrong with this. Below is my exporter config:

The old failed target will still exist, but there should be a new target which should work.

HoustonPutman commented 1 year ago

Can you share your prometheus scraping config? This seems to be a prometheus issue...

tiimbz commented 1 year ago

We are having the same issue. The prometheus.io/port annotation is set to port 80, which doesn't correspond with the port of the pod. This causes Prometheus to fail to scrape the service endpoint.
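
In other words, the metrics Service carries an annotation block roughly like the following (prometheus.io/port: "80" is the value described above; prometheus.io/scrape is inferred from the fact that the target shows up at all), and the kubernetes-service-endpoints job then combines that port with the pod IP it discovers behind the Service:

metadata:
  annotations:
    prometheus.io/scrape: "true"   # inferred; without it the target would not appear
    prometheus.io/port: "80"       # set to the Service port rather than the pod port 8080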

We've also bypassed the problem by enabling scraping of the pods directly:

  customKubeOptions:
    podOptions:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/path: /metrics
        prometheus.io/scrape: "true"
        prometheus.io/scheme: http

The Prometheus scraping config we use is the default kubernetes-service-endpoints job from the default config.

tiimbz commented 1 year ago

Looking at the code, it looks like the prometheus.io/port value is set from ExtSolrMetricsPort, not SolrMetricsPort, which would have fixed the problem.

Any attempt to overwrite this using custom serviceAnnotations does not work, as custom annotations can only supplement the default ones, not overwrite them: https://github.com/apache/solr-operator/blob/main/controllers/util/prometheus_exporter_util.go#L400
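
For reference, an override attempt along these lines has no effect, because the operator-provided annotation is kept (the serviceOptions field name is an assumption on my part, mirroring the podOptions pattern used elsewhere in this thread):

spec:
  customKubeOptions:
    serviceOptions:                  # field name assumed, by analogy with podOptions
      annotations:
        prometheus.io/port: "8080"   # merged in, but the default "80" is not overwritten, so this does not help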

samuelverstraete commented 1 year ago

We have exactly the same issue.

coolstim commented 1 year ago

We are having the same issue. [...] We've also bypassed the problem by enabling scraping of the pods directly: [...]

Indeed, this is a valid workaround

HoustonPutman commented 1 year ago

So it seems like everyone is using kubernetes-service-endpoints; could you try using kubernetes-services and see if the problem is fixed?

I think the issue is that this feature was designed with kubernetes-services usage in mind; it looks like it should work with kubernetes-service-endpoints as well, but it breaks in this way. I don't think there's a way we can get both to work at the same time, unless we remove the prometheus.io/port annotation altogether.

I will try to test this locally but it might be difficult. I'm happy to create a test docker image for anyone else to try out (based on v0.6.0) and see if it fixes things for them.

coolstim commented 1 year ago

Situation before Solr:

Situation after Solr: We installed the solr-exporter using

apiVersion: solr.apache.org/v1beta1
kind: SolrPrometheusExporter
metadata:
  name: solr-prom-exporter
spec:
  customKubeOptions:
    podOptions:
      resources:
        requests:
          cpu: 300m
          memory: 900Mi
  solrReference:
    cloud:
      name: "eks"
  numThreads: 6

No metrics are scraped from Solr, as by default it seems Prometheus is using the endpoints. Default Prometheus configuration:

global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s
remote_write:
- queue_config:
    capacity: 2500
    max_samples_per_send: 1000
    max_shards: 200
  sigv4:
    region: east-us-1
  url: https://aps-workspaces.east-us-1.amazonaws.com/workspaces/XXX/api/v1/remote_write
rule_files:
- /etc/config/recording_rules.yml
- /etc/config/alerting_rules.yml
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
  static_configs:
  - targets:
    - localhost:9090
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: kubernetes-apiservers
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: default;kubernetes;https
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_service_name
    - __meta_kubernetes_endpoint_port_name
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: kubernetes-nodes
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - replacement: kubernetes.default.svc:443
    target_label: __address__
  - regex: (.+)
    replacement: /api/v1/nodes/$1/proxy/metrics
    source_labels:
    - __meta_kubernetes_node_name
    target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: kubernetes-nodes-cadvisor
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - replacement: kubernetes.default.svc:443
    target_label: __address__
  - regex: (.+)
    replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
    source_labels:
    - __meta_kubernetes_node_name
    target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
- honor_labels: true
  job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  - action: drop
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
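  # The next rule combines the endpoint address (the exporter pod IP) with the Service's
  # prometheus.io/port annotation ("80"), which appears to be how the failing podIP:80
  # target is produced.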
  - action: replace
    regex: (.+?)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: node
- honor_labels: true
  job_name: kubernetes-service-endpoints-slow
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: (.+?)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: node
  scrape_interval: 5m
  scrape_timeout: 30s
- honor_labels: true
  job_name: prometheus-pushgateway
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - action: keep
    regex: pushgateway
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_probe
- honor_labels: true
  job_name: kubernetes-services
  kubernetes_sd_configs:
  - role: service
  metrics_path: /probe
  params:
    module:
    - http_2xx
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_probe
  - source_labels:
    - __address__
    target_label: __param_target
  - replacement: blackbox
    target_label: __address__
  - source_labels:
    - __param_target
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
- honor_labels: true
  job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  - action: drop
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
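  # The next two rules build the target from the pod IP plus the pod's prometheus.io/port
  # annotation, which is why the prometheus.io/port: "8080" pod-annotation workaround works.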
  - action: replace
    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
    replacement: '[$2]:$1'
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    - __meta_kubernetes_pod_ip
    target_label: __address__
  - action: replace
    regex: (\d+);((([0-9]+?)(\.|$)){4})
    replacement: $2:$1
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    - __meta_kubernetes_pod_ip
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - action: drop
    regex: Pending|Succeeded|Failed|Completed
    source_labels:
    - __meta_kubernetes_pod_phase
- honor_labels: true
  job_name: kubernetes-pods-slow
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
    replacement: '[$2]:$1'
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    - __meta_kubernetes_pod_ip
    target_label: __address__
  - action: replace
    regex: (\d+);((([0-9]+?)(\.|$)){4})
    replacement: $2:$1
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    - __meta_kubernetes_pod_ip
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - action: drop
    regex: Pending|Succeeded|Failed|Completed
    source_labels:
    - __meta_kubernetes_pod_phase
  scrape_interval: 5m
  scrape_timeout: 30s

HoustonPutman commented 1 year ago

I have a patch that I think should work: https://github.com/apache/solr-operator/pull/539. Would someone be willing to try out this fix in their cluster?

Steps to try it:

  1. Check out the v0.6.0 release
  2. Copy this one-line change
  3. Run make docker-build, then push the image to a Docker registry somewhere
  4. Update your Solr Operator to use this new image
  5. Delete the prometheus exporter service just to make sure the annotation is removed: kubectl delete service <name>-solr-metrics
  6. Wait for it to come back and see if Prometheus is happier!

If it does work, we can get this into the v0.7.0 release, which should be coming soon!

coolstim commented 1 year ago

It seems to be working

HoustonPutman commented 1 year ago

Cool, I will go ahead and merge then!