bitnami / charts

Bitnami Helm Charts
https://bitnami.com

[bitnami/thanos] thanos helm chart renders strange hostname for sidecarsService dnsDiscovery #24527

Open danfinn opened 3 months ago

danfinn commented 3 months ago

Name and Version

bitnami/thanos 12.23.0

What architecture are you using?

amd64

What steps will reproduce the bug?

I'm installing the helm chart like so:

helm upgrade --install thanos bitnami/thanos --values ~/git/thanos/thanos_values.yml

with the values below. As far as I can tell, the chart is rendering the DNS entry for my sidecar service incorrectly; according to helm template, the query args come out looking like this:

          args:
            - query
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --query.replica-label=replica
            - --endpoint=dnssrv+_grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
            - --endpoint=dnssrv+_grpc._tcp.thanos-storegateway.prometheus.svc.cluster.local

I don't know where that _grpc._tcp. prefix is coming from or why, but it breaks DNS resolution and I get the following errors from the query pod:

ts=2024-03-18T20:45:54.559924978Z caller=resolver.go:99 level=error msg="failed to lookup SRV records" host=_grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local err="no such host"

Without the _grpc._tcp. prefix, DNS resolution works as expected:

nslookup prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
Server:     10.0.0.10
Address:    10.0.0.10:53

Name:   prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
Address: 10.0.75.135

Once the prefix is added, though, resolution fails:

nslookup _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
Server:     10.0.0.10
Address:    10.0.0.10:53

** server can't find _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local: NXDOMAIN

** server can't find _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local: NXDOMAIN
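(Worth noting: _grpc._tcp.<service> is an SRV name, so a plain nslookup, which performs an A/AAAA lookup, returns NXDOMAIN even when the SRV record does exist. To check whether the record is actually published, an explicit SRV query is needed; a sketch below, assuming a debug pod whose image ships dig rather than the busybox nslookup:)

# hypothetical check from a debug pod; requires an image that includes dig
dig +short SRV _grpc._tcp.prometheus-thanos-sidecar-server.prometheus.svc.cluster.local
# Kubernetes only publishes this SRV record when the Service has a port literally named "grpc"
# (record form: _<port-name>._<protocol>.<service>.<namespace>.svc.<cluster-domain>)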

Are you using any custom parameters or values?

query:
  nodeSelector:
    kubernetes.io/os: linux
  dnsDiscovery:
    sidecarsService: "prometheus-thanos-sidecar-server"
    sidecarsNamespace: "prometheus"

queryFrontend:
  nodeSelector:
    kubernetes.io/os: linux

bucketweb:
  nodeSelector:
    kubernetes.io/os: linux

compactor:
  nodeSelector:
    kubernetes.io/os: linux
  enabled: true

storegateway:
  nodeSelector:
    kubernetes.io/os: linux
  enabled: true

ruler:
  nodeSelector:
    kubernetes.io/os: linux

receive:
  nodeSelector:
    kubernetes.io/os: linux

receiveDistributor:
  nodeSelector:
    kubernetes.io/os: linux

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

objstoreConfig: |-
  type: AZURE
  config:
      storage_account: "storage_account_name"
      storage_account_key: "storage_account_key"
      container: "thanos"

What is the expected behavior?

I'd expect the chart to render a DNS name for the sidecar service that actually resolves; I'm not sure why it adds that strange-looking prefix onto the service's DNS entry.

What do you see instead?

see above

Additional information

No response

danfinn commented 3 months ago

This looks like it might be related to https://github.com/thanos-io/thanos/issues/5366, but there is no information there on what the fix was, and I'm not sure which pod labels they are talking about.

danfinn commented 3 months ago

You can see here where the prefix is added by the helm chart: https://github.com/bitnami/charts/blob/aeef4fa4e8b68e140157e9cc30474dcc18641afe/bitnami/thanos/templates/query/deployment.yaml#L120
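For reference, the endpoint flag appears to be rendered roughly like this (a paraphrased sketch of the linked template, not the exact chart source; the hard-coded _grpc._tcp. SRV prefix is the part in question):

{{- if and .Values.query.dnsDiscovery.enabled .Values.query.dnsDiscovery.sidecarsService .Values.query.dnsDiscovery.sidecarsNamespace }}
- --endpoint=dnssrv+_grpc._tcp.{{ .Values.query.dnsDiscovery.sidecarsService }}.{{ .Values.query.dnsDiscovery.sidecarsNamespace }}.svc.{{ .Values.clusterDomain }}
{{- end }}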

illyul commented 3 months ago

I got the same issue with Thanos and CoreDNS. Our k8s cluster uses both kube-dns and CoreDNS. With kube-dns everything is fine, but CoreDNS can't resolve the A record.

[screenshot attached: 2024-03-21]

Workaround: follow these docs:

https://github.com/thanos-io/thanos/blob/main/docs/service-discovery.md#dns-service-discovery

I have changed dnssrv+_grpc._tcp to dns+_grpc._tcp:port.
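For context, per the Thanos service-discovery docs linked above, dnssrv+ does an SRV lookup and takes the port from the record itself, while dns+ does a plain A/AAAA lookup and therefore needs an explicit port. A sketch of the resulting flag, assuming the sidecar's gRPC port is the default 10901:

--endpoint=dns+prometheus-thanos-sidecar-server.prometheus.svc.cluster.local:10901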

Postscript: this bug was reported and fixed back in 2021: https://github.com/thanos-io/thanos/pull/3672

FraPazGal commented 3 months ago

Hello @danfinn, if I'm understanding correctly, the issue comes from the endpoint set for your external prometheus service, right? Looking at the SRV records, dnssrv+_grpc._tcp.service_url will look for a service port named grpc. Could it be that the prometheus service port you are connecting to is named differently?
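For context, Kubernetes only publishes an SRV record of the form _grpc._tcp.<service>.<namespace>.svc.<cluster-domain> when the Service actually exposes a port named grpc. A minimal sketch of what the sidecar Service would need to look like for that lookup to succeed (the selector and port numbers here are illustrative assumptions, not your actual manifest):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-thanos-sidecar-server
  namespace: prometheus
spec:
  clusterIP: None        # headless is typical for Thanos sidecar discovery, so each pod becomes an SRV target
  selector:
    app.kubernetes.io/name: prometheus   # illustrative; must match your prometheus pods
  ports:
    - name: grpc         # must be literally "grpc" for _grpc._tcp.<svc> SRV lookups to resolve
      protocol: TCP
      port: 10901
      targetPort: 10901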

Besides that, could you also try @illyul's workaround? In that case you'll be setting the port number directly instead of relying on the port name from the service.

It seems to me we should evaluate adding a parameter to define the sidecar's portName or portNumber, depending on whether we end up using dnssrv+ or dns+.
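In the meantime, a possible values-level workaround would be to skip the SRV-based discovery and pass a dns+ endpoint with an explicit port yourself. A sketch along these lines, assuming query.dnsDiscovery.enabled and query.extraFlags behave as in the current chart and that the sidecar listens on the default gRPC port 10901:

query:
  dnsDiscovery:
    enabled: false    # assumption: stops the chart from rendering the dnssrv+_grpc._tcp endpoint
  extraFlags:
    - --endpoint=dns+prometheus-thanos-sidecar-server.prometheus.svc.cluster.local:10901   # plain A-record lookup with an explicit port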

illyul commented 3 months ago

Besides that, could you also try @illyul's workaround? In that case you'll be setting the port number directly instead of relying on the port name from the service.

I configured dns+_grpc._tcp:port as a workaround and it worked for me.

github-actions[bot] commented 2 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

FraPazGal commented 2 months ago

Hello @illyul, @danfinn, I have created an internal task for our dev team to look into this and provide a permanent solution. I'll put this issue on-hold and we'll update it as soon as there is any news.