bitnami / charts

Bitnami Helm Charts
https://bitnami.com

Update thanos 15.7.15 to 15.7.16, sidecars no longer show up on thanos query stores #29310

Closed: Bah27 closed this issue 1 week ago

Bah27 commented 1 month ago

Name and Version

thanos/15.7.16

What architecture are you using?

None

What steps will reproduce the bug?

Update charts thanos 15.7.15 to 15.7.16

Are you using any custom parameters or values?

existingObjstoreSecret: thanos-objstore
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: "monitoring"   
query:
  enable: true
  dnsDiscovery:
    enable: false
    sidecarsService: prometheus-operated
    sidecarsNamespace: monitoring
  grpc:
    client:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
            clientAuthEnabled: true
  stores: 
    - "@domain1:443"
    - "@domain2:443"
    - "@domain2:443"
  extraFlags:
     - --grpc-client-tls-skip-verify
     - --store.response-timeout=0  
  replicaLabel: prometheus_replica
  resources:
    requests:
      cpu: 150m 
      memory: 150Mi 
    limits:
      #cpu: 50m
      #memory: 200Mi 

  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

queryFrontend:
  enabled: true
  config: |-
    type: IN-MEMORY
    config:
      max_size: 1GB
      max_size_items: 0
      validity: 0s

  resources:
    requests:
      cpu: 10m
      memory: 100Mi
    limits:
      #cpu: 100m
      memory: 100Mi

  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

  ingress:
    enabled: false

compactor:
  enabled: true
  retentionResolutionRaw: 14d
  retentionResolution5m: 14d
  retentionResolution1h: 20d
  consistencyDelay: 30m
  extraFlags:
  - --delete-delay=2h

  persistence:
    enabled: false

  resources:
    requests:
      cpu: 200m
      memory: 200Mi
    limits:
      #cpu: 100m
      #memory: 200Mi

  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

receive:
  enabled: false

bucketweb:
  enabled: false

storegateway:
  enabled: true
  grpc:
    server:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
            clientAuthEnabled: true 

  persistence:
    enabled: false

  resources:
    requests:
      cpu: 100m
      memory: 100Mi 
    limits:
      #cpu: 100m
      #memory: 100Mi
  nodeSelector:
    k8s.scaleway.com/app: monitoring

  tolerations:
  - key: "k8s.scaleway.com/nodepool"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

NB: @domain1, @domain2, etc. are the (redacted) domain names of the individual sidecars.

What do you see instead?

ts=2024-09-09T12:28:26.896367373Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain1:443
ts=2024-09-09T12:28:26.896680946Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain2:443
ts=2024-09-09T12:28:26.896714506Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing: dial tcp 10.38.88.81:10901: connect: connection refused\"" address=10.38.88.81:10901
juan131 commented 1 month ago

Hi @Bah27

We updated the Thanos version to 0.36.0 in that release, see https://github.com/bitnami/charts/pull/28607

It seems that a fix for a regression in the Query TLS configuration was included in version 0.36.1:

Could you try that version? It's already available in the latest version of the Bitnami chart.
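In case bumping the chart itself is not possible right away, one purely illustrative workaround would be to pin the container image to 0.36.1 through the values. This is an untested sketch and the tag is an assumption, so please verify which tags are published for the bitnami/thanos image (upgrading the chart remains the preferred option):

image:
  registry: docker.io
  repository: bitnami/thanos
  # Assumed tag; confirm it exists before relying on it
  tag: 0.36.1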

github-actions[bot] commented 1 month ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

Bah27 commented 1 month ago

Hello @juan131

I apologize for the lack of follow-up; I was on vacation. I will test version 0.36.1, as recommended, to check if the issue related to the TLS configuration is resolved with the fix mentioned in this pull request.

In the meantime, I have observed several errors in the logs, including:

ts=2024-09-30T08:30:06.159166083Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain1:443
ts=2024-09-30T08:30:06.159645638Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain2:443
ts=2024-09-30T08:30:11.162302928Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=@domain3:443
ts=2024-09-30T08:30:11.162285816Z caller=endpointset.go:471 level=warn component=endpointset msg="update of endpoint failed" err="getting metadata: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\"" address=100.64.41.170:10901

These logs show errors related to timeouts and TLS authentication issues on the endpoints listed above.

Thank you for your patience!

juan131 commented 1 month ago

Thanks @Bah27! Please let us know your findings once you try it with the latest chart version.

Bah27 commented 1 month ago

Thank you @juan131! I proceeded with the tests using the latest chart version, but unfortunately, I am still encountering the same errors.

juan131 commented 4 weeks ago

Hi @Bah27

Sorry for the delay in my response. I've been reviewing the values you shared, paying special attention to the block below:

query:
  (...)
  grpc:
    client:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
            clientAuthEnabled: true
  stores: 
    - "@domain1:443"
    - "@domain2:443"
    - "@domain2:443"
  extraFlags:
     - --grpc-client-tls-skip-verify
     - --store.response-timeout=0 

It seems you enabled TLS for gRPC on the client side, but you didn't do the same on the server side (query.grpc.server.tls.enabled is false by default and you didn't modify it). Also, you're setting the property query.grpc.client.tls.clientAuthEnabled, which doesn't exist; I guess you meant query.grpc.server.tls.clientAuthEnabled, right? See:

Also, regarding this block:

query:
  dnsDiscovery:
    enable: false
    sidecarsService: prometheus-operated
    sidecarsNamespace: monitoring

Please note that query.dnsDiscovery.sidecarsService and query.dnsDiscovery.sidecarsNamespace will be ignored if query.dnsDiscovery.enabled is false, see:

Bah27 commented 4 weeks ago

Hi @juan131,

Thanks for your reply and for taking the time to carefully review the configuration details.

TLS for gRPC server: You're absolutely right. I had enabled TLS on the client side but missed doing so on the server side. I'll correct this by adding query.grpc.server.tls.enabled: true. And yes, I mistakenly used clientAuthEnabled in the wrong place; what I meant was query.grpc.server.tls.clientAuthEnabled, as sketched below.
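For reference, the corrected block I intend to try looks roughly like the following sketch (untested; it reuses the existing thanos-cert secret and key mapping, mirroring the storegateway grpc.server block from my values):

query:
  grpc:
    server:
      tls:
        enabled: true
        # clientAuthEnabled belongs at the server tls level, not under the client keyMapping
        clientAuthEnabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key
    client:
      tls:
        enabled: true
        existingSecret:
          name: thanos-cert
          keyMapping:
            ca-cert: ca.crt
            tls-cert: tls.crt
            tls-key: tls.key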

Thanks for pointing that out—it really helped me understand the mistake. I’ll adjust the configuration as you suggested.

dnsDiscovery: Regarding DNS discovery, good catch! I didn't realize that query.dnsDiscovery.sidecarsService and sidecarsNamespace would be ignored if query.dnsDiscovery.enabled is set to false. I'll either enable DNS discovery or remove those parameters if they're not needed; see the sketches below.
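To make that concrete, the two options look roughly like the untested sketches below (note the parameter is enabled, not enable; the service and namespace values are kept from my current setup):

# Option A: enable DNS discovery of the in-cluster sidecars
query:
  dnsDiscovery:
    enabled: true
    sidecarsService: prometheus-operated
    sidecarsNamespace: monitoring

# Option B: keep DNS discovery disabled and the explicit stores list,
# and drop the sidecarsService/sidecarsNamespace keys since they are ignored anyway
query:
  dnsDiscovery:
    enabled: false
  stores:
    - "@domain1:443"
    - "@domain2:443"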

Thanks again for the clarifications and for linking the documentation—this was super helpful!

I’ll update everything and run some tests.

github-actions[bot] commented 1 week ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 1 week ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.