bitnami / charts

Bitnami Helm Charts
https://bitnami.com

[bitnami/thanos] Guidance on best way to scale sharded and autoscaled Thanos Store with persistence enabled #29790

Open kaiohenricunha opened 2 weeks ago

kaiohenricunha commented 2 weeks ago

Name and Version

bitnami/thanos 13.4.1

What architecture are you using?

None

What steps will reproduce the bug?

On EKS 1.29,

  1. Deploy a Thanos Store StatefulSet that utilizes PVCs for filesystem storage: https://github.com/bitnami/charts/blob/thanos/13.4.1/bitnami/thanos/values.yaml#L2939
  2. Set up autoscaling: https://github.com/bitnami/charts/blob/thanos/13.4.1/bitnami/thanos/values.yaml#L2987
  3. Set up sharding: https://github.com/bitnami/charts/blob/thanos/13.4.1/bitnami/thanos/values.yaml#L3217
  4. Generate some load on the Thanos Store by querying large long-term metrics.
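
For readers who do not want to follow the three links, the steps above can be sketched as a values fragment along these lines. This is an illustration, not the reporter's actual configuration: the key names are assumed to match the `storegateway` section of the chart's `values.yaml` at the linked lines, and the storage class name, volume size, replica counts, CPU target, and shard count are all hypothetical. Check them against the chart version you run.

```yaml
# Hypothetical values fragment for bitnami/thanos; verify key names against
# the chart's values.yaml before use.
storegateway:
  enabled: true

  # Step 1: StatefulSet pods with one PVC each for the Store's local data.
  persistence:
    enabled: true
    storageClass: gp3      # assumed name of an EBS gp3 StorageClass
    size: 50Gi             # illustrative size only

  # Step 2: horizontal autoscaling of the Store pods.
  autoscaling:
    enabled: true
    minReplicas: 2         # illustrative
    maxReplicas: 6         # illustrative
    targetCPU: 70          # illustrative CPU utilization target, in percent

  # Step 3: sharding, here by hashing blocks across a fixed number of shards.
  sharded:
    enabled: true
    hashPartitioning:
      shards: 3            # illustrative shard count
```

With sharding enabled, the chart is expected to render one StatefulSet per shard, so the persistence and autoscaling settings would apply to each shard separately.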

What is the expected behavior?

I am looking for guidance on the recommended Thanos Store configuration for such a setup that provides:

  1. High Performance: Optimized for fast responses.
  2. High Availability: Allowing the Thanos Store StatefulSet pods to be scheduled on any node or AZ without PVC availability issues.

What do you see instead?

Recently, I tried using Amazon EFS for its high availability, since it can be accessed from any node/AZ, but Thanos pods with persistence enabled took far too long to start. I went back to EBS GP3 after noticing that Thanos Store had been "fetching metadata" at startup for more than 30 minutes while consuming little CPU and memory. In the EFS console, I noticed that throughput utilization was close to 100%.

With the GP3 StorageClass, I run into a different issue: when traffic peaks, Thanos Store scales out, but the new pods cannot be scheduled because their PVCs do not exist on the target node/AZ.

Additional information

carrodher commented 2 weeks ago

Hi, the issue may not be directly related to the Bitnami container image/Helm chart, but rather to how the application is being used or configured in your specific environment, or to a particular scenario that is not easy to reproduce on our side.

If you think that's not the case and want to contribute a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

If you have any questions about the application, customizing its content, or technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.

github-actions[bot] commented 19 hours ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.