Investigate HAProxy Tuning - Githubissues

BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)

Apache License 2.0


Investigate HAProxy Tuning #4394

Closed StevenBarre closed 11 months ago

StevenBarre commented 1 year ago

Describe the issue

HAProxy on Silver currently uses a lot of memory and drives a high load average on the Infra nodes. There are some tuning options available.

HAProxy reloads its config as often as every 5 seconds when routes change. Route changes can be caused by pods moving between Ready and Non-Ready, or by new pods being scaled up.

Each time HAProxy reloads, a new process is spawned and the old process stays alive until every connection it is handling has closed. With websockets and other long-running connections that use keepalive, those connections may stay open for weeks at a time, so old processes accumulate.
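A rough back-of-the-envelope model of why this adds up, as a sketch only. It assumes the worst case, where a reload happens at every interval and every old process holds at least one connection that lasts the full maximum lifetime. The function name and the 15s value are illustrative and not taken from the cluster config; real process counts will be lower because reloads only happen when routes change.

```python
import math

HOUR = 3600
WEEK = 7 * 24 * HOUR


def max_lingering_processes(reload_interval_s: float, max_conn_lifetime_s: float) -> int:
    """Worst-case number of old HAProxy processes alive at the same time.

    Assumption: one reload per interval, and each old process keeps at
    least one connection open for the full maximum connection lifetime.
    """
    if reload_interval_s <= 0:
        raise ValueError("reload interval must be positive")
    return math.ceil(max_conn_lifetime_s / reload_interval_s)


# Current behaviour: 5s reloads, connections that can live for a week.
print(max_lingering_processes(5, WEEK))       # 120960
# Capping connection lifetime at 6h (the proposed starting point).
print(max_lingering_processes(5, 6 * HOUR))   # 4320
# Also raising the reload interval to a hypothetical 15s.
print(max_lingering_processes(15, 6 * HOUR))  # 1440
```

The bound is simply lifetime divided by reload interval, which is why both proposed settings help: one shrinks the numerator, the other grows the denominator.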

What is the Value/Impact?

Improved infra node stability

What is the plan? How will this get completed?

In the labs, test out the two tuning configs.

1) Edit ingresses.config/cluster to add the hard-stop-after annotation. This sets the maximum lifetime of long-lived connections such as websockets. A starting point may be 6h.

2) Edit the default ingresscontroller to set the reload interval to a value higher than 5s so that fewer HAProxy processes are spawned.

https://docs.openshift.com/container-platform/4.12/scalability_and_performance/optimization/routing-optimization.html#configuring-haproxy-interval_routing-optimization
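The two changes above could be scripted roughly as follows. This is a sketch, not the playbook change itself. It assumes the annotation key is `ingress.operator.openshift.io/hard-stop-after` and that the reload interval lives at `spec.tuningOptions.reloadInterval` on the default ingresscontroller in the `openshift-ingress-operator` namespace; verify both against the docs for your cluster version. The `15s` value is a placeholder, since the issue only says "higher than 5s". The helper only builds the `oc` command lines and does not run them.

```python
import json
import shlex

# Assumed annotation key; confirm against the OpenShift docs before use.
HARD_STOP_ANNOTATION = "ingress.operator.openshift.io/hard-stop-after"


def hard_stop_after_cmd(duration):
    """Build the `oc annotate` command that caps connection lifetime."""
    return [
        "oc", "annotate", "ingresses.config/cluster",
        f"{HARD_STOP_ANNOTATION}={duration}", "--overwrite",
    ]


def reload_interval_cmd(interval):
    """Build the `oc patch` command that raises the HAProxy reload interval."""
    patch = {"spec": {"tuningOptions": {"reloadInterval": interval}}}
    return [
        "oc", "-n", "openshift-ingress-operator", "patch",
        "ingresscontrollers/default", "--type=merge",
        f"--patch={json.dumps(patch)}",
    ]


if __name__ == "__main__":
    # Print shell-quoted commands for review; nothing is executed here.
    print(shlex.join(hard_stop_after_cmd("6h")))    # 6h is the starting point from the plan
    print(shlex.join(reload_interval_cmd("15s")))   # 15s is a hypothetical value
```

Printing the commands instead of running them keeps the sketch safe to try in a lab first, which matches the plan of testing in KLAB and CLAB before PROD.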

Identify any dependencies

None

Definition of done

[x] Changes made in LAB
[x] PR for changes to playbooks that configure Ingress
[ ] Communicate changes via Community Meetup
[x] Create tickets to apply in PROD

StevenBarre commented 11 months ago

KLAB Before HAProxy Stats

StevenBarre commented 11 months ago

CLAB Before HAProxy Stats

StevenBarre commented 11 months ago

https://github.com/bcgov-c/platform-ops/pull/471

StevenBarre commented 11 months ago

KLAB After HAProxy Stats

StevenBarre commented 11 months ago

CLAB After HAProxy Stats

StevenBarre commented 11 months ago

No issues in the past week with the new settings. Memory usage appears to be a little lower as well. Should be safe to move into production.

StevenBarre commented 11 months ago

CHG0054991 scheduled for Jan 3rd

Slide added to next community meetup