bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

Traction ACA-Py instances being heavily throttled in some Traction (bc0192) environments #181

Closed WadeBarnes closed 2 months ago

WadeBarnes commented 3 months ago

The Traction ACA-Py instances in some environments are being throttled at >30% on average. Review and adjust the CPU resource allocations, primarily the CPU limit, to reduce or eliminate the throttling. The goal should be to reduce throttling to <25% on average. For production, an even lower average may be desirable.

These metrics can be easily reviewed using the Namespace Monitoring dashboard available through Grafana in our new monitoring stack.
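
As a rough illustration of the numbers above: a throttling percentage is commonly derived as the share of CFS scheduling periods in which a container was throttled (for example from cAdvisor's `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` counters). Whether the Namespace Monitoring dashboard computes it exactly this way is an assumption, and the function names below are hypothetical. A minimal sketch:

```python
# Sketch only: assumes throttling % = throttled CFS periods / total CFS periods.
# The dashboard's actual query is not shown in this issue.

TARGET_MAX_PERCENT = 25.0  # goal stated in the issue: <25% on average


def throttle_percent(throttled_periods: int, total_periods: int) -> float:
    """Return the percentage of CFS periods in which the container was throttled."""
    if total_periods <= 0:
        # No periods recorded (idle container or counters not yet populated).
        return 0.0
    return 100.0 * throttled_periods / total_periods


def meets_target(throttled_periods: int, total_periods: int,
                 target: float = TARGET_MAX_PERCENT) -> bool:
    """True when the throttling percentage is strictly below the target."""
    return throttle_percent(throttled_periods, total_periods) < target
```

For example, `meets_target(31, 100)` is false (31% throttled), while `meets_target(10, 100)` is true.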

swcurran commented 3 months ago

Questions from the raft of these issues:

WadeBarnes commented 3 months ago
  • How are we detecting these situations? Do we have notifications about throttling set up? Presumably, if not, we’ll add them?
  • Any idea why these are coming to light now? I thought we had a handle on the throttling and were happy with it.

The issues are being detected through a review of our new monitoring dashboards. I think you have access. The Namespace Monitoring dashboard in particular has a graph specific to throttling that gives us a much better view than we had with any of the other tools available for the platform. What's surfacing are the areas that were overlooked by our targeted testing and adjustments.

Adding alerts would be the next step.

  • Is it a problem to ramp up the CPU as needed, or will that upset the Platform Team?

The adjustments that need to be made should not upset the platform team. They are concerned with over-reservation of resources, which we won't be doing.

  • Presumably, we just need to tweak the deployment scripts/configuration in a PR for the new resource needs and that will trigger a redeploy?

Correct, simple adjustments to the configuration.

WadeBarnes commented 3 months ago

The CPU limit on the Traction ACA-Py instances should be increased from 300m to somewhere between 500m and 1000m. Recommend 500m for dev, 750m for test, and 1000m for prod. For the most part the instances are not throttled, but they are also mostly idle. When in use, the throttling can spike significantly.
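
To put the recommended values side by side with the current 300m limit, here is a small sketch that converts Kubernetes CPU quantities to millicores and reports the relative increase per environment. The dictionary and helper names are hypothetical; only the values (300m, 500m, 750m, 1000m) come from this comment.

```python
# Sketch only: names are illustrative; the limit values are from the issue.

PREVIOUS_CPU_LIMIT = "300m"
RECOMMENDED_CPU_LIMITS = {"dev": "500m", "test": "750m", "prod": "1000m"}


def millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '500m', '1' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    # Plain numbers are whole or fractional cores: 1 core = 1000 millicores.
    return int(round(float(quantity) * 1000))


def increase_factor(env: str) -> float:
    """How many times larger the recommended limit is than the previous one."""
    return millicores(RECOMMENDED_CPU_LIMITS[env]) / millicores(PREVIOUS_CPU_LIMIT)
```

Calling `increase_factor("test")` gives 2.5, i.e. the test limit would be two and a half times the current one.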

WadeBarnes commented 3 months ago

@i5okie, where should these changes (values file) be made?

i5okie commented 3 months ago

For dev, right in the traction repo. For test and prod, in the trust-over-ip-configurations repo, in the helm values folder.

i5okie commented 2 months ago

PRs with updated values:

[dev deployment] https://github.com/bcgov/traction/pull/1131

[test and prod deployments] https://github.com/bcgov/trust-over-ip-configurations/pull/207