Closed swcurran closed 10 months ago
Notes from meeting on the Mediator from 2023.03.20, with Clecio, Akiff, and Andrew.
Even with the current resource settings (limits: CPU: 800m, memory: 256Mi; requests: CPU: 800m, memory: 96Mi), timeouts on the Wallet side are still happening intermittently. As far as we know, with these settings the throttling has dropped considerably, such that one would not expect timeouts; that is, there has not been a 1-1 relationship between throttling and timeouts.
Agreed at the end of the meeting that we want to investigate and understand the issue with ACA-Py as the mediator implementation, and not go down the AFJ as mediator path at all (yet!). We can continue to increase the resources on the Prod mediator we have to try to limit the impact on existing users. Longer term we will look into what a fully scalable mediator is and what we want to use.
There are three parts to an aca-py mediator: the aca-py agent, the wallet (postgres), and the proxy (Caddy). I just had a look at the mediator's proxy after we adjusted the agent's resources to limits: CPU: 1000m, memory: 512Mi; requests: CPU: 1000m, memory: 256Mi. The proxy is barely being throttled (<1% of the time) even though its resources have been set fairly low. The wallet, on the other hand, is being throttled about 50% of the time. I also noticed the agent is using an indy wallet, which could be contributing to some of the issues.
Talking with @cvarjao, we're going to adjust the resources on the wallet too. I've found in the past the agent and the wallet need to be adjusted together when there are performance issues.
Edit: The wallet's resources have been updated to limits: CPU: 800m, memory: 512Mi; requests: CPU: 800m, memory: 192Mi.
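To compare the agent and wallet database settings quoted above side by side, here is a minimal Python sketch. The helper names (`cpu_to_cores`, `memory_to_bytes`, `summarize`) are hypothetical and only handle the suffixes that appear in this thread (`m` for millicores, `Mi` for mebibytes); it is not a general Kubernetes quantity parser.

```python
# Hypothetical helpers for comparing the resource settings quoted in this thread.
# Only the "m" (millicore) and "Mi" (mebibyte) suffixes used here are supported.


def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity such as "800m" or "1" to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" to bytes (Mi suffix only)."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"unsupported memory suffix in {quantity!r}")
    return int(quantity[:-2]) * 1024 * 1024


# Values copied from the comments above.
SETTINGS = {
    "agent": {
        "limits": {"cpu": "1000m", "memory": "512Mi"},
        "requests": {"cpu": "1000m", "memory": "256Mi"},
    },
    "wallet database": {
        "limits": {"cpu": "800m", "memory": "512Mi"},
        "requests": {"cpu": "800m", "memory": "192Mi"},
    },
}


def summarize(name: str, spec: dict) -> str:
    """Render one component's limits and requests in cores and MiB."""
    parts = []
    for kind in ("limits", "requests"):
        cores = cpu_to_cores(spec[kind]["cpu"])
        mib = memory_to_bytes(spec[kind]["memory"]) // (1024 * 1024)
        parts.append(f"{kind}: {cores:.1f} cores, {mib} MiB")
    return f"{name}: " + "; ".join(parts)


if __name__ == "__main__":
    for name, spec in SETTINGS.items():
        print(summarize(name, spec))
```

Because the CPU request equals the CPU limit for both components, each one is guaranteed the same amount of CPU it is capped at.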
@WadeBarnes -- what are you calling "the wallet"? Caught me off guard at first, but I'm assuming you mean the "secure storage" for the Mediator? If so -- please use that term. In our world "wallet" means the Mobile Wallet App -- aka BC Wallet -- and it's very confusing if it is used for something else.
Yes, I am referring to the secure storage for the mediator; the mediator agent's wallet database. I'll be sure to use the term secure storage in this context.
First specific task (high level):
Once we have the issue reproduced consider how to address it:
After that, we'll think more about next steps.
I've updated the WebSocket heartbeat and timeout intervals for dev, test, and prod and set them all to ACAPY_WS_HEARTBEAT_INTERVAL=15 and ACAPY_WS_TIMEOUT_INTERVAL=60 as discussed here. Previously we had only updated test to do some performance testing.
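As a rough illustration of how the two intervals relate, the sketch below reads the two environment variables named above and works out how many heartbeat intervals fit inside one timeout window. It assumes both values are in seconds and that the timeout is measured from the last activity on the connection; those are assumptions for illustration, not a description of ACA-Py's internals. The function names are hypothetical, and the fallback values are simply the ones set in this comment.

```python
import os


def ws_intervals(env=os.environ):
    """Read the heartbeat and timeout intervals, assumed to be in seconds.

    The fallbacks (15 and 60) are the values set in this thread,
    not necessarily ACA-Py's own defaults.
    """
    heartbeat = int(env.get("ACAPY_WS_HEARTBEAT_INTERVAL", "15"))
    timeout = int(env.get("ACAPY_WS_TIMEOUT_INTERVAL", "60"))
    return heartbeat, timeout


def heartbeats_per_timeout(heartbeat: int, timeout: int) -> int:
    """Number of whole heartbeat intervals that fit in one timeout window."""
    if heartbeat <= 0:
        raise ValueError("heartbeat interval must be positive")
    return timeout // heartbeat


if __name__ == "__main__":
    hb, to = ws_intervals()
    print(f"heartbeat every {hb}s, timeout after {to}s "
          f"({heartbeats_per_timeout(hb, to)} heartbeat intervals per timeout)")
```

Under those assumptions, 15 and 60 leave room for four heartbeat intervals before a connection would be considered timed out.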
Throttling stats since the config updates: Agent: <1% throttled; Database: <1% throttled; Proxy: <1% throttled.
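The thread does not say how these percentages were obtained. One common definition, used here as an assumption, is the share of CPU scheduling periods in which a container was throttled, computed from the `nr_periods` and `nr_throttled` counters in the container's cgroup `cpu.stat` file. A minimal sketch, with hypothetical function names and made-up sample numbers:

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse "key value" lines, as found in a cgroup cpu.stat file, into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_percent(stats: dict) -> float:
    """Percentage of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return 100.0 * stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    # Made-up sample values for illustration only, not measurements from this issue.
    sample = "nr_periods 2000\nnr_throttled 1000\n"
    print(f"{throttled_percent(parse_cpu_stat(sample)):.1f}% throttled")
```

Because the counters are cumulative, a figure for a specific window (for example, "since the config updates") would be computed from the difference between two readings rather than from a single one.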
I’m interested in us getting the 0.8.0 ACA-Py deployed into this, as it explicitly has a fix for this issue. That might be sufficient to eliminate the timeouts. Hopefully our testing in the lower environments will see if this might eliminate the timeouts without the increased limits.
@WadeBarnes, can we lower the resource usage on Dev and Test to be in the range we had it before, when we were having trouble? We’re starting to make too many changes at a time, so we don’t know what impact the individual changes are having. The main one to understand: does the fix in 0.8.0 reduce the resource requirements without causing timeouts?
@WadeBarnes @swcurran Is this issue still a thing?
@jleach, @swcurran, do we want to reduce the resource allocation in dev and/or test to see if the upgrade to ACA-Py 0.8.1 had any effect on some of the issues?
To answer @jleach, since the upgrade and "over provisioning" of the services we only see the occasional timeout when doing load testing.
I’d say we should only lower the resources if we are going to do some testing on dev and test, e.g. before and after apples-to-apples tests. I’m sure we’re not causing anyone grief by leaving things where they are, so the only value in changing is to see if it makes a difference by deliberately testing. Given the minimal use of those mediators, unless we are specific, we’ll never see a problem.
Closing issue; we have not seen any recent timeouts.
As reported, we are getting intermittent mediator timeouts in the BC Wallet that result in Wallet issues #812, #987, #988.
Investigations have pointed to a possible problem with ACA-Py, Python Threading and Linux Scheduling and OpenShift/Kubernetes CPU settings. See this ACA-Py Issue #2157 for a deeper discussion on that.
We need to investigate the issues further, and intend to track the issue here. See the first couple of comments for some background and initial steps to be taken.
The goals:
Ultimately, we are deciding:
Tasks