bcgov / bc-wallet-mobile

BC Wallet to hold Verifiable Credentials
Apache License 2.0

Spike: Investigating Mediator Timeout Issue -- server side #990

Closed swcurran closed 10 months ago

swcurran commented 1 year ago

As reported, we are getting intermittent mediator timeouts in the BC Wallet that result in Wallet issues #812, #987, #988.

Investigations have pointed to a possible problem involving ACA-Py, Python threading, Linux scheduling, and OpenShift/Kubernetes CPU settings. See ACA-Py Issue #2157 for a deeper discussion of that.

We need to investigate the issues further, and intend to track the issue here. See the first couple of comments for some background and initial steps to be taken.

The goals:

Ultimately, we are deciding:

Tasks

swcurran commented 1 year ago

Notes from meeting on the Mediator from 2023.03.20, with Clecio, Akiff, and Andrew.

Agreed at the end of the meeting that we want to investigate and understand the issue with ACA-Py as the mediator implementation, and not go down the AFJ as mediator path at all (yet!). We can continue to increase the resources on the Prod mediator we have to try to limit the impact on existing users. Longer term, we will look into what a fully scalable mediator is and what we want to use.

WadeBarnes commented 1 year ago

There are three parts to an aca-py mediator: the aca-py agent, the wallet (postgres), and the proxy (Caddy). I just had a look at the proxy mediator after we adjusted the agent's resources to limits: CPU: 1000m, memory: 512Mi; requests: CPU: 1000m, memory: 256Mi. The proxy is barely being throttled (<1%) even though its resources have been set fairly low. The wallet, on the other hand, is being throttled about 50% of the time. I also noticed the agent is using an indy wallet, which could be contributing to some of the issues.

WadeBarnes commented 1 year ago

Talking with @cvarjao, we're going to adjust the resources on the wallet too. I've found in the past the agent and the wallet need to be adjusted together when there are performance issues.

Edit: The wallet's resources have been updated to limits: CPU: 800m, memory: 512Mi; requests: CPU: 800m, memory: 192Mi
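For anyone less familiar with the notation: `1000m` is 1000 millicores (one full CPU) and `512Mi` is 512 mebibytes. Below is a minimal Python sketch that parses the quantities quoted in this thread and checks that each request does not exceed its limit. The helper names and the `SETTINGS` layout are hypothetical, made up for illustration; only the numbers come from the comments above, and only the suffixes used here are handled.

```python
# Hypothetical helpers for reading the Kubernetes-style quantities quoted above.
# Only the suffixes that appear in this thread (plus Ki/Gi) are handled.
_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '1000m' or '1' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count, no suffix


# Values quoted in this thread; the dictionary layout itself is an assumption.
SETTINGS = {
    "agent": {
        "limits": {"cpu": "1000m", "memory": "512Mi"},
        "requests": {"cpu": "1000m", "memory": "256Mi"},
    },
    "wallet_db": {
        "limits": {"cpu": "800m", "memory": "512Mi"},
        "requests": {"cpu": "800m", "memory": "192Mi"},
    },
}


def requests_within_limits(spec: dict) -> bool:
    """Return True when both CPU and memory requests are <= their limits."""
    cpu_ok = cpu_to_millicores(spec["requests"]["cpu"]) <= cpu_to_millicores(
        spec["limits"]["cpu"]
    )
    mem_ok = memory_to_bytes(spec["requests"]["memory"]) <= memory_to_bytes(
        spec["limits"]["memory"]
    )
    return cpu_ok and mem_ok
```

Note that in both cases the CPU request equals the CPU limit, while the memory request is set below the memory limit.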

swcurran commented 1 year ago

@WadeBarnes -- what are you calling "the wallet"? Caught me off guard at first, but I'm assuming you mean the "secure storage" for the Mediator? If so -- please use that term. In our world "wallet" means the Mobile Wallet App -- aka BC Wallet -- and it's very confusing if it is used for something else.

WadeBarnes commented 1 year ago

Yes, I am referring to the secure storage for the mediator; the mediator agent's wallet database. I'll be sure to use the term secure storage in this context.

swcurran commented 1 year ago

First specific task (high level):

Once we have the issue reproduced consider how to address it:

After that, we'll think more about next steps.

WadeBarnes commented 1 year ago

I've updated the WebSocket heartbeat and timeout intervals for dev, test, and prod and set them all to ACAPY_WS_HEARTBEAT_INTERVAL=15 and ACAPY_WS_TIMEOUT_INTERVAL=60 as discussed here. Previously we had only updated test to do some performance testing.
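As a rough sanity check on those two values, here is a minimal Python sketch that reads them from the environment and confirms the heartbeat is shorter than the timeout. It assumes the heartbeat is a ping interval in seconds and the timeout is how long a silent connection is tolerated; how ACA-Py applies them internally is not shown in this thread. The function names are hypothetical, and the fallback values are the ones from this comment, not ACA-Py defaults.

```python
import os
from typing import Mapping, Tuple


def ws_intervals(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read the heartbeat and timeout intervals (assumed to be seconds).

    The fallbacks (15 and 60) are the values set in this thread,
    NOT the defaults shipped with ACA-Py.
    """
    heartbeat = int(env.get("ACAPY_WS_HEARTBEAT_INTERVAL", "15"))
    timeout = int(env.get("ACAPY_WS_TIMEOUT_INTERVAL", "60"))
    if heartbeat <= 0 or timeout <= 0:
        raise ValueError("intervals must be positive")
    # Assumption: a heartbeat at or above the timeout would let a quiet
    # connection expire before a single ping is sent.
    if heartbeat >= timeout:
        raise ValueError("heartbeat should be shorter than the timeout")
    return heartbeat, timeout


def heartbeats_per_timeout(heartbeat: int, timeout: int) -> int:
    """Number of whole heartbeat intervals that fit in one timeout window."""
    return timeout // heartbeat
```

With 15 and 60, four heartbeat intervals fit inside one timeout window, so several pings would have to go missing before a connection is considered dead under the assumption above.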

WadeBarnes commented 1 year ago

Throttling stats since the config updates:

Agent: <1% Throttled
Database: <1% Throttled
Proxy: <1% Throttled
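The thread does not say how these percentages were collected. One common way to get a comparable figure, sketched here as an assumption, is to read the container's cgroup `cpu.stat` counters and divide `nr_throttled` by `nr_periods`. The function names and the sample text are made up for illustration and are not measurements from the mediator.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines, as found in a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_percent(stats: dict) -> float:
    """Share of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no periods recorded yet, nothing to report
    return 100.0 * stats.get("nr_throttled", 0) / periods


# Made-up sample for illustration only, not data from this deployment.
SAMPLE = """nr_periods 1000
nr_throttled 5
throttled_time 123456789
"""

if __name__ == "__main__":
    print(f"{throttled_percent(parse_cpu_stat(SAMPLE)):.1f}% throttled")
```

Because the counters are cumulative, comparing two snapshots taken before and after a test window gives the throttling rate for just that window.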

swcurran commented 1 year ago

I’m interested in getting ACA-Py 0.8.0 deployed here, as it explicitly has a fix for this issue. That might be sufficient to eliminate the timeouts. Hopefully our testing in the lower environments will show whether it eliminates the timeouts without the increased limits.

swcurran commented 1 year ago

@WadeBarnes -- can we lower the resource allocation on Dev and Test to the range we had before, when we were having trouble? We’re starting to make too many changes at a time, so we don’t know what impact each individual change is having. The main thing to understand: does the fix in 0.8.0 reduce the resource requirements without causing timeouts?

jleach commented 1 year ago

@WadeBarnes @swcurran Is this issue still a thing?

WadeBarnes commented 1 year ago

@jleach, @swcurran, do we want to reduce the resource allocation in dev and/or test to see if the upgrade to ACA-Py 0.8.1 had any effect on some of the issues?

To answer @jleach, since the upgrade and "over provisioning" of the services we only see the occasional timeout when doing load testing.

swcurran commented 1 year ago

I’d say we should only lower the resources if we are going to do some testing on dev and test, e.g. before-and-after, apples-to-apples tests. I’m sure we’re not causing anyone grief by leaving things where they are, so the only value in changing is to see if it makes a difference by deliberately testing. Given the minimal use of those mediators, unless we are specific, we’ll never see a problem.
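A before-and-after comparison of that kind could be recorded along these lines. This is only a sketch: the `LoadTestRun` structure, its field names, and the idea of measuring timeouts per attempted connection are assumptions, not something the team has agreed on.

```python
from dataclasses import dataclass


@dataclass
class LoadTestRun:
    """One load-test run against a mediator (hypothetical record layout)."""

    label: str
    connections_attempted: int
    timeouts: int

    @property
    def timeout_rate(self) -> float:
        """Fraction of attempted connections that timed out."""
        if self.connections_attempted == 0:
            raise ValueError("run has no attempted connections")
        return self.timeouts / self.connections_attempted


def rate_change(before: LoadTestRun, after: LoadTestRun) -> float:
    """Change in timeout rate; negative means fewer timeouts after the change."""
    return after.timeout_rate - before.timeout_rate
```

For the comparison to be apples to apples, both runs would need the same load profile and environment, with only one thing changed between them (for example, the ACA-Py version or the resource limits, but not both).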

jeffaudette commented 10 months ago

Closing issue; we have not seen any recent timeouts.