bcgov / SIMS

Student Information Management System. Post-Secondary Student Financial Aid System
Apache License 2.0
25 stars 14 forks source link

Analysis and troubleshooting on patroni going to recovery mode on load test #2544

Closed dheepak-aot closed 9 months ago

dheepak-aot commented 11 months ago

Describe the task During the process of camunda load testing #2489, when the load test was performed with 5000 assessments with 20 VUs(virtual users) in DEV environment, it has been consistently observed for every load test (until workers were adjusted to reduce the amount of parallel requests) that the database was temporarily going to recovery mode for few seconds and then coming back up and stable after that.

Error observed in workers :

Private Zenhub Image

How to replicate the issue:

To replicate the issue in DEV environment, run the load test with 5000 iterations and 20 VUs. As the worker configurations are updated to stabilize the load on DB now, before running the load test, update the min replicas for workers to 4 or a higher value.

To install K6 and run load test, follow the wiki: https://github.com/bcgov/SIMS/wiki/Load-testing

Points which came from discussion when this error was observed

There was a point which came up in a discussion regarding the type of user that we use to connect database from all the applications(API, Workers, Queue Consumers) that we use superuser instead of a regular user. Need to investigate if that is a possible root cause.

Acceptance Criteria

Additional context

guru-aot commented 9 months ago

https://cloud.ibm.com/docs/databases-for-postgresql?topic=databases-for-postgresql-managing-connections#postgres-connection-limits

as part of the connection limits to the superuser it is retricted to 15 and the database users 100. The change from the superuser to database user in the load test gives the below results

image.png

The red box notes the load test run assigning the superuser connection for the postgres and the orange one denotes the run that happened assigning the database user connection. Clearly not once the database went to recovery mode. The suggestion is changing the connection parameters in API, load-test-gateway, queue-consumers, and workers to database user than the super user.

In our case the database user does not have proper grants to access the tables and sequences in the database. So please run the below grants for the proper connection of the data to the database schema.

GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA sims TO "app_database_user";