bcgov / SIMS

Student Information Management System. Post-Secondary Student Financial Aid System
Apache License 2.0

Patroni going into recovery mode after changing the credentials #2783

Closed: guru-aot closed this issue 8 months ago

guru-aot commented 9 months ago

Patroni is going into recovery mode after changing the credentials for the workers, api, queue-consumers, and load-test-gateway services from the superuser to a database user.

Screenshots: MicrosoftTeams-image (6).png, MicrosoftTeams-image (7).png, MicrosoftTeams-image (8).png
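
For context, a minimal sketch of what the credential change looks like on the application side, assuming the services build their TypeORM connection from environment variables; the variable names and defaults below are illustrative assumptions, not the project's actual configuration:

```typescript
// Minimal sketch, assuming the services create their TypeORM connection from
// environment variables. DB_HOST, DB_USER, etc. are illustrative names only.
import { DataSource } from "typeorm";

export const appDataSource = new DataSource({
  type: "postgres",
  host: process.env.DB_HOST ?? "localhost",
  port: Number(process.env.DB_PORT ?? 5432),
  // Dedicated application user instead of the Postgres/Patroni superuser.
  username: process.env.DB_USER ?? "app_database_user",
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME ?? "sims",
  schema: "sims",
});
```
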
dheepak-aot commented 9 months ago

Pulling this into the sprint and added points as per @ninosamson.

guru-aot commented 9 months ago

As part of the analysis, the database in dev goes into recovery mode when the assessment-workflow-queue-retry queue runs every hour.

image.png
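
For reference, a minimal sketch of how an hourly retry queue like this could be scheduled with Bull; only the queue name comes from the issue, while the Redis URL, job payload, and cron expression are assumptions for illustration:

```typescript
// Minimal sketch, assuming the retry queue is a Bull queue driven by a cron
// repeat option. The Redis URL and the empty payload are illustrative only.
import Queue from "bull";

const retryQueue = new Queue(
  "assessment-workflow-queue-retry",
  process.env.REDIS_URL ?? "redis://localhost:6379",
);

export async function scheduleHourlyRetry(): Promise<void> {
  // Repeatable job that fires at the top of every hour.
  await retryQueue.add({}, { repeat: { cron: "0 * * * *" } });
}
```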

Points tested and analysis done:

  1. This is not happening in the test environment and is not reproducible there.

  2. Initially changed the permissions of the database user to match those of the superuser and tried to resolve the issue, but this failed.

  3. Changed the permissions of the data folders and changed their ownership in Patroni to grant group access, but it still fails.

  4. Changed the database connection back to the superuser and found the issue happens for both the superuser and the database user.

  5. Completely removed the Patroni StatefulSet in dev along with the PVC and other configuration, and reinstalled a new Patroni with patroni-1 as the leader. The error still persists after the database recovered in the DB recovery pod.

  6. Running the SQL that was causing the error during parallel execution of the assessment-workflow-queue-retry queue (enqueueCancelAssessmentRetryWorkflows) does not cause any failures when it is run directly in pgAdmin (a TypeORM sketch of this query follows the list):

     SELECT "StudentAssessment"."id" AS "StudentAssessment_id"
     FROM "sims"."student_assessments" "StudentAssessment"
     WHERE "StudentAssessment"."student_assessment_status" = 'Cancellation queued'
       AND "StudentAssessment"."student_assessment_status_updated_on" < '2024-02-08T19:49:01.902Z'

  7. Actual error found in the queue consumer

    image.png
  8. Kibana logs to check for further investigation https://kibana-openshift-logging.apps.silver.devops.gov.bc.ca/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30m,mode:quick,to:now))&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',key:kubernetes.namespace_name,negate:!f,params:(query:'0c27fb-dev',type:phrase),type:phrase,value:'0c27fb-dev'),query:(match:(kubernetes.namespace_name:(query:'0c27fb-dev',type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',key:kubernetes.pod_name,negate:!f,params:(query:patroni-1,type:phrase),type:phrase,value:patroni-1),query:(match:(kubernetes.pod_name:(query:patroni-1,type:phrase))))),index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',interval:auto,query:(language:lucene,query:'%22FATAL:%20%20the%20database%20system%20is%20in%20recovery%20mode%22'),sort:!('@timestamp',desc))

    image.png

    Note: The Patroni leader in the StatefulSet changed from patroni-2 to patroni-1, which is why the screenshot and the Kibana search show different Patroni pod_name values. The StatefulSet was completely reinstalled as part of the analysis.
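
As referenced in point 6, a minimal TypeORM sketch of a query that would emit equivalent SQL; the entity, property names, and repository wiring are assumptions for illustration, with only the selected id, table, status value, and timestamp comparison taken from the logged statement:

```typescript
// Minimal sketch, assuming a TypeORM 0.3 repository for an assumed
// StudentAssessment entity whose properties map to the columns in the logged
// SQL. Everything not present in point 6 is illustrative.
import { LessThan, Repository } from "typeorm";
import { StudentAssessment } from "./student-assessment.model"; // assumed entity path

export async function getAssessmentsQueuedForCancellation(
  repo: Repository<StudentAssessment>,
  queuedBefore: Date,
): Promise<StudentAssessment[]> {
  return repo.find({
    select: { id: true },
    where: {
      studentAssessmentStatus: "Cancellation queued",
      studentAssessmentStatusUpdatedOn: LessThan(queuedBefore),
    },
  });
}
```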

Possible Cause:

The data in the dev environment might be what is causing this issue; to investigate further, a brand-new database built from scratch without any data can be used for testing. Based on the analysis above, this also appears to be a non-issue for the production environment.

guru-aot commented 8 months ago

As part of the analysis, the relationId properties in the DB models cause extra subqueries to be generated, which drives the database into recovery mode. Removing them will improve query performance and eventually allow 50K records to be handled easily (see the sketch below).
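
A minimal sketch of the kind of model change described above, assuming the models use TypeORM: @RelationId properties are resolved through extra generated queries when entities are loaded, so exposing the foreign key as a plain column avoids them. The entity and column names here are illustrative, not the actual SIMS models:

```typescript
import {
  Column,
  Entity,
  JoinColumn,
  ManyToOne,
  PrimaryGeneratedColumn,
} from "typeorm";
import { Application } from "./application.model"; // assumed related entity

@Entity({ name: "student_assessments", schema: "sims" })
export class StudentAssessment {
  @PrimaryGeneratedColumn()
  id: number;

  @ManyToOne(() => Application)
  @JoinColumn({ name: "application_id" })
  application: Application;

  // Before (removed): @RelationId((a: StudentAssessment) => a.application)
  // forced TypeORM to compute the id through an extra generated query.
  // After: read the foreign key column directly, with no extra query.
  @Column({ name: "application_id", nullable: false })
  applicationId: number;
}
```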

guru-aot commented 8 months ago

image