Pulled into the sprint and added points as per @ninosamson.
As part of the analysis, the database in dev goes into recovery mode when the assessment-workflow-queue-retry queue runs every hour.
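For context, below is a minimal sketch of how an hourly repeatable job like this is typically scheduled with Bull in a NestJS/Node service. The queue name matches the issue, but the Redis settings, cron expression, and job options are assumptions, not the project's actual configuration.

```typescript
import Bull from "bull";

// Queue name taken from the issue; Redis connection details are assumptions.
const assessmentWorkflowQueueRetry = new Bull("assessment-workflow-queue-retry", {
  redis: { host: process.env.REDIS_HOST ?? "redis", port: 6379 },
});

export async function scheduleHourlyRetry(): Promise<void> {
  await assessmentWorkflowQueueRetry.add(
    {},
    {
      // Repeat at the top of every hour; each run looks for assessments in
      // "Cancellation queued" status and enqueues the cancellation retry workflows.
      repeat: { cron: "0 * * * *" },
      removeOnComplete: true,
    },
  );
}
```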
Points tested and analysis done:
This is not happening in the test environment and is not replicable there.
Initially changed the permissions of the database user to match those of the superuser to try to resolve the issue, but this failed.
Changed the permissions of the data folders and changed ownership in Patroni to grant group access, but it still fails.
Changed the database connection to the superuser and found the issue happens for both the superuser and the database user.
Completely removed the Patroni StatefulSet in dev along with the PVC and other configurations and reinstalled Patroni with patroni-1 as the leader. The error still persists after the database recovers from recovery mode.
Running the SQL that was causing the error during parallel execution of the assessment-workflow-queue-retry queue does not cause any failures when run from pgAdmin:
SELECT "StudentAssessment"."id" AS "StudentAssessment_id" FROM "sims"."student_assessments" "StudentAssessment" WHERE "StudentAssessment"."student_assessment_status" = 'Cancellation queued' AND "StudentAssessment"."student_assessment_status_updated_on" < '2024-02-08T19:49:01.902Z'
enqueueCancelAssessmentRetryWorkflows
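For reference, here is a sketch of how an equivalent of the query above could be issued through TypeORM; the entity, property, and status value names are inferred from the generated SQL and are assumptions about the actual enqueueCancelAssessmentRetryWorkflows implementation.

```typescript
import { DataSource, LessThan } from "typeorm";
import { StudentAssessment } from "./student-assessment.model"; // hypothetical path

// Returns the ids of assessments queued for cancellation before the given threshold,
// mirroring the SQL captured above. "Cancellation queued" is likely an enum member
// (e.g. StudentAssessmentStatus.CancellationQueued) in the real code base.
export async function getAssessmentsQueuedForCancellation(
  dataSource: DataSource,
  retryThreshold: Date,
): Promise<StudentAssessment[]> {
  return dataSource.getRepository(StudentAssessment).find({
    select: { id: true },
    where: {
      studentAssessmentStatus: "Cancellation queued",
      studentAssessmentStatusUpdatedOn: LessThan(retryThreshold),
    },
  });
}
```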
Actual error found in the queue consumer
Kibana logs to check for further investigation
https://kibana-openshift-logging.apps.silver.devops.gov.bc.ca/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30m,mode:quick,to:now))&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',key:kubernetes.namespace_name,negate:!f,params:(query:'0c27fb-dev',type:phrase),type:phrase,value:'0c27fb-dev'),query:(match:(kubernetes.namespace_name:(query:'0c27fb-dev',type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',key:kubernetes.pod_name,negate:!f,params:(query:patroni-1,type:phrase),type:phrase,value:patroni-1),query:(match:(kubernetes.pod_name:(query:patroni-1,type:phrase))))),index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',interval:auto,query:(language:lucene,query:'%22FATAL:%20%20the%20database%20system%20is%20in%20recovery%20mode%22'),sort:!('@timestamp',desc))
Note: The Patroni leader in the StatefulSet changed from patroni-2 to patroni-1, which is why the screenshot and the Kibana search show different Patroni pod_name values. As part of the analysis it was completely reinstalled.
Possible Cause:
The data in dev might be what is causing this issue; to investigate further, a completely new database built from scratch without any data can be used for testing. Based on the above analysis, this also appears to be a non-issue for the prod environment.
As part of the analysis, the relationId in the DB models results in sub-query generation, which causes the DB to go into recovery mode. Removing it will improve query performance and eventually allow 50K records to be processed easily.
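If "relationId" here refers to TypeORM's @RelationId decorator, the change would look roughly like the sketch below: replace the decorated property (which makes TypeORM resolve the id through an extra sub-query) with a plain column mapped to the foreign key. The entity and relation names are illustrative assumptions.

```typescript
import { Column, Entity, JoinColumn, ManyToOne, PrimaryGeneratedColumn } from "typeorm";
import { Application } from "./application.model"; // hypothetical related entity

@Entity({ name: "student_assessments", schema: "sims" })
export class StudentAssessment {
  @PrimaryGeneratedColumn()
  id: number;

  // Before (extra sub-query per row to resolve the id):
  // @RelationId((assessment: StudentAssessment) => assessment.application)
  // applicationId: number;

  // After (reads the foreign-key column directly, no sub-query):
  @Column({ name: "application_id", nullable: true })
  applicationId?: number;

  @ManyToOne(() => Application)
  @JoinColumn({ name: "application_id" })
  application?: Application;
}
```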
Patroni goes into recovery mode after changing the credentials for workers, api, queue-consumers, and load-test-gateway from the superuser to the database user.
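For illustration only, a minimal sketch of the kind of credential switch described above, assuming the services build a TypeORM connection from environment variables; the variable and user names are hypothetical, not the project's actual configuration.

```typescript
import { DataSource } from "typeorm";

// Shared connection used by workers, api, queue-consumers, and load-test-gateway.
// Previously the superuser credentials were supplied here; after the change each
// service authenticates with the non-superuser application account instead.
export const appDataSource = new DataSource({
  type: "postgres",
  host: process.env.POSTGRES_HOST ?? "patroni",
  port: Number(process.env.POSTGRES_PORT ?? 5432),
  username: process.env.POSTGRES_USER ?? "app_database_user", // was the superuser before
  password: process.env.POSTGRES_PASSWORD,
  database: process.env.POSTGRES_DB ?? "sims",
  schema: "sims",
});
```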