bcgov / nr-fom

Forestry Operations Map
Apache License 2.0

FOM Database Backup Container Failing #637

Closed ianliuwk1019 closed 3 months ago

ianliuwk1019 commented 3 months ago

Describe the Bug Catherine received alerts every morning from OpenShift about issues on several backup containers. After having a look, the current FOM production (and all environments) database backups are failing. However, there are two database backup jobs currently running: one from before (named fom-[env]-backups), and one from the OpenShift deployment (fom-[env]-database-backups, see screenshot).

Basil's investigation found that the failing one from the OpenShift deployment has the wrong database secret and the wrong database key (see screenshot).

We need to fix the bug so that after deployment the backup container can back up the database again (also see additional context).

Expected Behaviour & Acceptance Criteria

Screenshots: PROD, TEST, DEMO, DEV (see attached images)

Additional context

ianliuwk1019 commented 3 months ago

Although we updated the "db" component's openshift.deploy.yml to set the backup storage to 1Gi, in the old deployment (on DEV, PR-39) the storage PVC still showed 256Mi. Closing and reopening the PR (after deleting/cleaning the same-PR storage) did not work either; it still deployed with 256Mi. Is this a bug in OpenShift? After closing the PR again and opening a "new" PR from the same branch, it deployed as "PR-41" and the storage is now 1Gi (see screenshot).

It is strange that OpenShift somehow knows to link to the previous PVC, for reasons unknown. Although the DEV environment seems to be good now, we need to verify whether deploying to TEST will hit the same issue. @basilv @MCatherine1994

ianliuwk1019 commented 3 months ago

Yes, after deployment to TEST it hits the same issue as in DEV: the storage volume is still 256Mi (see screenshot). The OpenShift console has an "Expand PVC" option; I will wait for the new TEST deployment's backup container to run for the first time tonight and see if the size changes. If not, I will try that option.
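For context on why redeploying did not help: Kubernetes treats an existing PVC's requested storage as expand-only, and re-applying a template does not recreate a PVC that already exists, so the original 256Mi value sticks. The console's "Expand PVC" button is equivalent to patching the claim directly. A sketch of that (the PVC name is an assumption; check with `oc get pvc` first):

```shell
# Expand the existing backup PVC in place instead of recreating the deployment.
# Works only if the StorageClass has allowVolumeExpansion: true.
oc patch pvc fom-test-database-backups \
  -p '{"spec":{"resources":{"requests":{"storage":"1Gi"}}}}'

# Watch the resize take effect (CAPACITY column should move to 1Gi):
oc get pvc fom-test-database-backups -w
```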

ianliuwk1019 commented 3 months ago

For this ticket, we had to fix the wrong 'secret', the wrong 'secret key', and the wrong 'DATABASE_SERVICE_NAME', and adjust the backup storage to 1Gi in the OpenShift template.

However, on OpenShift deployment we found that the CronJob's configuration was not updated (so it was still failing). So before deploying to PROD, we need to:

@basilv
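For reference, the kind of wiring that had to be corrected in the db component's openshift.deploy.yml looks roughly like this; the secret name, key, and service name below are placeholders, not the actual production values:

```yaml
# Sketch of the backup container's env wiring (placeholder names).
containers:
  - name: backup
    env:
      - name: DATABASE_SERVICE_NAME
        value: fom-prod-database        # must match the actual DB Service name
      - name: DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: fom-prod-database     # the correct secret name
            key: database-password      # the correct key within that secret
```

If any of these three values points at a nonexistent secret, key, or service, the backup container cannot reach the database and the CronJob fails every run.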

basilv commented 3 months ago

@ianliuwk1019 I believe the PVC size not increasing is a known issue that I've run into before. The need to clean up failing jobs is weirder, but I'm fine with that. I'm surprised you even need to delete the CronJob resource itself, did you test whether updates to it can be applied without deleting it?

ianliuwk1019 commented 3 months ago

Hi @basilv , no, I didn't update the resource itself; initially I deleted the failed jobs and the failed pods. My understanding was that the CronJob comes from our template and should be updated according to it, but for some reason the new configuration was not applied to the resource after the new deployment. So Catherine and I deleted the resource and then deployed again without any code change, and this time it picked up the new configuration from our template.
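The cleanup described above can be sketched with `oc` commands like these (the resource names and the job label selector are assumptions; list the actual resources first):

```shell
# See what exists before deleting anything.
oc get cronjob,job,pod

# Remove the failed Jobs and their pods (label is an assumption; it only
# works if the CronJob's job template actually sets such a label).
oc delete job -l cronjob-name=fom-test-database-backups

# Remove the CronJob itself, then re-run the deployment with no code change;
# the CronJob is recreated from the template with the updated
# secret/key/DATABASE_SERVICE_NAME/storage settings.
oc delete cronjob fom-test-database-backups
```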

ianliuwk1019 commented 3 months ago

Update for the TEST backup container run: (see screenshot)

basilv commented 3 months ago

Looks good!

ianliuwk1019 commented 3 months ago

PROD database backup successful (see screenshot); storage 7% used (see screenshot).

ianliuwk1019 commented 3 months ago

Now we have successful backups for all environments, but I am not sure how we can access the backup files. One document I found uses "oc rsync" on the mountPath to copy files to a local machine (see screenshot). It seems this requires a running pod, but the CronJob pod finishes after a few seconds, so the files might not be easy to access. We probably need to create some new deployment config for easy access. I assume ticket #638 will have a more detailed procedure.
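One way around the short-lived CronJob pod is to start a temporary pod that mounts the backup PVC, rsync from it, and delete it afterwards. A sketch (the image, PVC name, and mount path are assumptions):

```shell
# Launch a throwaway pod with the backup PVC mounted at /backups.
oc run backup-browser --image=registry.access.redhat.com/ubi9/ubi \
  --restart=Never --overrides='
{
  "spec": {
    "containers": [{
      "name": "backup-browser",
      "image": "registry.access.redhat.com/ubi9/ubi",
      "command": ["sleep", "3600"],
      "volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
    }],
    "volumes": [{"name": "backups",
                 "persistentVolumeClaim": {"claimName": "fom-prod-database-backups"}}]
  }
}'

# Copy the backup files out, then clean up.
oc rsync backup-browser:/backups ./local-backups
oc delete pod backup-browser
```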

basilv commented 3 months ago

The restore process would involve getting the backup files onto the database server pod and running the restore process there. We don't need (or want) such files locally.
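Assuming a Postgres database and hypothetical pod/file names (none of these are confirmed in this thread), the restore flow Basil describes might look like:

```shell
# Stage a backup file onto the database server pod, then restore it there;
# the backup files never need to land on a workstation long-term.
oc rsync ./restore-staging/ fom-prod-database-0:/tmp/restore/

# Run the restore inside the pod (filename, user, and db name are assumptions).
oc exec fom-prod-database-0 -- bash -c \
  'gunzip -c /tmp/restore/backup.sql.gz | psql -U postgres -d fom'
```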

ianliuwk1019 commented 3 months ago

Hi @basilv , do you want your alert email setup changed from 'gov' to 'cgi'? Last time Catherine checked, she mentioned yours was set to 'gov', so you probably didn't get notified by the container-failure alerts; however, it might be too many emails for you...