bcgov / nr-fom

Forestry Operations Map
Apache License 2.0

FOM Database Backup Container Failing #637

Closed ianliuwk1019 closed 3 months ago

ianliuwk1019 commented 3 months ago

Describe the Bug Catherine received alerts every morning from OpenShift about issues on several backup containers. After having a look, the current FOM production (and all environments) database backups are failing. However, there are two database backup jobs currently running: one from before (named fom-[env]-backups), and one from the OpenShift deployment (fom-[env]-database-backups, see screenshot).

Basil's investigation found that the failing one from the OpenShift deployment has the wrong database secret and the wrong database key (see screenshot).

We need to fix the bug so that after deployment the backup container can back up the database again (also see additional context).

Expected Behaviour & Acceptance Criteria

Screenshots: PROD, TEST, DEMO, DEV (see attached images)

Additional context

ianliuwk1019 commented 3 months ago

Although we updated the "db" component's openshift.deploy.yml to set the backup storage to 1Gi, in the old deployment (on DEV, PR-39) the storage PVC still showed 256Mi. Closing and reopening the PR (after deleting/cleaning the same-PR storage) did not work either; it still deployed with 256Mi. Is this a bug in OpenShift? After closing the PR again and opening a "new" PR from the same branch, it deployed as "PR-41" and the storage is now 1Gi (see screenshot).

It is strange that OpenShift somehow knows to link to the previous PVC, for reasons unknown. Although the DEV environment seems to be good now, we need to verify whether deploying to TEST will hit the same issue. @basilv @MCatherine1994

ianliuwk1019 commented 3 months ago

Yes, after deployment to TEST it hits the same issue as in DEV: the storage volume is still 256Mi (see screenshot). The OpenShift console has an "Expand PVC" option; I will wait for the new TEST deployment's backup container to run for the first time tonight and see if the size changes. If not, I will try that option.
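For context on why redeploying did not help: Kubernetes treats an existing PVC's requested storage as expand-only, and re-applying a template does not recreate a PVC that already exists, so the original 256Mi value sticks. The console's "Expand PVC" button is equivalent to patching the claim directly. A sketch of that (the PVC name is an assumption; check with `oc get pvc` first):

```shell
# Expand the existing backup PVC in place instead of recreating the deployment.
# Works only if the StorageClass has allowVolumeExpansion: true.
oc patch pvc fom-test-database-backups \
  -p '{"spec":{"resources":{"requests":{"storage":"1Gi"}}}}'

# Watch the resize take effect (CAPACITY column should move to 1Gi):
oc get pvc fom-test-database-backups -w
```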

ianliuwk1019 commented 3 months ago

For this ticket, we had to fix the wrong 'secret', the wrong 'secret key', and the wrong 'DATABASE_SERVICE_NAME', and adjust the backup storage to 1Gi in the OpenShift template.

However, on OpenShift deployment we found that the CronJob's configuration was not updated (so it was still failing). So before deploying to PROD, we need to:

@basilv
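For reference, the kind of wiring that had to be corrected in the db component's openshift.deploy.yml looks roughly like this; the secret name, key, and service name below are placeholders, not the actual production values:

```yaml
# Sketch of the backup container's env wiring (placeholder names).
containers:
  - name: backup
    env:
      - name: DATABASE_SERVICE_NAME
        value: fom-prod-database        # must match the actual DB Service name
      - name: DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: fom-prod-database     # the correct secret name
            key: database-password      # the correct key within that secret
```

If any of these three values points at a nonexistent secret, key, or service, the backup container cannot reach the database and the CronJob fails every run.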

basilv commented 3 months ago

@ianliuwk1019 I believe the PVC size not increasing is a known issue that I've run into before. The need to clean up failing jobs is weirder, but I'm fine with that. I'm surprised you even need to delete the CronJob resource itself, did you test whether updates to it can be applied without deleting it?

ianliuwk1019 commented 3 months ago

Hi @basilv , no, I didn't update the resource itself; initially I deleted the failed jobs and the failed pods. My understanding was that the CronJob comes from our template and should be updated according to it, but for some reason the new configuration was not applied to the resource after the new deployment. So Catherine and I deleted the resource and then deployed again without any code change, and this time it picked up the new configuration from our template.
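The cleanup described above can be sketched with `oc` commands like these (the resource names and the job label selector are assumptions; list the actual resources first):

```shell
# See what exists before deleting anything.
oc get cronjob,job,pod

# Remove the failed Jobs and their pods (label is an assumption; it only
# works if the CronJob's job template actually sets such a label).
oc delete job -l cronjob-name=fom-test-database-backups

# Remove the CronJob itself, then re-run the deployment with no code change;
# the CronJob is recreated from the template with the updated
# secret/key/DATABASE_SERVICE_NAME/storage settings.
oc delete cronjob fom-test-database-backups
```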

ianliuwk1019 commented 3 months ago

Update for the TEST backup container run: (see screenshot)

basilv commented 3 months ago

Looks good!

ianliuwk1019 commented 3 months ago

PROD database backup successful (see screenshot); storage 7% used (see screenshot).

ianliuwk1019 commented 3 months ago

Now we have successful backups for all environments, but I am not sure how we can access the backup files. One document I found uses "oc rsync" on the mountPath to copy files to a local machine (see screenshot). It seems this requires a running pod, but the CronJob pod finishes after a few seconds, so the files might not be easy to access. We probably need to create some new deployment config for easy access. I assume ticket #638 will have a more detailed procedure.
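One way around the short-lived CronJob pod is to start a temporary pod that mounts the backup PVC, rsync from it, and delete it afterwards. A sketch (the image, PVC name, and mount path are assumptions):

```shell
# Launch a throwaway pod with the backup PVC mounted at /backups.
oc run backup-browser --image=registry.access.redhat.com/ubi9/ubi \
  --restart=Never --overrides='
{
  "spec": {
    "containers": [{
      "name": "backup-browser",
      "image": "registry.access.redhat.com/ubi9/ubi",
      "command": ["sleep", "3600"],
      "volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
    }],
    "volumes": [{"name": "backups",
                 "persistentVolumeClaim": {"claimName": "fom-prod-database-backups"}}]
  }
}'

# Copy the backup files out, then clean up.
oc rsync backup-browser:/backups ./local-backups
oc delete pod backup-browser
```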

basilv commented 3 months ago

The restore process would involve getting the backup files onto the database server pod and running the restore process there. We don't need (or want) such files locally.
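Assuming a Postgres database and hypothetical pod/file names (none of these are confirmed in this thread), the restore flow Basil describes might look like:

```shell
# Stage a backup file onto the database server pod, then restore it there;
# the backup files never need to land on a workstation long-term.
oc rsync ./restore-staging/ fom-prod-database-0:/tmp/restore/

# Run the restore inside the pod (filename, user, and db name are assumptions).
oc exec fom-prod-database-0 -- bash -c \
  'gunzip -c /tmp/restore/backup.sql.gz | psql -U postgres -d fom'
```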

ianliuwk1019 commented 3 months ago

Hi @basilv , do you want your alert email setup changed from 'gov' to 'cgi'? Last time Catherine checked, she mentioned yours was set to 'gov', so you probably didn't get notified by the container-failure alerts; however, it might be too many emails for you...