Closed: ianliuwk1019 closed this issue 3 months ago
Although we updated the "db" component's openshift.deploy.yml to set the backup storage to 1Gi, on the existing deployment (DEV, PR-39) the storage PVC still shows 256Mi.
Closing and reopening the PR (after deleting/cleaning the same-pr-storage) did not work; it still deploys with 256Mi. Is this a bug in OpenShift?
Closed the PR again and opened a "new" PR from the same branch; this time it deployed as "PR-41" and the storage is now 1Gi.
It is strange that OpenShift somehow knows to link to the previous PVC, for reasons unknown. Although the DEV environment seems fine now, we need to verify whether deploying to TEST will hit the same issue. @basilv @MCatherine1994
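For context, the size lives in the template's PVC object; a minimal sketch of that section, with illustrative names (not the exact openshift.deploy.yml contents):

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: fom-dev-database-backup   # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi   # bumped from 256Mi in the template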
Yes, after deployment to TEST it hits the same issue as in DEV: the storage volume is still 256Mi. The OpenShift console has an option to "Expand PVC"; maybe I will wait for the new TEST backup container to run for the first time tonight and see if the size changes, and if not, try that option.
For this ticket, we had to fix the wrong 'secret', the wrong 'secret key', and the wrong 'DATABASE_SERVICE_NAME', and adjust the backup storage to 1Gi in the OpenShift template.
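As a rough sketch of what was fixed, the values live in the CronJob container's env section of the template (the secret and key names below are placeholders, not the real ones):

```yaml
env:
  - name: DATABASE_SERVICE_NAME
    value: fom-prod-database          # was pointing at the wrong service
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: fom-prod-database       # was the wrong secret
        key: database-password        # was the wrong secret key
```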
However, on OpenShift deployment, we found the CronJob's configuration was not updated (so it was still failing). Before deploying to PROD, we need to:
Delete all currently failing "jobs", "pods", and CronJobs related to **fom-prod-database-backup**
Manually "Expand PVC" for the backup volume currently in use (for unknown reasons, OpenShift will not increase it even though the size has been increased in the template).
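Both cleanup steps can also be done from the CLI; a rough `oc` sketch (resource and label names are assumptions, double-check before running against PROD):

```shell
# Remove the stale CronJob plus its failing jobs/pods (names/labels assumed).
oc delete cronjob fom-prod-database-backup
oc delete jobs,pods -l cronjob-name=fom-prod-database-backup

# CLI equivalent of the console's "Expand PVC": raise the storage request.
# This only succeeds if the storage class has allowVolumeExpansion: true.
oc patch pvc fom-prod-database-backup \
  -p '{"spec":{"resources":{"requests":{"storage":"1Gi"}}}}'
```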
@basilv
@ianliuwk1019 I believe the PVC size not increasing is a known issue that I've run into before. The need to clean up failing jobs is weirder, but I'm fine with that. I'm surprised you even need to delete the CronJob resource itself; did you test whether updates to it can be applied without deleting it?
Hi @basilv, no, I didn't try updating the resource itself; I initially deleted the failed jobs and the failed pods. My understanding was that the CronJob comes from our template and should be updated according to it, but for some reason the resource was not updated after the new deployment. So Catherine and I deleted the resource and then deployed without any code change, and this time it picked up the new configuration from our template.
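For the record, next time we could try letting `oc` update the CronJob in place instead of deleting it; a rough sketch (the template path and parameters are assumptions, not our actual pipeline values):

```shell
# Sketch (untested): re-process the template and apply it, which should
# patch the existing CronJob in place rather than requiring a delete.
oc process -f db/openshift.deploy.yml -p ZONE=prod | oc apply -f -
```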
Update for TEST backup container run:
Looks good!
PROD database backup successful (volume 7% used):
Now we have successful backups for all environments, but I am not sure how we can access the backup files. One document I found suggests using "oc rsync" on the mountPath to copy files locally. This seems to require a running pod, but the CronJob pod finishes within a few seconds, so access might not be easy. We probably need to create a new deployment config for easy access. I assume ticket #638 will have a more detailed procedure.
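One possible approach (a sketch only, pending the real procedure in #638): start a short-lived pod that mounts the backup PVC so `oc rsync` has a running target. All names and paths below are assumptions:

```shell
# Throwaway pod mounting the backup PVC (claim name and mountPath assumed).
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: backup-browser
spec:
  containers:
    - name: shell
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "3600"]
      volumeMounts:
        - name: backups
          mountPath: /backups
  volumes:
    - name: backups
      persistentVolumeClaim:
        claimName: fom-prod-database-backup
EOF

oc rsync backup-browser:/backups ./local-backups   # copy the backup files out
oc delete pod backup-browser                       # clean up afterwards
```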
The restore process would involve getting the backup files onto the database server pod and running the restore there. We don't need (or want) such files locally.
Hi @basilv, do you want your alert email setup changed from 'gov' to 'cgi'? Last time Catherine checked, she mentioned yours was set to 'gov', so you probably weren't notified by the container-failure alert; however, it might be too many emails for you...
Describe the Bug Catherine received alerts from OpenShift every morning about issues on several backup containers. After having a look, the current FOM production (and all other environments') database backups are failing. However, there are two database backups currently running: one from the old setup (named fom-[env]-backups) and one from the OpenShift deployment (fom-[env]-database-backups, see screenshot).
Basil's investigation found that the failing one from the OpenShift deployment has the wrong database secret and the wrong database key.
We need to fix the bug so that, after deployment, the backup container can back up the database again (also see additional context).
Expected Behaviour & Acceptance Criteria
Screenshots: PROD, TEST, DEMO, DEV
Additional context