dlebrero closed this issue 4 years ago
## Make sure that kubectl is pointing to production cluster.
kubectx gke_akvo-lumen_europe-west1-d_production
## Run the migration container
kubectl apply -f ci/k8s/db-migration/db-migration.yml
## Once it is running, copy the migration scripts to the container:
kubectl cp scripts/data/ $(kubectl get pods -l "app=rsr-db-migration" -o jsonpath="{.items[0].metadata.name}"):/tmp -c rsr-db-migration
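The pod-name lookup above is repeated in several of the commands below; as a small convenience it can be captured once in a variable (`POD` is my own helper name, not part of the original runbook):

```shell
# Capture the migration pod name once instead of repeating the jsonpath lookup.
POD=$(kubectl get pods -l "app=rsr-db-migration" -o jsonpath="{.items[0].metadata.name}")

# Same copy as above, using the variable:
kubectl cp scripts/data/ "$POD":/tmp -c rsr-db-migration
```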
## Run the migration:
## Stop RSR and ReportServer.
kubectl scale --replicas=0 deployment/rsr
kubectl scale --replicas=0 deployment/reportserver
## Exec into the migration container and run the scripts:
kubectl exec -it $(kubectl get pods -l "app=rsr-db-migration" -o jsonpath="{.items[0].metadata.name}") -c rsr-db-migration -- bash
cd /tmp/data
bash make_dump.sh
bash restore_from_dump.sh
bash make-dump-report-server.sh
bash restore-from-dump-report-server.sh
If everything goes well, merge https://github.com/akvo/akvo-config/pull/163 and wait for CircleCI to finish the build.
## Once done, remove the migration container:
kubectl delete -f ci/k8s/db-migration/db-migration.yml
Unexpected issues found:
The ReportServer dump did not restore correctly due to a duplicate key on large objects: Postgres stores large objects outside the schema, and this was the second time we restored from a dump containing large objects. We opted to make another dump without large objects.
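A dump that skips large objects can be produced with pg_dump's `--no-blobs` flag; a sketch, where the host, user, and database names are placeholders rather than the actual values used:

```shell
# Sketch only: host, user and database names are placeholders.
# --no-blobs (PostgreSQL 10+) omits large objects from the dump, avoiding
# the duplicate-key errors on pg_largeobject seen during restore.
pg_dump --format=custom --no-blobs \
  --host "$REPORTSERVER_DB_HOST" --username "$REPORTSERVER_DB_USER" \
  reportserver > /tmp/data/reportserver-no-lo.dump
```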
The ReportServer read-only user was not able to read any RSR table. We tried reapplying the user's permissions, but it still did not work, so we opted to create a new read-only user.
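Creating a fresh read-only user could look like the following sketch (the role name, password, database, and schema here are assumptions, not the actual values used):

```shell
# Sketch only: role name, password, database and schema are placeholders.
psql --host "$RSR_DB_HOST" --username "$RSR_DB_ADMIN" rsr <<'SQL'
CREATE ROLE reportserver_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE rsr TO reportserver_ro;
GRANT USAGE ON SCHEMA public TO reportserver_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reportserver_ro;
-- Make sure tables created later are readable too:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reportserver_ro;
SQL
```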
The last RSR outage was triggered (though not caused) by RSR failing to respond to health checks, which led Kubernetes to kill the backend container.
Part of the reason RSR failed the health checks was that the DB was at 100% utilization, which slowed RSR down enough to miss them.
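One mitigation would be to make the liveness probe more tolerant, so a temporarily slow DB does not immediately get the backend killed. A sketch (the container index and probe values here are assumptions, not the deployment's real settings):

```shell
# Sketch only: container index and probe values are assumptions.
# Give RSR more time per check and more failed checks before k8s restarts it.
kubectl patch deployment/rsr --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 5}
]'
```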
Instead of getting a bigger ElephantSQL plan, move to Google Cloud SQL, as it gives us more control over the DB server specs. This should also cut the DB cost in half.
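A Cloud SQL instance in the same region as the cluster could be created along these lines (the instance name, tier, and Postgres version are assumptions):

```shell
# Sketch only: instance name, tier and version are placeholders.
# db-custom-VCPU-RAM tiers let us pick the exact spec, unlike fixed ElephantSQL plans.
gcloud sql instances create rsr-db \
  --database-version=POSTGRES_9_6 \
  --tier=db-custom-2-7680 \
  --region=europe-west1
```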