dlebrero closed this issue 4 years ago
## Make sure that kubectl is pointing to production cluster.
kubectx gke_akvo-lumen_europe-west1-d_production
## Run the migration container
kubectl apply -f ci/k8s/db-migration/db-migration.yml
## Once it is running, copy the migration scripts to the container:
kubectl cp scripts/data/ $(kubectl get pods -l "app=rsr-db-migration" -o jsonpath="{.items[0].metadata.name}"):/tmp -c rsr-db-migration
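The pod-name lookup above is repeated in several of the commands below; as a small convenience it can be captured once in a variable (`POD` is my own helper name, not part of the original runbook):

```shell
# Capture the migration pod name once instead of repeating the jsonpath lookup.
POD=$(kubectl get pods -l "app=rsr-db-migration" -o jsonpath="{.items[0].metadata.name}")

# Same copy as above, using the variable:
kubectl cp scripts/data/ "$POD":/tmp -c rsr-db-migration
```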
## Run the migration:
## Stop RSR and ReportServer.
kubectl scale --replicas=0 deployment/rsr
kubectl scale --replicas=0 deployment/reportserver
## Exec into the migration container and run the scripts:
kubectl exec -it $(kubectl get pods -l "app=rsr-db-migration" -o jsonpath="{.items[0].metadata.name}") -c rsr-db-migration -- bash
cd /tmp/data
bash make_dump.sh
bash restore_from_dump.sh
bash make-dump-report-server.sh
bash restore-from-dump-report-server.sh
If everything goes well, merge https://github.com/akvo/akvo-config/pull/163 and wait for CircleCI to finish the build.
## Once done, remove the migration container:
kubectl delete -f ci/k8s/db-migration/db-migration.yml
Unexpected issues found:
The ReportServer dump did not restore correctly due to a duplicate key on large objects: Postgres stores large objects outside the schema, and this was the second time we restored from a dump containing large objects. We opted to make another dump without large objects.
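A dump that skips large objects can be produced with pg_dump's `--no-blobs` flag; a sketch, where the host, user, and database names are placeholders rather than the actual values used:

```shell
# Sketch only: host, user and database names are placeholders.
# --no-blobs (PostgreSQL 10+) omits large objects from the dump, avoiding
# the duplicate-key errors on pg_largeobject seen during restore.
pg_dump --format=custom --no-blobs \
  --host "$REPORTSERVER_DB_HOST" --username "$REPORTSERVER_DB_USER" \
  reportserver > /tmp/data/reportserver-no-lo.dump
```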
The ReportServer read-only user was not able to read any RSR table. We tried reapplying the user's permissions, but it still did not work, so we opted to create a new read-only user.
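Creating a fresh read-only user could look like the following sketch (the role name, password, database, and schema here are assumptions, not the actual values used):

```shell
# Sketch only: role name, password, database and schema are placeholders.
psql --host "$RSR_DB_HOST" --username "$RSR_DB_ADMIN" rsr <<'SQL'
CREATE ROLE reportserver_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE rsr TO reportserver_ro;
GRANT USAGE ON SCHEMA public TO reportserver_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reportserver_ro;
-- Make sure tables created later are readable too:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reportserver_ro;
SQL
```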
The last RSR outage was triggered (though not caused) by RSR failing to respond to health checks, which led Kubernetes to kill the backend container.
Part of the reason RSR failed the health checks was that the DB was at 100% utilization, which slowed RSR down enough to miss them.
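One mitigation would be to make the liveness probe more tolerant, so a temporarily slow DB does not immediately get the backend killed. A sketch (the container index and probe values here are assumptions, not the deployment's real settings):

```shell
# Sketch only: container index and probe values are assumptions.
# Give RSR more time per check and more failed checks before k8s restarts it.
kubectl patch deployment/rsr --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 5}
]'
```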
Instead of getting a bigger ElephantSQL plan, move to Google Cloud SQL, as it gives us more control over the DB server specs. This should also cut the DB cost in half.
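A Cloud SQL instance in the same region as the cluster could be created along these lines (the instance name, tier, and Postgres version are assumptions):

```shell
# Sketch only: instance name, tier and version are placeholders.
# db-custom-VCPU-RAM tiers let us pick the exact spec, unlike fixed ElephantSQL plans.
gcloud sql instances create rsr-db \
  --database-version=POSTGRES_9_6 \
  --tier=db-custom-2-7680 \
  --region=europe-west1
```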