Closed krisstanton closed 4 months ago
WIP Update: I've got the Postgres Engine Version (Database) Upgrade steps worked out. Here are the steps listed, and then a ton of raw notes I was taking while working these steps out.
// Note: these steps have been added to the ticket description up top.
(1) Backup the DB: Create Snapshots of Current DB State (to ensure we have a backup)
(2) Create a clone of the Database (to be a dry run upgrade)
(3) Copy the Current Cluster Parameter Group
(4) Do Manual DB Changes (switching engine version from 11.21 to 13.12)
(5) Do RDS Cluster Code Changes
(6) Do a full Deploy (which includes the RDS Cluster Code Changes)
(7) Run a SmokeTest (Verify that the test and ORCA work)
// Raw Notes Detail for the above steps.
(1) Backup the DB: Create Snapshots of Current DB State
-Sandbox Account csda-cumulus-sbx-7894
Reference:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
-Taking the Snapshot
-Navigate to the AWS RDS Interface page
https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#databases:
Click Snapshot
Click Take Snapshot
Select: DB Cluster
Select: cumulus-kris-sbx7894-rds-serverless
Snapshot Name: Pre-DB-upgrade-11-21-to-13-kris
Wait while the snapshot is created (this takes a little while).
Did this for all 3 sandboxes
Pre-DB-upgrade-11-21-to-13 // Should have also put a 'kris' on this name
Pre-DB-upgrade-11-21-to-13-jayanthi
Pre-DB-upgrade-11-21-to-13-chuckwondo
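// CLI sketch for step (1). The notes above used the AWS Console; this is an untested-against-the-sandbox equivalent using the AWS CLI, assuming credentials and region are already configured (for example inside the `make bash` shell). The helper name `snapshot_cluster` is made up for illustration.

```shell
# Hypothetical helper: take a cluster snapshot and wait until it is usable.
snapshot_cluster() {
  local cluster_id="$1" snapshot_id="$2"

  # Same action as Console: Snapshots > Take snapshot > DB Cluster.
  aws rds create-db-cluster-snapshot \
    --db-cluster-identifier "$cluster_id" \
    --db-cluster-snapshot-identifier "$snapshot_id" || return 1

  # Blocks until the snapshot status is "available".
  aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier "$snapshot_id"
}
```

Usage with the names from the notes: snapshot_cluster cumulus-kris-sbx7894-rds-serverless Pre-DB-upgrade-11-21-to-13-kris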
(2) Create a clone of the Database (to be a dry run upgrade)
Make a Clone of the Database (so we can do a switch deployment)
cumulus-kris-sbx7894-rds-serverless
cumulus-kris-sbx7894-rds-serverless-clone
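// CLI sketch for step (2). The clone above was made in the Console. On the CLI, Aurora clones are created with a copy-on-write restore. This has not been verified against these serverless sandboxes; extra options (such as --engine-mode) may be needed there. `clone_cluster` is a hypothetical helper name.

```shell
# Hypothetical helper: create a copy-on-write clone of an Aurora cluster.
clone_cluster() {
  local source_id="$1" clone_id="$2"

  aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier "$source_id" \
    --db-cluster-identifier "$clone_id" \
    --restore-type copy-on-write \
    --use-latest-restorable-time
}
```

Usage with the names from the notes: clone_cluster cumulus-kris-sbx7894-rds-serverless cumulus-kris-sbx7894-rds-serverless-clone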
(3) Copy the Current Cluster Parameter Group
// NOTE: At the end of the DB upgrade, your copy will no longer be relevant because the deploy overrides the manually created groups. It can still be used as a reference if needed.
// Note: I did both: the AWS Console copy and the command line below (to get a JSON output of the entire record).
Easiest way is to do this via AWS Console where you just make a copy and give it a slightly different name.
Make a copy of the current Parameter group for version 11 (this is another form of backup)
To see all the params for postgres 13: "aurora-postgresql13"
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Reference.ParameterGroups.html
DOTENV=.env.sandbox make bash
aws rds describe-db-cluster-parameters --db-cluster-parameter-group-name "cumulus-kris-sbx7894-cluster-parameter-group" > pg_v11__cumulus-kris-sbx7894-cluster-parameter-group.json
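// CLI sketch for step (3). This combines both backups (the group copy and the JSON dump) in one helper. The copy was done in the Console in the notes above, so the `copy-db-cluster-parameter-group` call is an assumed equivalent. `backup_param_group` and the copy name are hypothetical.

```shell
# Hypothetical helper: copy a cluster parameter group and dump it to JSON.
backup_param_group() {
  local group="$1" copy_name="$2" out_file="$3"

  # Keep a copy of the group in AWS under a slightly different name.
  aws rds copy-db-cluster-parameter-group \
    --source-db-cluster-parameter-group-identifier "$group" \
    --target-db-cluster-parameter-group-identifier "$copy_name" \
    --target-db-cluster-parameter-group-description "Backup of $group before upgrade" \
    || return 1

  # Write the full record locally. ">" overwrites, so rerunning the
  # command still leaves a single valid JSON document in the file.
  aws rds describe-db-cluster-parameters \
    --db-cluster-parameter-group-name "$group" > "$out_file"
}
```

Usage: backup_param_group cumulus-kris-sbx7894-cluster-parameter-group cumulus-kris-sbx7894-cluster-parameter-group-backup pg_v11__cumulus-kris-sbx7894-cluster-parameter-group.json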
(4) Do Manual DB Changes (switching engine version from 11.21 to 13.12)
Verification BEFORE any changes
DOTENV=.env.sandbox make bash
cumulus version
cumulus stats summary
cumulus stats count
(4a) Creating a new "cluster parameter group"
// Example
https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#create-parameter-group:
Parameter Group Name: cumulus-kris-sbx7894-cluster-parameter-group-13
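// CLI sketch for step (4a). The group above was created on the Console page. An assumed CLI equivalent is below; the family name "aurora-postgresql13" comes from step (3), the description text and the helper name `create_pg13_param_group` are made up.

```shell
# Hypothetical helper: create an empty cluster parameter group for PG 13.
create_pg13_param_group() {
  local group_name="$1"

  aws rds create-db-cluster-parameter-group \
    --db-cluster-parameter-group-name "$group_name" \
    --db-parameter-group-family aurora-postgresql13 \
    --description "Cluster parameter group for Aurora PostgreSQL 13"
}
```

Usage: create_pg13_param_group cumulus-kris-sbx7894-cluster-parameter-group-13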
Do a compare of the 2 groups from version 11 to 13 (to see the differences)
Make note of these specific fields (these were different in the sandbox version)
// These are the values found in the version 11 that may need to be set in the version 13
max_replication_slots: 10
rds.force_autovacuum_logging_level: INFO
shared_preload_libraries: pg_stat_statements,auto_explain
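// Sketch of the version 11 vs 13 compare. The notes do not say how the compare was done; one possible way is to list each group as sorted "name value" lines and diff them. Both helper names are hypothetical. Expect a lot of noise, since parameters differ between engine families.

```shell
# Hypothetical helper: print "name<TAB>value" for every parameter, sorted.
list_params() {
  aws rds describe-db-cluster-parameters \
    --db-cluster-parameter-group-name "$1" \
    --query 'Parameters[].[ParameterName,ParameterValue]' \
    --output text | sort
}

# Hypothetical helper: diff two parameter groups.
# Exit status follows diff: 0 = identical, 1 = differences found.
diff_param_groups() {
  local old_file new_file status
  old_file="$(mktemp)"
  new_file="$(mktemp)"

  list_params "$1" > "$old_file"
  list_params "$2" > "$new_file"

  diff "$old_file" "$new_file"
  status=$?

  rm -f "$old_file" "$new_file"
  return "$status"
}
```

Usage: diff_param_groups cumulus-kris-sbx7894-cluster-parameter-group cumulus-kris-sbx7894-cluster-parameter-group-13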
(4b) Upgrading the Clone (to ensure there were no errors)
https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#databases:
Select: cumulus-kris-sbx7894-rds-serverless-clone
Click Modify
Change Engine Version
FROM "Aurora PostgreSQL (compatible with PostgreSQL 11.21) - default for major version 11"
TO "Aurora PostgreSQL (compatible with PostgreSQL 13.12) - default for major version 13"
Set the new Parameter Group
Set to: cumulus-kris-sbx7894-cluster-parameter-group-13
Change other settings
Additional changes for the modification round
-Setting: "Deletion Protection" --- this needs to be set to "Enabled" (Setting is near the very bottom on the Modify page)
-Setting: "Force scaling the capacity to the specified values when the timeout is reached" --- this needs to be set to "Enabled" (Setting is under the "Additional scaling configuration" Settings -- It's a radio button)
Click Continue
Select "Apply immediately"
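// CLI sketch for step (4b). The Modify page settings above map roughly onto one `modify-db-cluster` call. This was not run as part of these notes. The --scaling-configuration line is my reading of the "Force scaling the capacity..." radio button and only applies to Aurora Serverless v1 clusters. `upgrade_cluster` is a hypothetical helper name.

```shell
# Hypothetical helper: major version upgrade with the settings from (4b).
upgrade_cluster() {
  local cluster_id="$1" param_group="$2" engine_version="${3:-13.12}"

  aws rds modify-db-cluster \
    --db-cluster-identifier "$cluster_id" \
    --engine-version "$engine_version" \
    --allow-major-version-upgrade \
    --db-cluster-parameter-group-name "$param_group" \
    --deletion-protection \
    --scaling-configuration TimeoutAction=ForceApplyCapacityChange \
    --apply-immediately
}
```

Usage (clone first, as a dry run): upgrade_cluster cumulus-kris-sbx7894-rds-serverless-clone cumulus-kris-sbx7894-cluster-parameter-group-13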
(4c) Wait for the Clone to finish upgrading, then upgrade the original DB in a similar way
Upgrade the 'real' Sandbox DB (to Engine Version 13.12)
https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#databases:
Select the Database (radio Button)
-Click Modify
-Make the changes as described above under the clone changes
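// Sketch for the "wait for the clone to finish" part of (4c). In the notes this was watched in the Console. A simple polling loop over `describe-db-clusters` is one way to script it; the helper name `wait_for_cluster` and the 30 second default interval are arbitrary choices.

```shell
# Hypothetical helper: poll until the cluster status is "available".
wait_for_cluster() {
  local cluster_id="$1" interval="${2:-30}" status

  while :; do
    status="$(aws rds describe-db-clusters \
      --db-cluster-identifier "$cluster_id" \
      --query 'DBClusters[0].Status' \
      --output text)" || return 1

    echo "$cluster_id: $status"
    [ "$status" = "available" ] && return 0
    sleep "$interval"
  done
}
```

Usage: wait_for_cluster cumulus-kris-sbx7894-rds-serverless-clone, then start the upgrade of the real sandbox DB once it returns.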
(4d) Running Verification (API Calls)
DOTENV=.env.sandbox make bash
cumulus version
cumulus stats summary
cumulus stats count
DB 13 is working so far.
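// Sketch for comparing the BEFORE and AFTER verification runs. The notes ran the three cumulus commands by hand. Saving the output to a file on each side of the upgrade makes the comparison easier. This assumes it is run inside the `make bash` shell where `cumulus` is available; `capture_cumulus_state` is a hypothetical helper name.

```shell
# Hypothetical helper: save the output of the verification commands.
capture_cumulus_state() {
  local out_file="$1"

  {
    echo "## cumulus version"
    cumulus version
    echo "## cumulus stats summary"
    cumulus stats summary
    echo "## cumulus stats count"
    cumulus stats count
  } > "$out_file" 2>&1
}
```

Usage: capture_cumulus_state before.txt prior to step (4), capture_cumulus_state after.txt once the upgrade is done, then diff before.txt after.txt. Stats can change on their own if ingests ran in between, so differences are a prompt to look closer, not proof of a problem.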
(4e) Running the Smoke test to see if everything still works
DOTENV=.env.sandbox make bash
cumulus rules enable --name PSScene3Band___1_SmokeTest
cumulus rules run --name PSScene3Band___1_SmokeTest
On Sandbox Account (7894), the Smoke test worked
On DR UAT Account (6741), checking now
https://us-west-2.console.aws.amazon.com/s3/buckets/csda-cumulus-cba-uat-orca-archive?region=us-west-2&bucketType=general&prefix=planet/PSScene3Band/&showversions=false
ORCA still works
(5) Do RDS Cluster Code Changes // Reference: https://github.com/NASA-IMPACT/csdap-cumulus/commit/0710d25997d0bce2457857fd2b814eef19f85057
(6) Do RDS Cluster Deploy
Deployment
DOTENV=.env.sandbox make all-init
DOTENV=.env.sandbox make all-up-yes
DOTENV=.env.sandbox make bash
(7) Do a smoke test
DOTENV=.env.sandbox make bash
cumulus rules enable --name PSScene3Band___1_SmokeTest
cumulus rules run --name PSScene3Band___1_SmokeTest
Appendix for Step 6 - Hit a rough patch with the deployment: the same problem I had with the last upgrade deployment. I made some code changes to bypass the yarn linter and unit test steps. The terraform output looked fine and the normal smoke test worked. (See the future Pull Request for this task for the exact code changes.)
There is an extra step for Sandbox deployments (AFTER the DB changes and a successful 18.1.0 deployment)
// Note: This change can ONLY be made manually AFTER a version 18.1.0 deployment that has the Postgres 13 manual upgrade completed.
// Note: This is the manual step that has to happen before a version 18.2.0 deployment will work.
-Go into the Sandbox Server's AWS Dashboard
-Go to region us-west-2 and then to RDS
-Click on databases
-Find and select your database that is currently in operation (not the clone)
-Click Modify
-Near the bottom,
Change "DB cluster Parameter group"
from: cumulus-kris-sbx7894-cluster-parameter-group // EXAMPLE
to: cumulus-kris-sbx7894-cluster-parameter-group-13 // EXAMPLE (Choose the one that has a '-13' at the end)
-Click Continue
-Select "Apply immediately"
-This should be a very fast change.
-After this is done, you should be able to successfully deploy version 18.2.0 to the sandbox (Note, I had a successful smoke test after doing this as well)
-After Version 18.2.0 is deployed, you can verify this worked by checking the configuration on the database. The new parameter group should be something like this: cumulus-kris-sbx7894-cluster-parameter-group-v13 (with an added 'v' in it). This cluster parameter group should also be managed by terraform.
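// CLI sketch for the extra Sandbox step. The parameter group switch above was done in the Console; the calls below are an assumed equivalent, plus a check of which group the cluster is currently using. Both helper names are hypothetical.

```shell
# Hypothetical helper: point the cluster at a different parameter group.
switch_param_group() {
  local cluster_id="$1" param_group="$2"

  aws rds modify-db-cluster \
    --db-cluster-identifier "$cluster_id" \
    --db-cluster-parameter-group-name "$param_group" \
    --apply-immediately
}

# Hypothetical helper: print the parameter group the cluster uses now.
current_param_group() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterParameterGroup' \
    --output text
}
```

Usage: switch_param_group cumulus-kris-sbx7894-rds-serverless cumulus-kris-sbx7894-cluster-parameter-group-13 before the 18.2.0 deploy. After the deploy, current_param_group cumulus-kris-sbx7894-rds-serverless should print the terraform-managed name ending in '-v13'.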
WIP Update: Running into a problem with GitHub's UAT deployment.
There is an error when GitHub specifically tries to do a UAT deployment. The error is:
terraspace plan cumulus: Error: error archiving directory: could not archive missing directory: /home/runner/work/csdap-cumulus/csdap-cumulus/build/main
Error running: terraspace plan cumulus. Fix the error above or check logs for the error.
Error: Process completed with exit code 2.
Note: We had two sets of successful sandbox deployments.
This version of Cumulus requires a database upgrade - so split this ticket out from https://github.com/NASA-IMPACT/csdap-cumulus/issues/355
[x] Complete the ORCA upgrade first - Ticket: https://github.com/NASA-IMPACT/csdap-cumulus/issues/355
[x] Work Method out with Sandbox
[x] Perform upgrade via sandbox
DOTENV=.env.sandbox make all-init
DOTENV=.env.sandbox make all-up-yes
DOTENV=.env.sandbox make bash
DOTENV=.env.sandbox make bash
cumulus rules enable --name PSScene3Band___1_SmokeTest
cumulus rules run --name PSScene3Band___1_SmokeTest
[x] Create the First Pull Request (to push DB Upgrades to UAT and then PROD)
[ ] Continue with Cumulus Upgrade (New Branch, Code Changes, version switch and Deploy)
// Older, but still valid Checklist
Upgrade Steps (After any Code and/or Migration Changes)
make all-init
make all-up-yes
Current Cumulus and Orca Version information
References Current Cumulus Version: v18.1.0
Cumulus Upgrade Research Reference
Link to the last Cumulus upgrade ticket