NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Upgrade Cumulus from v18.1.0 to v18.2.0 #358

Closed krisstanton closed 4 months ago

krisstanton commented 5 months ago

This version of Cumulus requires a database upgrade, so this ticket was split out from https://github.com/NASA-IMPACT/csdap-cumulus/issues/355

// Older, but still valid Checklist

Upgrade Steps (After any Code and/or Migration Changes)

Current Cumulus and Orca Version information

- Cumulus Version:     v18.1.0  // Verified by: https://github.com/NASA-IMPACT/csdap-cumulus/blob/main/config/helpers/cumulus_version_helper.rb
- ORCA Version:         v8.1.0  // Verified by: https://github.com/NASA-IMPACT/csdap-cumulus/blob/3153f8f00194f8c5c3f658a938b9963bf1c55440/app/stacks/cumulus/orca.tf#L18   // source = "https://github.com/nasa/cumulus-orca/releases/download/v8.1.0/cumulus-orca-terraform.zip"
- Terraform Version:     1.5.7  // Verified by: https://github.com/NASA-IMPACT/csdap-cumulus/blob/3153f8f00194f8c5c3f658a938b9963bf1c55440/.terraform-version#L1
- PostgreSQL Version:   11.21   // Verified by: Checking CBA PROD Account: 5047 RDS DB Clusters --> Configuration --> Engine Version:   // https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#database:id=cumulus-prod-rds-serverless;is-cluster=true;tab=configuration

References Current Cumulus Version: v18.1.0

Cumulus Upgrade Research Reference

Link to the last Cumulus upgrade ticket

krisstanton commented 4 months ago

WIP Update: I've got the Postgres Engine Version (Database) Upgrade steps worked out. Here are the steps listed, and then a ton of raw notes I was taking while working these steps out.

// Note: these steps have been added to the ticket description up top.
(1) Backup the DB: Create Snapshots of Current DB State (to ensure we have backup)
(2) Create a clone of the Database (to be a dry run upgrade)
(3) Copy the Current Cluster Parameter Group
(4) Do Manual DB Changes (switching engine version from 11.21 to 13.12)
(5) Do RDS Cluster Code Changes
(6) Do a full Deploy (which includes the RDS Cluster Code Changes)
(7) Run a SmokeTest (Verify that the test and ORCA works)

// Raw Notes Detail for the above steps.

(1) Backup the DB: Create Snapshots of Current DB State (to ensure we have backup)

-Sandbox Account    csda-cumulus-sbx-7894
Reference:
    https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
-Taking the Snapshot
    -Navigate to the AWS RDS Interface page
        https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#databases:
            Click Snapshot
                Click Take Snapshot
                    Select:             DB Cluster
                    Select:             cumulus-kris-sbx7894-rds-serverless
                    Snapshot Name:      Pre-DB-upgrade-11-21-to-13-kris
                Wait while the snapshot is created.
    Did this for all 3 sandboxes
        Pre-DB-upgrade-11-21-to-13   // Should have also put a 'kris' on this name
        Pre-DB-upgrade-11-21-to-13-jayanthi
        Pre-DB-upgrade-11-21-to-13-chuckwondo
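
The console steps above have an AWS CLI equivalent. The following is a minimal sketch, not what was actually run for this ticket. It assumes it is run from the `DOTENV=.env.sandbox make bash` shell with credentials for the sandbox account, and the function name `take_cluster_snapshot` is made up for illustration.

```shell
# Hypothetical helper (not part of the repo): snapshot an Aurora cluster and wait for it.
# $1 = DB cluster identifier, $2 = snapshot identifier
take_cluster_snapshot() {
  local cluster_id="$1" snapshot_id="$2"

  # Start the cluster snapshot (same as "Take Snapshot" with "DB Cluster" selected).
  aws rds create-db-cluster-snapshot \
    --db-cluster-identifier "$cluster_id" \
    --db-cluster-snapshot-identifier "$snapshot_id"

  # Block until the snapshot status is "available".
  aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier "$snapshot_id"
}

# Example, using the names from the notes above:
# take_cluster_snapshot cumulus-kris-sbx7894-rds-serverless Pre-DB-upgrade-11-21-to-13-kris
```

Passing the owner's name in the snapshot identifier every time avoids the unlabeled `Pre-DB-upgrade-11-21-to-13` name noted above.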

(2) Create a clone of the Database (to be a dry run upgrade)

Make a Clone of the Database (so we can do a switch deployment)
    Source: cumulus-kris-sbx7894-rds-serverless
    Clone:  cumulus-kris-sbx7894-rds-serverless-clone

(3) Copy the Current Cluster Parameter Group

// NOTE: At the end of the DB Upgrade, your copy will no longer be relevant because the deploy will override the manually created parameter groups. It can still be used as a reference if needed.
// Note: I did both: the AWS Console copy and the command line below (to get a JSON output of the entire record).
The easiest way is via the AWS Console, where you just make a copy and give it a slightly different name.
Make a copy of the current Parameter group for version 11 (this is another form of backup)
    To see all the params for postgres 13 (family "aurora-postgresql13"):
        https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Reference.ParameterGroups.html

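    DOTENV=.env.sandbox make bash
    # Use '>' rather than '>>' so that re-running overwrites the file instead of appending a second JSON document
    aws rds describe-db-cluster-parameters --db-cluster-parameter-group-name "cumulus-kris-sbx7894-cluster-parameter-group" > pg_v11__cumulus-kris-sbx7894-cluster-parameter-group.json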
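
The console copy mentioned above can also be done from the command line. This is a sketch under assumptions: the function name and the `-backup` suffix in the example are hypothetical, and the notes do not say what the console copy was actually named.

```shell
# Hypothetical helper: copy a cluster parameter group under a new name as a backup.
# $1 = existing group name, $2 = name for the copy
backup_cluster_parameter_group() {
  local source_group="$1" target_group="$2"

  aws rds copy-db-cluster-parameter-group \
    --source-db-cluster-parameter-group-identifier "$source_group" \
    --target-db-cluster-parameter-group-identifier "$target_group" \
    --target-db-cluster-parameter-group-description "Backup copy of $source_group before the Postgres 13 upgrade"
}

# Example (the "-backup" suffix is an assumption, any unused name works):
# backup_cluster_parameter_group \
#   cumulus-kris-sbx7894-cluster-parameter-group \
#   cumulus-kris-sbx7894-cluster-parameter-group-backup
```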

(4) Do Manual DB Changes (switching engine version from 11.21 to 13.12)

Verification BEFORE any changes
    DOTENV=.env.sandbox make bash
    cumulus version
    cumulus stats summary
    cumulus stats count

  (4a) Creating a new "cluster parameter group"
     // Example
     https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#create-parameter-group:
       Parameter Group Name: cumulus-kris-sbx7894-cluster-parameter-group-13
     Do a compare of the 2 groups from version 11 to 13 (to see the differences)
     Make note of these specific fields (these were different in the sandbox version)
       // These are the values found in the version 11 group that may need to be set in the version 13 group
       max_replication_slots:                10
       rds.force_autovacuum_logging_level:   INFO
       shared_preload_libraries:             pg_stat_statements,auto_explain
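
The comparison of the two groups can be scripted instead of eyeballed in the console. The sketch below is one possible approach, not the method used in this ticket: `dump_params` and `diff_params` are hypothetical names, and the comparison only reports parameters that exist in both groups with different values.

```shell
# Hypothetical helper: dump a cluster parameter group as "name<TAB>value" lines.
# $1 = cluster parameter group name
dump_params() {
  aws rds describe-db-cluster-parameters \
    --db-cluster-parameter-group-name "$1" \
    --query 'Parameters[].[ParameterName, ParameterValue]' \
    --output text
}

# Hypothetical helper: print "name<TAB>old<TAB>new" for every parameter whose value
# differs between two dump files. Parameters present in only one file are skipped.
# $1 = version 11 dump, $2 = version 13 dump
diff_params() {
  awk -F'\t' '
    NR == FNR { old[$1] = $2; next }   # first file: remember the version 11 values
    ($1 in old) && old[$1] != $2 {     # second file: report values that changed
      printf "%s\t%s\t%s\n", $1, old[$1], $2
    }
  ' "$1" "$2"
}

# Example:
# dump_params cumulus-kris-sbx7894-cluster-parameter-group    > pg11.tsv
# dump_params cumulus-kris-sbx7894-cluster-parameter-group-13 > pg13.tsv
# diff_params pg11.tsv pg13.tsv
```

Unset values appear as `None` in the text output, so a parameter that is set in one group and unset in the other still shows up as a difference.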

  (4b) Upgrading the Clone (to ensure there were no errors)
     https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#databases:
    Select:     cumulus-kris-sbx7894-rds-serverless-clone
    Click Modify 
        Change Engine Version 
            FROM    "Aurora PostgreSQL (compatible with PostgreSQL 11.21) - default for major version 11"
            TO      "Aurora PostgreSQL (compatible with PostgreSQL 13.12) - default for major version 13"

        Set the new Parameter Group
            Set to:     cumulus-kris-sbx7894-cluster-parameter-group-13

        Change other settings
            Additional changes for the modification round
                -Setting: "Deletion Protection" --- this needs to be set to "Enabled"  (Setting is near the very bottom on the Modify page)
                -Setting: "Force scaling the capacity to the specified values when the timeout is reached" --- this needs to be set to "Enabled" (Setting is under the "Additional scaling configuration" Settings -- It's a radio button)

        Click Continue
            Select, "Apply immediately"
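
For reference, the same modification can be expressed with the AWS CLI. This is a hedged sketch only: the upgrade in this ticket was done through the console, the function names are hypothetical, and the "Force scaling the capacity" setting is left out here and would still need to be set as described above.

```shell
# Hypothetical helper: in-place major version upgrade of an Aurora PostgreSQL cluster.
# $1 = DB cluster identifier, $2 = version 13 cluster parameter group name
upgrade_cluster_to_pg13() {
  local cluster_id="$1" param_group="$2"

  aws rds modify-db-cluster \
    --db-cluster-identifier "$cluster_id" \
    --engine-version 13.12 \
    --allow-major-version-upgrade \
    --db-cluster-parameter-group-name "$param_group" \
    --deletion-protection \
    --apply-immediately
}

# Hypothetical helper: print the cluster status; re-run until it reports "available".
# $1 = DB cluster identifier
cluster_status() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].Status' \
    --output text
}

# Example (clone first, as in the notes):
# upgrade_cluster_to_pg13 cumulus-kris-sbx7894-rds-serverless-clone cumulus-kris-sbx7894-cluster-parameter-group-13
# cluster_status cumulus-kris-sbx7894-rds-serverless-clone
```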

  (4c) Wait for the Clone to finish upgrading, then upgrade the original DB in a similar way
    Upgrade the 'real' Sandbox DB  (to Engine Version 13.12)
          https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#databases:
        Select the Database (radio Button)
            -Click Modify
                -Make the changes as described above under the clone changes

  (4d) Running Verification (API Calls)
      DOTENV=.env.sandbox make bash
         cumulus version
         cumulus stats summary
         cumulus stats count
             DB 13 is working so far.
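
To compare the verification output from before and after the upgrade, the three commands can be captured to a file. A minimal sketch, assuming it runs inside the `make bash` shell where the `cumulus` command is available; the helper name and file naming are made up for illustration.

```shell
# Hypothetical helper: save the output of the verification commands to a file.
# $1 = label, for example "before" or "after"
capture_cumulus_state() {
  local out_file="cumulus-state-$1.txt"
  {
    echo "== cumulus version";       cumulus version
    echo "== cumulus stats summary"; cumulus stats summary
    echo "== cumulus stats count";   cumulus stats count
  } > "$out_file"
  # Print the file name so the caller knows where the output went.
  echo "$out_file"
}

# Example:
# capture_cumulus_state before    # run before any DB changes
# capture_cumulus_state after     # run after the engine upgrade
# diff cumulus-state-before.txt cumulus-state-after.txt
```

Some differences may be expected if the stats change between the two runs, so the diff is a prompt for a closer look rather than a pass/fail check.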

  (4e) Running the Smoke test to see if everything still works
    DOTENV=.env.sandbox make bash
        cumulus rules enable --name PSScene3Band___1_SmokeTest
        cumulus rules run --name PSScene3Band___1_SmokeTest

        On Sandbox Account  (7894), the Smoke test worked
        On DR UAT Account   (6741), checking now
            https://us-west-2.console.aws.amazon.com/s3/buckets/csda-cumulus-cba-uat-orca-archive?region=us-west-2&bucketType=general&prefix=planet/PSScene3Band/&showversions=false
            ORCA still works

(5) Do RDS Cluster Code Changes // Reference: https://github.com/NASA-IMPACT/csdap-cumulus/commit/0710d25997d0bce2457857fd2b814eef19f85057

(6) Do RDS Cluster Deploy

Deployment
    DOTENV=.env.sandbox make all-init
    DOTENV=.env.sandbox make all-up-yes
    DOTENV=.env.sandbox make bash

(7) Do a smoke test

    DOTENV=.env.sandbox make bash
    cumulus rules enable --name PSScene3Band___1_SmokeTest
    cumulus rules run --name PSScene3Band___1_SmokeTest

Appendix for Step 6: Hit a rough patch with the Deployment, the same problem I had with the last upgrade deployment. I made some code changes to bypass some of the yarn steps for the linter and unit tests. Everything from terraform looked fine and the normal smoke test worked. (See future Pull Request for this task to see the exact code changes)

krisstanton commented 4 months ago

There is an extra step for Sandbox deployments (AFTER DB Changes - And successful 18.1.0 deployment)

// Note: This change can ONLY be made manually AFTER a version 18.1.0 deployment that has the Postgres13 manual upgrade completed.
// Note: This is the manual step that has to happen before a version 18.2.0 deployment will work

-Go into the Sandbox Server's AWS Dashboard
-Go to region us-west-2 and then to RDS
-Click on databases
-Find and select your database that is currently in operation (not the clone)
-Click Modify
-Near the bottom, Change "DB cluster Parameter group"
    from: cumulus-kris-sbx7894-cluster-parameter-group      // EXAMPLE
    to:   cumulus-kris-sbx7894-cluster-parameter-group-13   // EXAMPLE (Choose the one that has a '-13' at the end)
-Click Continue
-Select "Apply immediately"
-This should be a very fast change.
-After this is done, you should be able to successfully deploy version 18.2.0 to the sandbox (Note, I had a successful smoke test after doing this as well)
-After Version 18.2.0 is deployed, you can verify this worked by checking the configuration on the database. The new parameter group should be something like this: cumulus-kris-sbx7894-cluster-parameter-group-v13 with an added 'v' in it. Also this cluster parameter group should be managed by terraform.
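
The parameter group switch and the follow-up check can also be sketched with the AWS CLI. This is an illustration under assumptions, not the procedure that was run: the function names are hypothetical, and the cluster and group names in the example are the sandbox examples from the steps above.

```shell
# Hypothetical helper: point a cluster at a different cluster parameter group.
# $1 = DB cluster identifier, $2 = cluster parameter group name
switch_cluster_parameter_group() {
  local cluster_id="$1" param_group="$2"

  aws rds modify-db-cluster \
    --db-cluster-identifier "$cluster_id" \
    --db-cluster-parameter-group-name "$param_group" \
    --apply-immediately
}

# Hypothetical helper: print the parameter group the cluster currently uses.
# $1 = DB cluster identifier
current_cluster_parameter_group() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterParameterGroup' \
    --output text
}

# Example:
# switch_cluster_parameter_group cumulus-kris-sbx7894-rds-serverless cumulus-kris-sbx7894-cluster-parameter-group-13
# current_cluster_parameter_group cumulus-kris-sbx7894-rds-serverless
```

Running `current_cluster_parameter_group` again after the 18.2.0 deploy is one way to confirm the cluster has moved to the terraform-managed `-v13` group.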

krisstanton commented 4 months ago

WIP Update: Running into a problem with GitHub's UAT deployment.

An error occurs specifically when GitHub tries to do a UAT deployment. The error is:

terraspace plan cumulus:  Error: error archiving directory: could not archive missing directory: /home/runner/work/csdap-cumulus/csdap-cumulus/build/main
Error running: terraspace plan cumulus. Fix the error above or check logs for the error.
Error: Process completed with exit code 2.

Note: We had two sets of successful sandbox deployments.