department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Manual Rollback Job of vagov-prod deploy #19424

Open 7hunderbird opened 1 month ago

7hunderbird commented 1 month ago

User Story or Problem Statement

Create a job in Jenkins that performs the rollback steps from the deploy failure troubleshooting guide.

https://raw.githubusercontent.com/department-of-veterans-affairs/va.gov-cms/main/READMES/devops/deploy-failure-troubleshooting-guide.md

Description or Additional Context

Most of the time we go in and do "2A) ROLLBACK" in the deploy-failure-troubleshooting-guide.

1. Abandon/remove new instance with:

aws autoscaling complete-lifecycle-action \
  --region us-gov-west-1 \
  --auto-scaling-group-name "dsva-vagov-prod-cms-asg" \
  --lifecycle-hook-name launch-hook \
  --lifecycle-action-result ABANDON \
  --instance-id i-0f256f4eae72d5c87

2. Scale in the ASG from 2 to 1 (this may change later as we move to HA)
3. Remove the latest launch-template (first set latest-1 as the default)
4. Set the deployment_version tag on the existing instance from ‘previous’ to ‘latest’
5. drush cache:rebuild
6. Run the post-live job (removes site-alerts and lets users know the deploy is done)
7. rm /var/www/cms/docroot/sites/default/settings/settings.deploy.active.php (settings.deploy.inactive.php should still exist; removing the active file brings all the traffic back)
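
As a rough illustration, the rollback steps listed above could be sketched as one script. This is not the actual Jenkins job. It assumes that "latest-1" means the launch template version numbered one below the failed one, that the launch template is addressed by name, and that the drush and rm steps run on the existing CMS instance (how the job reaches that host is left out). The function names, arguments, and the DRY_RUN switch are hypothetical; the post-live Jenkins job is only noted, not triggered.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "2A) ROLLBACK" steps. Not the actual Jenkins job.
set -eu

REGION="us-gov-west-1"
ASG_NAME="dsva-vagov-prod-cms-asg"
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands (default), 0 = run them

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Args: stuck instance ID, existing instance ID, launch template name (assumed),
# and the launch template version number created by the failed deploy.
rollback_2a() {
  local new_instance="$1" old_instance="$2" template="$3" bad_version="$4"
  # Assumption: "latest-1" is the version just below the failed one.
  local good_version=$((bad_version - 1))

  # 1. Abandon the new instance waiting in the launch lifecycle hook.
  run aws autoscaling complete-lifecycle-action --region "$REGION" \
    --auto-scaling-group-name "$ASG_NAME" --lifecycle-hook-name launch-hook \
    --lifecycle-action-result ABANDON --instance-id "$new_instance"

  # 2. Scale the ASG in from 2 to 1.
  run aws autoscaling set-desired-capacity --region "$REGION" \
    --auto-scaling-group-name "$ASG_NAME" --desired-capacity 1

  # 3. Make the previous version the default, then delete the failed one.
  run aws ec2 modify-launch-template --region "$REGION" \
    --launch-template-name "$template" --default-version "$good_version"
  run aws ec2 delete-launch-template-versions --region "$REGION" \
    --launch-template-name "$template" --versions "$bad_version"

  # 4. Retag the existing instance from 'previous' to 'latest'.
  run aws ec2 create-tags --region "$REGION" --resources "$old_instance" \
    --tags Key=deployment_version,Value=latest

  # 5. Rebuild the Drupal cache (must run on the existing CMS instance).
  run drush cache:rebuild

  # 6. The post-live job is a separate Jenkins job; it is not sketched here.
  echo "# step 6: run the post-live job in Jenkins"

  # 7. Remove the deploy-mode settings file (also on the CMS instance).
  run rm /var/www/cms/docroot/sites/default/settings/settings.deploy.active.php
}
```

With the default DRY_RUN=1, calling `rollback_2a <stuck-instance-id> <existing-instance-id> <template-name> <failed-version>` only prints the commands, which makes it possible to review the plan before setting DRY_RUN=0.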

It would be great if we could have a job in Jenkins that would just perform the steps once we've determined that we are in the 2A scenario.

An example of this would be when the deploy process has started and the Auto Scaling Group is replacing the EC2 instance but fails to obtain an IP address.

This is a first step: the job will be run manually for now, and we will want to automate it in future iterations.
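
One possible signal for the 2A scenario is an instance that stays in the launch lifecycle hook. A minimal sketch of such a check follows, assuming the stuck instance shows up in the ASG with the `Pending:Wait` lifecycle state (the state an instance holds while a launch hook is pending). The function names are hypothetical, and a person would still need to confirm the scenario before rolling back.

```shell
#!/usr/bin/env bash
# Hypothetical helper: find instances still waiting in a launch lifecycle hook.
set -eu

REGION="us-gov-west-1"
ASG_NAME="dsva-vagov-prod-cms-asg"

# Print the IDs of instances in the ASG whose lifecycle state is Pending:Wait.
pending_wait_instances() {
  aws autoscaling describe-auto-scaling-groups \
    --region "$REGION" \
    --auto-scaling-group-names "$ASG_NAME" \
    --query "AutoScalingGroups[0].Instances[?LifecycleState=='Pending:Wait'].InstanceId" \
    --output text
}

# Succeed only if at least one instance is waiting in the hook.
# With --output text the CLI prints an empty string for an empty list
# and "None" for a null result, so both are treated as "nothing stuck".
has_stuck_instance() {
  local stuck
  stuck="$(pending_wait_instances)"
  [ -n "$stuck" ] && [ "$stuck" != "None" ]
}
```

A manual job could call `has_stuck_instance` first and refuse to continue when it fails, so the rollback is not run against a healthy deploy by accident.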

Steps for Implementation

Acceptance Criteria

7hunderbird commented 2 weeks ago

The job now exists in Jenkins. It is currently a "no-op"; the change that adds the job is in this devops PR.

Here's the link: http://jenkins.vfs.va.gov/job/cms-test/job/cms-test-rollback-staging/

7hunderbird commented 5 days ago

Last week I put the test.staging.cms.va.gov site into a broken state so that I could test the rollback button.

It ended up in an even more broken state because I didn't run the rollback, and when I tried to reset the site to its previous state I ran into problems:

  1. The site is showing the initial install page of Drupal.

Image

  2. The cms-test vagov-staging deploy in Jenkins shows a "failure in post deploy tasks."
00:30:40.499    msg: Failure in the post-deploy tasks

Three tasks fail. These are the failures from the Jenkins log:

In the Enable Deploy Mode in CMS task:

00:30:39.577  TASK [Enable Deploy Mode in CMS] ***********************************************
00:30:39.577  Friday 08 November 2024  18:55:08 +0000 (0:00:00.027)       0:07:20.384 ******* 
00:30:40.499  fatal: [ip-10-247-35-92.us-gov-west-1.compute.internal]: FAILED! => changed=true 
00:30:40.499    cmd: /bin/bash -lc 'drush va-gov-enable-deploy-mode 2>&1'
00:30:40.499    delta: '0:00:00.551960'
00:30:40.499    end: '2024-11-08 18:55:09.168307'
00:30:40.499    msg: non-zero return code
00:30:40.499    rc: 1
00:30:40.499    start: '2024-11-08 18:55:08.616347'
00:30:40.499    stderr: ''
00:30:40.499    stderr_lines: <omitted>
00:30:40.499    stdout: |2-
00:30:40.499    
00:30:40.499    
00:30:40.499        Command va-gov-enable-deploy-mode was not found. Drush was unable to query
00:30:40.499        the database. As a result, many commands are unavailable. Re-run your comma
00:30:40.499        nd with --debug to see relevant log messages.
00:30:40.499    stdout_lines: <omitted>

In the Drush deploy task:

00:30:39.577  TASK [Drush deploy] ************************************************************
00:30:39.577  Friday 08 November 2024  18:55:07 +0000 (0:00:00.029)       0:07:19.502 ******* 
00:30:39.577  fatal: [ip-10-247-35-92.us-gov-west-1.compute.internal]: FAILED! => changed=true 
00:30:39.577    cmd: /bin/bash -lc 'drush deploy --yes 2>&1'
00:30:39.577    delta: '0:00:00.544530'
00:30:39.577    end: '2024-11-08 18:55:08.289503'
00:30:39.577    msg: non-zero return code
00:30:39.577    rc: 1
00:30:39.577    start: '2024-11-08 18:55:07.744973'
00:30:39.577    stderr: ''
00:30:39.577    stderr_lines: <omitted>
00:30:39.577    stdout: |2-
00:30:39.577    
00:30:39.577      In BootstrapHook.php line 40:
00:30:39.577    
00:30:39.577        Bootstrap failed. Run your command with -vvv for more information.
00:30:39.577    stdout_lines: <omitted>

In the Sync PROD database for downstream environments only (sync-db.sh) task:

00:30:16.632  TASK [Sync PROD database for downstream environments only (sync-db.sh)] ********
00:30:16.632  Friday 08 November 2024  18:54:43 +0000 (0:00:00.300)       0:06:55.215 ******* 
00:30:24.700  fatal: [ip-10-247-35-92.us-gov-west-1.compute.internal]: FAILED! => changed=true 
00:30:24.700    cmd: /bin/bash -lc './scripts/sync-db.sh 2>&1'
00:30:24.700    delta: '0:00:08.719743'
00:30:24.700    end: '2024-11-08 18:54:52.174984'
00:30:24.700    msg: non-zero return code
00:30:24.700    rc: 1
00:30:24.700    start: '2024-11-08 18:54:43.455241'
00:30:24.700    stderr: ''
00:30:24.700    stderr_lines: <omitted>
00:30:24.700    stdout: |-
00:30:24.700      Downloading latest PROD database from: https://dsva-vagov-prod-cms-test-backup-sanitized.s3-us-gov-west-1.amazonaws.com/database/cms-prod-db-sanitized-latest.sql.gz
00:30:24.700        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
00:30:24.700                                       Dload  Upload   Total   Spent    Left  Speed
00:30:24.700        0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   263    0   263    0     0   8755      0 --:--:-- --:--:-- --:--:--  9068
00:30:24.700      Downloaded PROD Database to .dumps/cms-prod-db-sanitized-latest.sql.
00:30:24.700      Dropping existing database tables
00:30:24.700    
00:30:24.700       // Do you really want to drop all tables in the database dsva_cms_staging?:
00:30:24.700       // yes.
00:30:24.700    
00:30:24.700      Database tables dropped
00:30:24.700      Importing .dumps/cms-prod-db-sanitized-latest.sql
00:30:24.700      ./scripts/sync-db.sh: line 27: cms-prod-db-sanitized-latest.sql: No such file or directory
00:30:24.700    stdout_lines: <omitted>
00:30:24.700  ...ignoring

7hunderbird commented 5 days ago

It was failing because there was no "sanitized database" available to set up the instance.

I ran these jobs in this order and the site is working again:

  1. http://jenkins.vfs.va.gov/job/cms-test/job/cms-test-db-backup-prod/
  2. http://jenkins.vfs.va.gov/job/cms-test/job/cms-test-db-sanitize/
  3. http://jenkins.vfs.va.gov/job/deploys/job/cms-test-vagov-staging/
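
In the sync-db.sh log above, the staging tables were dropped before the import found that the dump file was missing, and the curl progress line appears to show only 263 bytes transferred. A guard that checks the downloaded archive before anything is dropped might avoid ending up on the Drupal install page. The following is a minimal sketch: the internals of sync-db.sh are not shown in this thread, so the function name and where it would be called are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical guard to run after the download and before dropping any tables.
set -eu

# Return 0 only if the file exists, is non-empty, and passes gzip's integrity test.
dump_is_usable() {
  local dump="$1"
  [ -s "$dump" ] || { echo "missing or empty dump: $dump" >&2; return 1; }
  gzip -t "$dump" 2>/dev/null || { echo "not a valid gzip archive: $dump" >&2; return 1; }
}
```

A script could then stop early with something like `dump_is_usable "$dump" || exit 1`, leaving the existing database untouched when the sanitized backup is not available.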