department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Discovery: Sync AWS Maintenance Windows with PagerDuty #77698

Closed LindseySaari closed 4 months ago

LindseySaari commented 6 months ago

Description

Our tools regularly undergo maintenance. AWS maintenance windows cover operations such as software updates, security patches, and system upgrades, both planned and unplanned (for example, a CVE patch). However, these maintenance activities are not currently reflected in PagerDuty, which can lead to alert fatigue and unnecessary escalations during planned downtime. To streamline our incident management process and reduce noise for on-call folks, we need a way to automatically sync AWS maintenance windows with PagerDuty maintenance windows.

Tasks

Success Criteria

Acceptability Criteria

jennb33 commented 4 months ago

Need to talk with support about available maintenance windows. Moving this ticket to the next Sprint.

jennb33 commented 4 months ago

Need to investigate how maintenance windows are set:

Terraform documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/elasticache_replication_group.html#maintenance_window

jennb33 commented 4 months ago

I think we will also have to include DevOps. The options are Terraform, the AWS console, or automation for patches. @rjohnson2011 hopes to have this written up in the next day.

rjohnson2011 commented 4 months ago

Here are discovery findings for PagerDuty automation for each of the following:

AWS Console

AWS can be integrated with PagerDuty via CloudWatch using the event rule "EC2 Instance System Maintenance" - https://support.pagerduty.com/docs/amazon-cloudwatch-integration-guide

  1. Set up the PagerDuty Integration: Following the linked guide, add an Amazon CloudWatch integration to the PagerDuty service and copy its Integration URL. Then, in the AWS Management Console, create an SNS topic and add an HTTPS subscription whose endpoint is that Integration URL.
  2. Create a CloudWatch Event Rule: In the CloudWatch console, go to "Events" > "Rules" (CloudWatch Events is now Amazon EventBridge) and click on "Create rule." Under "Event Source," choose "AWS services" and select "EC2" as the service name. For "Event Type," select the "EC2 Instance System Maintenance" event types.
  3. Configure the Target: Under "Targets," choose the SNS topic that is subscribed to PagerDuty.
  4. Review and Create the Rule: Review the rule details and click on "Create rule" to save the CloudWatch Event Rule.
  5. Test the Integration: Trigger a test event that matches the rule you just created, or publish a test message to the SNS topic. Verify that you receive an alert in PagerDuty for the test event.
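The rule in step 2 comes down to an event pattern. Below is a minimal sketch of one such pattern, together with a simplified local matcher for checking sample events before creating the rule. It assumes EC2 scheduled maintenance is delivered as an AWS Health event (source aws.health, category scheduledChange); that assumption, the EVENT_PATTERN name, and the matches helper are illustrative and not part of the findings above.

```python
import json

# Assumption: EC2 scheduled maintenance arrives as an AWS Health event.
# Confirm against real events in the account before relying on this pattern.
EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2"],
        "eventTypeCategory": ["scheduledChange"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified exact-value matcher (hypothetical helper).

    Real event patterns support more operators (prefix, anything-but, ...);
    this only handles nested dicts and lists of allowed values.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    # JSON form that can be pasted into the rule's custom event pattern.
    print(json.dumps(EVENT_PATTERN, indent=2))
```

Running a few captured sample events through matches is a cheap way to check the pattern before wiring the rule to PagerDuty.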

Terraform

In Terraform, the maintenance window is set through the maintenance_window argument on the resource, as a weekly UTC range such as sun:05:00-sun:09:00. https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/elasticache_replication_group.html#maintenance_window One possible way to automate this would be through GitHub Actions:

  1. Set up a GitHub Actions Workflow: In your GitHub repository, create a new workflow file (e.g., .github/workflows/update-pagerduty-maintenance.yml). This workflow will be triggered whenever your Terraform files are updated or when a new maintenance window is defined.
  2. Define the Workflow Steps: The workflow is triggered whenever files in the terraform/ directory are updated. The Parse Terraform Files step should contain the logic to parse your Terraform files and extract the maintenance window settings. The parsed settings are then stored as an output. The Update PagerDuty Maintenance Window step uses the official PagerDuty Action to update the maintenance window in PagerDuty based on the extracted settings.

Below is an example workflow for these steps:

name: Update PagerDuty Maintenance Window

on:
  push:
    paths:
      - 'terraform/**'  # Trigger the workflow when Terraform files are updated

jobs:

  update-maintenance-window:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Parse Terraform Files
      id: parse-terraform
      run: |
        # Parse Terraform files to extract maintenance window settings
        # Example: maintenance_window=$(parse_maintenance_window_from_tf_files)
        # Placeholder values: replace with the real parsing logic.
        start_time=""
        end_time=""
        description=""
        # Step outputs are plain strings, so write one output per field.
        # The ::set-output command is deprecated; append to $GITHUB_OUTPUT instead.
        {
          echo "start_time=$start_time"
          echo "end_time=$end_time"
          echo "description=$description"
        } >> "$GITHUB_OUTPUT"

    - name: Update PagerDuty Maintenance Window
      # The action name and its inputs are as proposed in this discovery and
      # have not been verified; confirm they exist before relying on this step.
      uses: PagerDuty/pagerduty-action@v1
      with:
        token: ${{ secrets.PAGERDUTY_TOKEN }}
        maintenance_window_id: ${{ secrets.PAGERDUTY_MAINTENANCE_WINDOW_ID }}
        start_time: ${{ steps.parse-terraform.outputs.start_time }}
        end_time: ${{ steps.parse-terraform.outputs.end_time }}
        description: ${{ steps.parse-terraform.outputs.description }}
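One gap the parse step has to close: Terraform's maintenance_window is a recurring weekly range in UTC (ddd:hh24:mi-ddd:hh24:mi), while a PagerDuty maintenance window needs concrete start and end timestamps. The sketch below shows one way to compute the next occurrence. The function names are hypothetical, and it deliberately returns next week's window if the current one has already started.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Order matches datetime.weekday(): Monday is 0, Sunday is 6.
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
WEEK_MINUTES = 7 * 24 * 60


def _parse_point(text: str) -> Tuple[int, int, int]:
    """Split 'sun:05:00' into (weekday index, hour, minute)."""
    day, hour, minute = text.strip().lower().split(":")
    return DAYS.index(day), int(hour), int(minute)


def next_occurrence(window: str, now: datetime) -> Tuple[datetime, datetime]:
    """Return the next (start, end) in UTC for a weekly window string.

    `now` must be timezone-aware. If the window has already started,
    the following week's occurrence is returned.
    """
    start_text, end_text = window.split("-")
    s_day, s_hour, s_min = _parse_point(start_text)
    e_day, e_hour, e_min = _parse_point(end_text)

    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight + timedelta(
        days=(s_day - now.weekday()) % 7, hours=s_hour, minutes=s_min
    )
    if start <= now:
        start += timedelta(days=7)

    # Duration in minutes, wrapping across the week boundary (sun -> mon).
    start_minutes = (s_day * 24 + s_hour) * 60 + s_min
    end_minutes = (e_day * 24 + e_hour) * 60 + e_min
    duration = (end_minutes - start_minutes) % WEEK_MINUTES
    return start, start + timedelta(minutes=duration)


if __name__ == "__main__":
    start, end = next_occurrence("sun:05:00-sun:09:00", datetime.now(timezone.utc))
    print(start.isoformat(), end.isoformat())
```

The ISO 8601 strings printed at the end are the kind of values the start_time and end_time step outputs would carry.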

Automation for Patches

This can be accomplished by having the patch automation notify PagerDuty directly. The workflow that applies patch updates can make a PagerDuty API call in its initial stages to open a maintenance window. When the patch workflow completes, a second PagerDuty API call can end the maintenance window so that normal alerting resumes. I'm going to ask at standup for more detail on the type of patch automation planned, to determine whether this workflow can support all use cases.
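As a rough sketch of those two API calls, the helpers below build (but do not send) requests against the PagerDuty REST API: a POST to /maintenance_windows to open a window, and a DELETE on the window ID, which ends a window that is currently in progress. The helper names, the example email, and the service ID are hypothetical; the token would come from the patch workflow's secrets.

```python
import json
import urllib.request

API_BASE = "https://api.pagerduty.com"


def _headers(token: str, from_email: str) -> dict:
    # The From header identifies the PagerDuty user making the change.
    return {
        "Authorization": f"Token token={token}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": from_email,
    }


def build_open_request(token, from_email, service_ids,
                       start_time, end_time, description):
    """Build the request that opens a maintenance window (hypothetical helper)."""
    body = {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start_time,  # ISO 8601 strings
            "end_time": end_time,
            "description": description,
            "services": [
                {"id": sid, "type": "service_reference"} for sid in service_ids
            ],
        }
    }
    return urllib.request.Request(
        f"{API_BASE}/maintenance_windows",
        data=json.dumps(body).encode("utf-8"),
        headers=_headers(token, from_email),
        method="POST",
    )


def build_close_request(token, from_email, window_id):
    """Build the request that ends (or deletes) a maintenance window."""
    return urllib.request.Request(
        f"{API_BASE}/maintenance_windows/{window_id}",
        headers=_headers(token, from_email),
        method="DELETE",
    )
```

In the patch workflow, each request would be passed to urllib.request.urlopen, with the window ID from the first response kept around for the closing call.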

jennb33 commented 4 months ago

@LindseySaari and @Kshitiz-devops to review this next.

LindseySaari commented 4 months ago

Nice work on this @rjohnson2011

My questions and comments:

Kshitiz-devops commented 4 months ago

Syncing on Terraform changes sounds good, but I think we should use terraform output to get the maintenance window rather than parsing the value out of the files. A Lambda event-based system for maintenance windows also sounds like a good option. I would like to know more about it.

Is the CloudWatch setup with PagerDuty already tested? Will you need my help with that?
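For the terraform output suggestion, a minimal sketch of what the lookup might look like. It assumes the Terraform configuration exposes an output named maintenance_window (a hypothetical name; nothing in this thread confirms such an output exists) and relies on the JSON shape that terraform output -json prints, where each output is an object with a value field.

```python
import json
import subprocess


def maintenance_window_from_outputs(outputs_json: str,
                                    output_name: str = "maintenance_window") -> str:
    """Extract one output value from `terraform output -json` text.

    `output_name` is an assumed output name, not one defined in this thread.
    """
    outputs = json.loads(outputs_json)
    if output_name not in outputs:
        raise KeyError(f"Terraform output {output_name!r} is not defined")
    return outputs[output_name]["value"]


def read_terraform_outputs(workdir: str) -> str:
    """Run `terraform output -json` in an initialized working directory."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Compared with parsing .tf files, this reads the value Terraform actually resolved, including values that come from variables or modules.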

rjohnson2011 commented 4 months ago

Here is some supplemental information addressing the additional comments:

Integration with RDS and ElastiCache

The process to integrate AWS maintenance windows with PagerDuty for RDS and ElastiCache is similar to the one outlined for EC2 instances.

For RDS:

  1. Set up the PagerDuty Integration in AWS as mentioned earlier.
  2. Create a CloudWatch Event Rule: In the CloudWatch console, go to "Events" > "Rules" and click on "Create rule." Under "Event Source," choose "AWS services" and select "RDS" as the service name. For "Event Type," select the "RDS Maintenance Window" event types.
  3. Configure the Target, Review, and Create the Rule following the same steps as for EC2.

For ElastiCache:

  1. Set up the PagerDuty Integration in AWS.
  2. Create a CloudWatch Event Rule: In the CloudWatch console, go to "Events" > "Rules" and click on "Create rule." Under "Event Source," choose "AWS services" and select "ElastiCache" as the service name. For "Event Type," select the "ElastiCache Maintenance Window" event types.
  3. Configure the Target, Review, and Create the Rule following the same steps as for EC2.

Automation for Security Vulnerability Patches

For security vulnerability patches, we can leverage the same approach of integrating CloudWatch events with PagerDuty. AWS services like EC2, RDS, and ElastiCache emit events when maintenance such as a security patch is scheduled or applied. We can create CloudWatch Event Rules to capture these events and trigger a PagerDuty incident or maintenance window. For example, for EC2 instances, we can create a rule that captures scheduled events with the "system-reboot" event code, which is issued when AWS schedules a reboot of the underlying host for maintenance such as patching. These scheduled events are separate from the "EC2 Instance State-change Notification" event type, which only reports instance state transitions such as "running" or "stopped."

Event-driven Automation with Lambda or GitHub Actions

In addition to the CloudWatch integration, we can also automate the process of creating and updating PagerDuty maintenance windows using event-driven architectures like AWS Lambda or GitHub Actions.

AWS Lambda Approach:

Create an AWS Lambda function that integrates with the PagerDuty API to create, update, or delete maintenance windows. Configure CloudWatch Event Rules to trigger the Lambda function based on specific events, such as when a maintenance window is scheduled in AWS or when a security patch is available. The Lambda function can parse the event data, extract the relevant details (e.g., maintenance window start and end times, resource IDs), and make the corresponding API calls to PagerDuty to manage the maintenance window.
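A minimal sketch of the parsing half of such a Lambda function follows. It assumes the triggering event is an AWS Health scheduled-change event whose detail carries startTime and endTime as RFC 1123 dates plus a list of affectedEntities; that event shape, the PAGERDUTY_SERVICE_IDS environment variable, and the function names are assumptions for illustration. Sending the payload to PagerDuty is left out.

```python
import os
from email.utils import parsedate_to_datetime


def build_window_payload(event: dict, service_ids: list) -> dict:
    """Turn a maintenance event into a PagerDuty maintenance window body.

    Assumes an AWS Health style `detail` block; other event sources
    would need their own field mapping.
    """
    detail = event["detail"]
    if "startTime" not in detail or "endTime" not in detail:
        raise ValueError("event has no bounded maintenance window")

    # Dates are assumed to look like 'Sat, 04 Jun 2016 05:01:10 GMT'.
    start = parsedate_to_datetime(detail["startTime"])
    end = parsedate_to_datetime(detail["endTime"])
    resources = [e["entityValue"] for e in detail.get("affectedEntities", [])]

    description = "AWS {} maintenance ({}): {}".format(
        detail.get("service", "unknown"),
        detail.get("eventTypeCode", "unknown"),
        ", ".join(resources) or "no resources listed",
    )
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": description,
            "services": [
                {"id": sid, "type": "service_reference"} for sid in service_ids
            ],
        }
    }


def lambda_handler(event, context):
    # PAGERDUTY_SERVICE_IDS is a hypothetical comma-separated env variable.
    raw = os.environ.get("PAGERDUTY_SERVICE_IDS", "")
    service_ids = [sid for sid in raw.split(",") if sid]
    payload = build_window_payload(event, service_ids)
    # The POST to the PagerDuty API would go here; omitted in this sketch.
    return payload
```

Keeping the field mapping in build_window_payload separate from the handler makes it easy to unit test against captured sample events for each AWS service.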

GitHub Actions Approach:

Set up a GitHub Actions workflow in your infrastructure-as-code repository (e.g., Terraform) that triggers whenever Terraform files are updated or when a new maintenance window is defined. In the workflow, include steps to parse the Terraform files and extract the maintenance window settings. Use the official PagerDuty Action to update the maintenance window in PagerDuty based on the extracted settings from the Terraform files.

Both the Lambda and GitHub Actions approaches provide event-driven automation for managing PagerDuty maintenance windows, either based on AWS events or changes to your infrastructure-as-code repository.