department-of-veterans-affairs / notification-api

Notification API

Spike: AWS Backups for PostgreSQL #1362

Closed nikolai-efimov closed 1 year ago

nikolai-efimov commented 1 year ago

User Story - Business Goal

To ensure business continuity in case of database failure.

User Story(ies)

As a DB admin, I want to configure automated backups and be able to restore our Aurora PostgreSQL database with the lowest RPO possible, so that we don't lose any data if the system goes down.

Additional Info and Resources

Recovery Point Objective (RPO) - indicates how much data will be lost during downtime.
Recovery Time Objective (RTO) - indicates how long the system is going to be down.

Previous spike 1137 resulted in the following findings:

  1. "Within RDS/Aurora, the limitation is for daily backups. For more granular backups, such as hourly backups, AWS Backups is required" (from support email)
  2. We cannot expect a low Recovery Point Objective (RPO), which indicates how much data will be lost during downtime. This depends on when the incident takes place (peak vs. off-peak hours, etc.).
  3. We don't have control over WAL file archives, so we're at the mercy of the AWS Backup mechanism. According to the AWS support team: "All data present at the time the snapshot is taken is restored. Backups have a RPO of 5 minutes via Point In Time Restores. [2]" (from support email)
  4. There's PITR, but there's no "backtrack" option in RDS/Aurora PostgreSQL. "It will create a new database cluster rather than using the existing cluster. [3]" (from support email) See the restore sketch after this list.
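
For reference, a minimal sketch of what a point-in-time restore could look like with boto3, illustrating finding 4 (the restore creates a new cluster rather than rolling back the existing one). The cluster and instance identifiers, instance class, and timestamp are placeholders, not our actual configuration.

```python
# Hypothetical sketch of an Aurora PostgreSQL point-in-time restore via boto3.
# Per finding 4 above, this creates a NEW cluster; it does not modify the
# existing one. All identifiers below are placeholders.
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

# Restore the cluster to a specific timestamp (must fall within the PITR window).
rds.restore_db_cluster_to_point_in_time(
    SourceDBClusterIdentifier="notification-api-db-cluster",     # placeholder
    DBClusterIdentifier="notification-api-db-cluster-restored",  # new cluster
    RestoreToTime=datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc),
    # or UseLatestRestorableTime=True for the most recent restorable point
)

# The restored cluster has no instances yet; at least one must be created
# before the cluster endpoint is usable.
rds.create_db_instance(
    DBInstanceIdentifier="notification-api-db-restored-1",       # placeholder
    DBClusterIdentifier="notification-api-db-cluster-restored",
    DBInstanceClass="db.r6g.large",                              # assumption
    Engine="aurora-postgresql",
)
```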

Acceptance Criteria & Checklist

Completed research addresses the problems and/or answers the questions below:

  1. What is the expected Recovery Time Objective (RTO), i.e., how long will the system be down?
  2. Are we able to successfully configure backups by following the steps in Automated AWS Backups, or do we need to follow up with the support team? (A sketch of one possible configuration follows this list.)
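
For criterion 2, a minimal sketch of what creating an hourly backup plan with boto3 might look like, assuming we configure AWS Backup programmatically rather than through the console. The plan name, vault name, schedule, and retention are placeholders and would need to be confirmed against our actual requirements.

```python
# Hypothetical sketch: create an hourly AWS Backup plan for the Aurora cluster.
# Names, schedule, and retention are assumptions, not the team's actual config.
import boto3

backup = boto3.client("backup")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "aurora-hourly-backups",          # placeholder
        "Rules": [
            {
                "RuleName": "hourly-snapshots",
                "TargetBackupVaultName": "Default",          # placeholder vault
                "ScheduleExpression": "cron(0 * * * ? *)",   # top of every hour
                "StartWindowMinutes": 60,        # "Start within 1 hour"
                "CompletionWindowMinutes": 180,  # "Complete within 3 hours"
                "Lifecycle": {"DeleteAfterDays": 35},        # assumption
                # EnableContinuousBackup toggles PITR for the rule; how it
                # interacts with the snapshot schedule is something we still
                # need to confirm with AWS support.
                "EnableContinuousBackup": False,
            }
        ],
    }
)
print(plan["BackupPlanId"])
```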
mjones-oddball commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @justaskdavidb2 @k-macmillan @kalbfled @ldraney @nikolai-efimov

tabinda-syed commented 1 year ago

If you pick up this ticket, please reach out to Nikolai to chat about it.

nikolai-efimov commented 1 year ago

(Attachments: image (1).png, image (2).png)

nikolai-efimov commented 1 year ago

(Attachment: image (3).png)

nikolai-efimov commented 1 year ago

(Attachment: image (4).png)

nikolai-efimov commented 1 year ago

Answer from AWS support to our previous question:

The "refine selection using tags" section refers to a logical AND, i.e., "if tag1 AND tag2".
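
Assuming the console's "refine selection using tags" maps to the Conditions field of a backup selection in the API, a tag-ANDed selection might look like the sketch below. The backup plan ID, IAM role ARN, and tag keys/values are placeholders.

```python
# Hypothetical sketch: assign resources to the backup plan with two tag
# conditions. Per the support answer above, multiple conditions are ANDed.
# Plan ID, IAM role ARN, and tag keys/values are placeholders.
import boto3

backup = boto3.client("backup")

backup.create_backup_selection(
    BackupPlanId="11111111-2222-3333-4444-555555555555",  # placeholder
    BackupSelection={
        "SelectionName": "aurora-clusters-by-tag",
        "IamRoleArn": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole",
        "Resources": ["*"],  # select everything, then narrow by tags
        # Both conditions must match (tag1 AND tag2):
        "Conditions": {
            "StringEquals": [
                {"ConditionKey": "aws:ResourceTag/Environment", "ConditionValue": "prod"},
                {"ConditionKey": "aws:ResourceTag/Team", "ConditionValue": "notification-api"},
            ]
        },
    },
)
```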

Additional questions sent to AWS support team

  1. It's unclear how "Enable continuous backups for point-in-time recovery (PITR)" works together with backup frequency. What would be the difference between setting backup frequency to hourly (with PITR enabled) and setting it to daily (with PITR enabled)? Could you explain how it works?
  2. We found the following in the documentation: "You can create two backup rules in the same backup plan: one continuous backup rule to recover the most current resource state and one snapshot backup rule for long-term retention". What is the difference in effect between doing this as two separate rules vs. a single rule? This relates to question #1.
  3. There is an option to set backup frequency with a cron expression. What's a good use case for this option?
  4. How long does it take to back up a ~100GB Aurora cluster? We set "Start within 1 hour / Complete within 3 hours"; are those numbers appropriate? Are there best practices for these settings?
  5. How can we tell the current database size? (The "volume bytes used" charts are empty under "Monitoring", and there's nothing under Configuration/Storage.)
  6. How can we tell how much storage is used by backups? (A sketch of one way to estimate 5 and 6 follows this list.)
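
While we wait on support, a rough sketch of one way to approach questions 5 and 6 ourselves: query the Aurora VolumeBytesUsed CloudWatch metric for the cluster, and sum the recovery-point sizes in the backup vault. Note the console charts were empty for us, so the metric query may also return nothing; identifiers are placeholders and both numbers are approximations (summed recovery-point sizes don't necessarily reflect billed incremental storage).

```python
# Hypothetical sketch for questions 5 and 6: estimate cluster storage from the
# Aurora VolumeBytesUsed CloudWatch metric, and approximate backup storage by
# summing recovery-point sizes in a vault. Identifiers are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
backup = boto3.client("backup")


def cluster_volume_bytes(cluster_id: str) -> float:
    """Latest hourly average of VolumeBytesUsed for an Aurora cluster."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="VolumeBytesUsed",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=now - timedelta(hours=3),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0


def vault_backup_bytes(vault_name: str) -> int:
    """Rough total of recovery-point sizes stored in a backup vault."""
    total = 0
    paginator = backup.get_paginator("list_recovery_points_by_backup_vault")
    for page in paginator.paginate(BackupVaultName=vault_name):
        for rp in page["RecoveryPoints"]:
            total += rp.get("BackupSizeInBytes", 0)
    return total


print(cluster_volume_bytes("notification-api-db-cluster") / 1e9, "GB in cluster")
print(vault_backup_bytes("Default") / 1e9, "GB of recovery points")
```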
tabinda-syed commented 1 year ago

Nikolai is awaiting a response from the AWS support team (i.e., an external blocker). We anticipate that this ticket is at risk of not being completed this sprint.

nikolai-efimov commented 1 year ago