department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Near No-Downtime: Redis Backup and Restoration Strategy #78048

Open LindseySaari opened 6 months ago

LindseySaari commented 6 months ago

9/10/2024: This ticket was determined to be out of scope for the Redis POAM (a nice to have rather than a must have), so it's being rescheduled and placed in the Near No-Downtime Upgrades Epic.

Description

Currently, Redis is backed up only once a day. Read-only replicas are captured regularly throughout the day and can be promoted, but they cannot be edited. To have a more solid, reliable, and consistent backup in the event of a system failure, Redis needs to be backed up on a more frequent cadence. Redis currently offers only a once-daily backup, and it won't matter to VA whether we have a 24-hour backup or a 12-hour backup; if data is lost along with the cache, that will not be a good thing. The Rails cache, Apps cache, and Sidekiq Redis instances will need to be reviewed.

Tasks

Success Metrics

Acceptability Criteria

LindseySaari commented 5 months ago

Notes: In terms of the issue we're facing here, it doesn't appear that we would gain anything from partitioning the data. ElastiCache for Redis Cluster Mode introduces sharding and automatic data partitioning across multiple nodes. If our primary concern is preventing data loss in the event of a Redis cluster failure, focusing on data durability and disaster recovery strategies seems to be key here. While Redis Cluster Mode enhances availability, scalability, and performance through sharding and automatic failover, we could benefit by complementing these features with a more robust backup and recovery process.
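To illustrate why partitioning alone does not add durability, here is a minimal sketch of how Redis Cluster assigns a key to one of its 16384 hash slots (CRC16 of the key, modulo 16384). Each key lands in exactly one slot, so sharding spreads data across nodes rather than creating extra copies of it. The key names are hypothetical, and the sketch ignores hash tags (the `{...}` syntax) for brevity.

```ruby
# CRC16, XMODEM variant (polynomial 0x1021, initial value 0), computed bit by bit.
def crc16_xmodem(data)
  crc = 0
  data.each_byte do |byte|
    crc ^= (byte << 8)
    8.times do
      crc = (crc & 0x8000).zero? ? (crc << 1) : ((crc << 1) ^ 0x1021)
      crc &= 0xFFFF # keep the register at 16 bits
    end
  end
  crc
end

HASH_SLOTS = 16_384

# Slot a key maps to (hash tags are not handled in this sketch).
def key_slot(key)
  crc16_xmodem(key) % HASH_SLOTS
end

# Hypothetical cache keys, for illustration only.
%w[session:1 session:2 session:3].each do |key|
  puts "#{key} -> slot #{key_slot(key)}"
end
```

If the node holding a slot is lost along with its replicas, the keys in that slot are gone, which is why a backup and recovery process is still needed on top of Cluster Mode.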

LindseySaari commented 5 months ago

Postgres approach notes:

We could potentially utilize Sidekiq's middleware or lifecycle hooks to interact with Postgres. We could for example:

This redundancy will act as a layer of protection to prevent data loss in the event of a disaster. We will want to ensure that the additional database calls do not negatively impact performance.
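A rough sketch of what the middleware idea could look like, under stated assumptions: `JobBackupStore` is a hypothetical in-memory stand-in for a Postgres table (in practice this might be an ActiveRecord model), and the class names are illustrative, not existing code. The `call` signatures follow Sidekiq's client and server middleware conventions.

```ruby
require "json"

# Hypothetical stand-in for a Postgres-backed table of in-flight jobs.
class JobBackupStore
  def initialize
    @rows = {}
  end

  def record(jid, payload)
    @rows[jid] = JSON.generate(payload)
  end

  def remove(jid)
    @rows.delete(jid)
  end

  # Jobs that were enqueued but have not completed successfully.
  def pending
    @rows.transform_values { |json| JSON.parse(json) }
  end
end

# Client middleware: runs when a job is pushed to Redis.
class BackupOnEnqueue
  def initialize(store)
    @store = store
  end

  def call(_job_class, job, _queue, _redis_pool)
    @store.record(job["jid"], job) # the extra database write per enqueue
    yield
  end
end

# Server middleware: runs around job execution.
class ClearBackupOnSuccess
  def initialize(store)
    @store = store
  end

  def call(_job_instance, job, _queue)
    result = yield # if the job raises, the backup row is kept
    @store.remove(job["jid"])
    result
  end
end

# Registration in a real app would look roughly like:
#   Sidekiq.configure_client { |c| c.client_middleware { |chain| chain.add BackupOnEnqueue, store } }
#   Sidekiq.configure_server { |c| c.server_middleware { |chain| chain.add ClearBackupOnSuccess, store } }

store = JobBackupStore.new
job = { "jid" => "abc123", "class" => "ExampleJob", "args" => [42] }
BackupOnEnqueue.new(store).call("ExampleJob", job, "default", nil) { :pushed }
```

After a Redis loss, the rows left in `pending` would be the jobs to re-enqueue. Every enqueue and every completed job costs one extra write, which is the performance impact to measure.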