cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Blazer needs to move to a dedicated Read Only DB #276

Open jzbahrai opened 6 months ago

jzbahrai commented 6 months ago

Description

Move Blazer to its own dedicated read only instance.

One of the ways to do this:

On the RDS Proxy side of things, the best approach I can find for stopping Notify from using a specific instance in the cluster is involved but should work:

  1. Create a new private subnet in the VPC.
  2. Create a new dedicated reader instance in the cluster that exists in the new subnet (we’ve never tried to do this before).
  3. Do not add the new subnet to the RDS Proxy endpoints.
  4. Update Blazer config to only connect to the new reader.

Acceptance Criteria

Given some context, when (X) action occurs, then (Y) outcome is achieved

jzbahrai commented 6 months ago

https://github.com/cds-snc/notification-terraform/pull/1063

jzbahrai commented 6 months ago

So this doesn't work due to this:


```
╷
│ Error: creating RDS Cluster (notification-canada-ca-dev-cluster) Instance (notification-canada-ca-dev-separate-reader-db-0): InvalidParameterCombination: Subnet group notification-canada-ca-separate-reader-dbdev is different from subnet group of cluster notification-canada-ca-dev-cluster
│   status code: 400, request id: 4ff7d8cb-2458-41cf-b857-7641a2183ae0
│
│   with aws_rds_cluster_instance.notification-canada-ca-separate-reader-db,
│   on rds.tf line 45, in resource "aws_rds_cluster_instance" "notification-canada-ca-separate-reader-db":
│   45: resource "aws_rds_cluster_instance" "notification-canada-ca-separate-reader-db" {
│
╵
Releasing state lock. This may take a few moments...
ERRO[0053] Terraform invocation failed in /Users/jumana/Notify/notification-terraform/env/dev/rds/.terragrunt-cache/0jcydO7GdJVPyQR1_jFgEUzxp-E/-W3lh3rI8VFfRvtQD2cPdpT6ILw/rds  prefix=[/Users/jumana/Notify/notification-terraform/env/dev/rds]
ERRO[0053] 1 error occurred:
    * exit status 1
```

Ben asked to look into this: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/rds_cluster#replication_source_identifier

You could try creating a new RDS cluster and setting that value (replication_source_identifier) to the ARN of the primary instance. They can then be in the same subnet.

adriannelee commented 6 months ago

We have to go back to the drawing board. Will pick this up during the incidents-related tickets.

jzbahrai commented 5 months ago

Things that don't work:

  1. Setting up a separate subnet in the same cluster
  2. Specifying an individual AZ for the Blazer instance. If the cluster rebalances, how does it know not to use the Blazer AZ? ca-central only has 3 AZs, so we would be removing one from the cluster in order to use it for Blazer, which would reduce the resiliency of the cluster.

I am going to keep this open for another 24 hours; otherwise I will hand it off to Core, who are going to implement a separate cluster for Blazer.

jzbahrai commented 5 months ago

I would like to move this to the Notify Core backlog.

ben851 commented 5 months ago

I've put in a PR that implements a separate read instance in the database cluster. Notes:

  1. Originally I wanted to create a completely separate cluster with replication enabled, but this doesn't seem to be supported for Aurora databases.
  2. I also wanted to put in a secondary read-only proxy that went to this instance specifically, but that is also not supported.
  3. When we upgrade to Postgres 15 we can convert this new instance into a serverless instance to save money. I have left the config for this in this PR but commented out.
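
For illustration only, a dedicated reader inside an existing Aurora cluster could be declared roughly as follows. This is a minimal sketch and not the contents of the PR: the resource label `blazer_reader`, the identifier, and all three variables are hypothetical. `promotion_tier = 15` gives the instance the lowest failover priority, so it is the last candidate to be promoted to writer. As far as I know it does not, by itself, keep the cluster reader endpoint from routing traffic to the instance, so Blazer would connect through the instance's own endpoint.

```hcl
# Hypothetical inputs; none of these names come from the actual repository.
variable "cluster_identifier" {
  type        = string
  description = "Identifier of the existing Aurora cluster"
}

variable "engine_version" {
  type        = string
  description = "Engine version, matching the existing cluster"
}

variable "blazer_instance_class" {
  type        = string
  description = "Instance class for the dedicated Blazer reader"
}

# A reader that joins the existing cluster and therefore uses the
# cluster's own subnet group.
resource "aws_rds_cluster_instance" "blazer_reader" {
  identifier         = "blazer-reader-0" # hypothetical name
  cluster_identifier = var.cluster_identifier
  engine             = "aurora-postgresql"
  engine_version     = var.engine_version
  instance_class     = var.blazer_instance_class

  # 0 is the highest failover priority, 15 the lowest.
  promotion_tier = 15
}

# Instance-level endpoint that the Blazer config would point at.
output "blazer_reader_endpoint" {
  value = aws_rds_cluster_instance.blazer_reader.endpoint
}
```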

ben851 commented 5 months ago

Further note: Aurora read replicas seem to be best practice for this anyway.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Replication.html#Aurora.Replication.Replicas

jimleroyer commented 5 months ago

@ben851 to open a ticket with AWS to get their opinion on how we can isolate the GCNotify database exclusively for Blazer.

@jimleroyer to open a ticket on the Blazer GitHub issues forum to inquire about best practices with Blazer in production, i.e. how do they grant access, which queries do they run, and how do they isolate queries run in production from affecting normal production load.

ben851 commented 5 months ago

https://support.console.aws.amazon.com/support/home?region=ca-central-1#/case/?displayId=170558991301312&language=en

sastels commented 5 months ago

Ben to review AWS' response

sastels commented 5 months ago

AWS's suggestions probably won't work too well for us (at least, they won't work if we want to allow folks to run whatever queries they want).

sastels commented 5 months ago

Jimmy will reply to AWS's support ticket response.

jimleroyer commented 5 months ago

Replied to AWS support with additional context and the solutions we are considering. Let's wait and see.

jimleroyer commented 5 months ago

AWS support proposed to meet and we suggested next Monday at 8:30. Waiting for confirmation.

ben851 commented 5 months ago

Meeting confirmed w/ AWS for 8:30 EST on Feb 5

jimleroyer commented 4 months ago

Ben met with AWS and they redirected us to the Aurora team. They confirmed that the RDS proxy is too limited for what we want to do.

jimleroyer commented 4 months ago

AWS was supposed to respond to our ticket, but there have been no updates.

jimleroyer commented 4 months ago

Ben has requested another meeting with AWS for next week.

ben851 commented 4 months ago

Meeting with AWS on Wednesday.

ben851 commented 4 months ago

Spoke with AWS yesterday. They recommended setting up logical replication manually. They also suggested that we speak with the AWS Solutions Architect we have under our support plan.
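
As a rough sketch of what the first step of manual logical replication might involve on the Terraform side: Aurora PostgreSQL requires the static `rds.logical_replication` parameter to be set to 1 in a custom DB cluster parameter group on the source cluster. The group name and the `parameter_group_family` variable below are hypothetical, and the family must match the cluster's engine version. The publication and subscription themselves are created afterwards with SQL on the source and target databases and are not shown here.

```hcl
# Hypothetical input; the family must match the cluster's engine version,
# for example "aurora-postgresql15" for Aurora PostgreSQL 15.
variable "parameter_group_family" {
  type = string
}

resource "aws_rds_cluster_parameter_group" "logical_replication" {
  name   = "notify-logical-replication" # hypothetical name
  family = var.parameter_group_family

  # Static parameter: it only takes effect after the writer instance reboots.
  parameter {
    name         = "rds.logical_replication"
    value        = "1"
    apply_method = "pending-reboot"
  }
}
```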

ben851 commented 4 months ago

Need to estimate costs on manual logical replication.

Could also look into serverless.

jimleroyer commented 3 months ago

Jimmy to copy and paste the solution that was proposed by the solutions architect, in case we want to protect our prod env from Blazer.

jimleroyer commented 3 months ago

Moving to review as we need to talk about this in one of our code review sessions.

ben851 commented 2 months ago

Discussed this with the team; we are happy with the status quo. I will update the ADR to reflect our decisions.

jimleroyer commented 2 months ago

We have no ADR opened for this atm. We can make a short one, not very detailed but describing the options we had along with the pros and cons, and mention that we selected the status quo in favor of not letting non-devs choose our Blazer instance and having them use the QuickSight option instead.

ben851 commented 2 months ago

ADR created and submitted for review