Cloud SQL Read Replica fail

rutalreja-deloitte commented 3 years ago

Expected Behavior

2 Cloud SQL Instances with 'n' read replicas are created

Actual Behavior

2 Cloud SQL Instances with n-1 read replicas are created

Specifications

Only happens on the first run in the new blank project, if SQL instance is fixed manually (delete failed read replica and taint SQL random suffix) the issue does not come up again in the same project [multiple repeat tests - terraform destroy and terraform apply]

Workaround

Follow Troubleshooting steps in the 5-infrastructure README file

daniel-cit commented 3 years ago

There are two different situations here:

The timeout for the creation of the replicas is defined as 10 minutes in the Cloud SQL module: https://github.com/terraform-google-modules/terraform-google-sql-db/blob/51bb8bacc51ee227986a816a8cb1c72cb89eb5da/modules/postgresql/variables.tf#L247

variable "create_timeout" {
  description = "The optional timout that is applied to limit long database creates."
  type        = string
  default     = "10m"
}

variable "update_timeout" {
  description = "The optional timout that is applied to limit long database updates."
  type        = string
  default     = "10m"
}

variable "delete_timeout" {
  description = "The optional timout that is applied to limit long database deletes."
  type        = string
  default     = "10m"
}

With this configuration, sometimes the replica is created without any error but terraform has ended the connection with a timeout and when we try again it fails because the replica already exists.

This can be minimized by rising the timeout to 15 minutes. From the logs, we can see some of them completing in a little over 10 minutes.

The other case is when there is a failure in the creation of the replica. In this case, terraform does not receive a response from the API and it ends the call when the timeout limit is reached.

Investigating the logs in this case we can see that when we have the failure, there is the call to create the replica end only silence after that :

2021-05-07 00:51:08.140 BRT IAM cloudsql.googleapis.com cloudsql.instances.create instances/prj-bu1-d-boa-sql-f57b:boa-sql-2-d-us-east4-13d26ab8-replica-example0 boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com audit_log, method: "cloudsql.instances.create", principal_email: "boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com"  Notice

2021-05-07 00:51:08.739 BRT IAM cloudsql.googleapis.com cloudsql.instances.create instances/prj-bu1-d-boa-sql-f57b:boa-sql-2-d-us-east4-13d26ab8-replica-example0 boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com audit_log, method: "cloudsql.instances.create", principal_email: "boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com"

But when the replica is created we have the call to the creation of the replica and then we see many Postgress internal plumbing messages:

2021-05-07 18:30:08.738 BRT IAM cloudsql.googleapis.com cloudsql.instances.create instances/prj-bu1-d-boa-sql-f57b:boa-sql-1-d-us-east1-3ce36215-replica-example1 boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com audit_log, method: "cloudsql.instances.create", principal_email: "boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com"

2021-05-07 18:30:09.392 BRT IAM cloudsql.googleapis.com cloudsql.instances.create instances/prj-bu1-d-boa-sql-f57b:boa-sql-1-d-us-east1-3ce36215-replica-example1 boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com audit_log, method: "cloudsql.instances.create", principal_email: "boa-terraform-d-sa@prj-bu1-d-boa-sec-c988.iam.gserviceaccount.com"

2021-05-07 21:34:18.243 UTC [1]: [1-1] db=,user= LOG: starting PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by Debian clang version 10.0.1 , 64-bit

2021-05-07 21:34:18.245 UTC [1]: [2-1] db=,user= LOG: listening on IPv4 address "0.0.0.0", port 5432

2021-05-07 21:34:18.245 UTC [1]: [3-1] db=,user= LOG: listening on IPv6 address "::", port 5432

2021-05-07 21:34:18.319 UTC [1]: [4-1] db=,user= LOG: listening on Unix socket "/pgsql/.s.PGSQL.5432"

2021-05-07 21:34:18.334 UTC [11]: [1-1] db=,user= LOG: database system was interrupted; last known up at 2021-05-07 21:29:58 UTC

2021-05-07 21:34:18.438 UTC [12]: [1-1] db=cloudsqladmin,user=cloudsqladmin FATAL: the database system is starting up

2021-05-07 21:34:18.441 UTC [13]: [1-1] db=cloudsqladmin,user=cloudsqladmin FATAL: the database system is starting up

2021-05-07 21:34:18.505 UTC [11]: [2-1] db=,user= LOG: entering standby mode

2021-05-07 21:34:18.596 UTC [11]: [3-1] db=,user= LOG: database system was not properly shut down; automatic recovery in progress

2021-05-07 21:34:18.600 UTC [11]: [4-1] db=,user= LOG: redo starts at 0/18D2398

2021-05-07 21:34:18.605 UTC [11]: [5-1] db=,user= LOG: invalid record length at 0/18DB798: wanted 24, got 0

2021-05-07 21:34:18.616 UTC [11]: [6-1] db=,user= LOG: consistent recovery state reached at 0/18DB798

2021-05-07 21:34:18.616 UTC [1]: [5-1] db=,user= LOG: database system is ready to accept read only connections

...

In this case, we also receive an already exist error when we try again to run Terraform and in the console, the replica is marked with a red X without an assigned IP, Indicating that it is in some kind of partial creation failure state.

rutalreja-deloitte commented 3 years ago

@daniel-cit the database takes 5 mins and then each read replica on the database take approx 6 mins, as we can only create one read replica at a time and only after the database is present, a database with 2 read replicas takes 5+6+6 = 17 mins. During testing, I did raise the create_timeout to 20 mins but it did not seem to make a difference but can try again since I think I now have more of an understanding of how terraform is working.

daniel-cit commented 3 years ago

Hi @rutalreja-deloitte I got slower responses from the API:

module.env.module.sql["sql2"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-2-p-us-east4-replica-example0"]: 
Creation complete after 7m17s [id=boa-sql-2-p-us-east4-4a188453-replica-example0]

module.env.module.sql["sql1"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-1-p-us-east1-replica-example0"]: 
Creation complete after 7m8s [id=boa-sql-1-p-us-east1-a40c460a-replica-example0]

module.env.module.sql["sql1"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-1-p-us-east1-replica-example1"]: 
Creation complete after 12m33s [id=boa-sql-1-p-us-east1-a40c460a-replica-example1]

module.env.module.sql["sql1"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-1-n-us-east1-replica-example1"]:
Creation complete after 5m6s [id=boa-sql-1-n-us-east1-a52faf1f-replica-example1]
module.env.module.sql["sql2"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-2-n-us-east4-replica-example1"]:
Creation complete after 5m34s [id=boa-sql-2-n-us-east4-38023544-replica-example1]

module.env.module.sql["sql1"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-1-n-us-east1-replica-example0"]:
Creation complete after 10m1s [id=boa-sql-1-n-us-east1-a52faf1f-replica-example0]
module.env.module.sql["sql2"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-2-n-us-east4-replica-example0"]:
Creation complete after 10m28s [id=boa-sql-2-n-us-east4-38023544-replica-example0]

It may depend in load issues in the region we are using.

I I think that the timeouts are for each call of the replica creation: https://github.com/terraform-google-modules/terraform-google-sql-db/blob/51bb8bacc51ee227986a816a8cb1c72cb89eb5da/modules/postgresql/read_replica.tf#L87

daniel-cit commented 3 years ago

This PR https://github.com/terraform-google-modules/terraform-google-sql-db/pull/216 increased the time out configuration from 10 minutes to 15 minutes.

In a new deploy the creation of the replicas were successfully created for the development environment using the new timeout.

module.env.module.sql["sql2"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-2-d-us-west1-replica-example1"]:
 Creation complete after 7m3s [id=boa-sql-2-d-us-west1-01ee9e59-replica-example1]

module.env.module.sql["sql1"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-1-d-us-east1-replica-example0"]:
 Creation complete after 8m34s [id=boa-sql-1-d-us-east1-fcd70ee2-replica-example0]

module.env.module.sql["sql2"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-2-d-us-west1-replica-example0"]:
 Creation complete after 11m48s [id=boa-sql-2-d-us-west1-01ee9e59-replica-example0]

module.env.module.sql["sql1"].module.boa_postgress_ha.google_sql_database_instance.replicas["boa-sql-1-d-us-east1-replica-example1"]:
 Creation complete after 13m49s [id=boa-sql-1-d-us-east1-fcd70ee2-replica-example1]

I think that we can keep the troubleshooting instruction in the README but I think the issue will be less frequent. @bharathkkb @rutalreja-deloitte

rutalreja-deloitte commented 3 years ago

@daniel-cit that's amazing, do we need to update the pinned version to a new release?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

GoogleCloudPlatform / terraform-example-foundation-app