Possible to acquire an existing Cloud SQLInstance without service interruption

GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources

https://cloud.google.com/config-connector/docs/overview

Apache License 2.0


Possible to acquire an existing Cloud SQLInstance without service interruption #775

Open chadkouse opened 1 year ago

chadkouse commented 1 year ago

Describe your question

I used the output from gcloud sql instances describe {INSTANCE_NAME} to build a configuration for a Cloud SQL instance, making sure all of the settings (the tier, the network, etc.) were the same. When I applied the config to my cluster it did acquire the resource, but its status in k8s was Updating and the Google Cloud console also showed it updating. The server even became unavailable for a few minutes.

I wasn't able to get information about why the resource was being updated.

I would like to perform this process again in our production environment but I am wondering if there is a zero-downtime version of this process?

diviner524 commented 1 year ago

@chadkouse CloudSQLInstance is a complex resource, and the underlying GCP API can also be tricky in certain cases.

Some suggestions you can experiment with:

Instead of using a full configuration to acquire an existing resource, can you try with a "minimal" configuration first and then expand the configuration based on your need? By minimal, we are referring to only the required fields in the CRD schema. A minimal configuration means the user has no opinion on values of other existing fields which are not specified in the YAML, thus reducing the likelihood of causing an "Updating" event.
You can also consider setting the annotation cnrm.cloud.google.com/state-into-spec to absent [1] in your CloudSQLInstance YAML. This changes the behavior of the KCC controller and stops it from populating unspecified values back into the K8s spec. This feature can be helpful when working with non-standard API behaviors.

[1] https://cloud.google.com/config-connector/docs/concepts/ignore-unspecified-fields
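For illustration, a "minimal" acquisition manifest combining both suggestions might look roughly like the sketch below. This is a hedged example, not a verified recipe: the field values are borrowed from the YAML posted later in this thread, the metadata.name is a placeholder that is assumed to match the name of the existing instance, and exactly which fields are required is an assumption that should be checked against the CRD schema for your Config Connector version.

```yaml
# Sketch only: a minimal SQLInstance manifest for acquiring an existing instance.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  # Placeholder: assumed to equal the name of the existing Cloud SQL instance.
  name: existing-instance-name
  annotations:
    # Do not write unspecified fields back into the spec (see [1]).
    cnrm.cloud.google.com/state-into-spec: "absent"
    # Leave the instance in place if this Kubernetes object is deleted.
    cnrm.cloud.google.com/deletion-policy: "abandon"
spec:
  # Illustrative values; they should match what the instance already uses.
  databaseVersion: POSTGRES_13
  region: asia-northeast1
  settings:
    tier: db-custom-1-3840
```

The idea is to state as few opinions as possible at first, then add fields one at a time while watching whether the instance goes into "Updating".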

chadkouse commented 1 year ago

@diviner524 Thanks for the info, I just tested this -- It looks like by using a minimal config it still set the instance to "updating" but only for around 30 seconds or so (see image). I didn't get a chance to test if connectivity was lost during that update. I'll try to test that and report back soon but this may be the answer.

Setting state-into-spec to absent resulted in the following error message: kind 'SQLInstance' does not support having annotation 'cnrm.cloud.google.com/state-into-spec' set to value 'absent'. So maybe that's not a viable option for SQLInstance.

diviner524 commented 1 year ago

@chadkouse

If using the minimal config still gives you "Updating", it might be related to some specific fields/config in your YAML.
The state-into-spec: absent feature has been supported in SQLInstance for a while (since v1.94.0). It sounds like you are using an old version of Config Connector, which also implies there might be bugs in the CloudSQLInstance resource that have already been fixed. Can you try with the latest version?

tsawada commented 3 months ago

We also noticed that our SQLInstance keeps getting updated every 10 minutes with KCC v1.120.1 and stateIntoSpec=absent. Among ~50 kinds of resources we use, we also see this behavior with StorageTransferJob.

Is there anything we can do to find out what field is causing this?

tsawada commented 4 weeks ago

This is still happening with v1.124.0. Is there anything we can do?

jasonvigil commented 3 weeks ago

> This is still happening with v1.124.0. Is there anything we can do?

@tsawada can you please share the SQLInstance CR YAML you are using?

tsawada commented 3 weeks ago

Thanks @jasonvigil . Here's what we use:

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: postgres-db-1
  annotations:
    cnrm.cloud.google.com/deletion-policy: "abandon"
    cnrm.cloud.google.com/state-into-spec: "absent"
spec:
  databaseVersion: POSTGRES_13
  region: asia-northeast1
  settings:
    backupConfiguration:
      backupRetentionSettings:
        retainedBackups: 100
        retentionUnit: COUNT
      enabled: true
      pointInTimeRecoveryEnabled: true
      startTime: '0:00'
      transactionLogRetentionDays: 7
    databaseFlags:
      - name: cloudsql.iam_authentication
        value: "1"
    diskAutoresize: true
    diskType: PD_HDD
    deletionProtectionEnabled: true
    ipConfiguration:
      ipv4Enabled: false
      privateNetworkRef:
        name: "my-private-network"
      sslMode: "TRUSTED_CLIENT_CERTIFICATE_REQUIRED"
    tier: db-custom-1-3840

jasonvigil commented 2 weeks ago

Ok, there appears to be a combination of a few issues going on here @tsawada. I just made a fix for one of the code issues: https://github.com/GoogleCloudPlatform/k8s-config-connector/pull/3106.

However, there are a couple of issues with the YAML you posted.

The value for the cloudsql.iam_authentication database flag should be "on", not "1". Ref: https://cloud.google.com/sql/docs/postgres/flags#postgres-c
The ipConfiguration should specify requireSsl: true (because the instance type is postgres, and sslMode: "TRUSTED_CLIENT_CERTIFICATE_REQUIRED" is specified). Ref: https://cloud.google.com/sql/docs/mysql/admin-api/rest/v1/instances#ipconfiguration
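Applied to the manifest posted above, the two suggested changes would look roughly like this fragment. It is a sketch showing only the affected parts of spec.settings; everything else in the original YAML is assumed to stay unchanged.

```yaml
# Fragment of spec.settings with the two suggested corrections applied.
spec:
  settings:
    databaseFlags:
      - name: cloudsql.iam_authentication
        value: "on"          # was "1"
    ipConfiguration:
      ipv4Enabled: false
      privateNetworkRef:
        name: "my-private-network"
      requireSsl: true       # added to accompany the sslMode below
      sslMode: "TRUSTED_CLIENT_CERTIFICATE_REQUIRED"
```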

After you make those updates, the fix above lands, and we enable the new version of the controller code (planned for a future release, perhaps v1.126), this issue should be resolved.

jasonvigil commented 2 weeks ago

@chadkouse, for the original issue, could you please also share the CR YAML you are using?

jasonvigil commented 1 week ago

@tsawada the fixes are now in.

First, you will need to upgrade to version 1.125 (https://github.com/GoogleCloudPlatform/k8s-config-connector/releases/tag/v1.125.0).
Then, update your YAML as suggested in https://github.com/GoogleCloudPlatform/k8s-config-connector/issues/775#issuecomment-2458414898.
Lastly, you will need to enable the new direct controller for each of the SQLInstance resources following this doc: https://github.com/GoogleCloudPlatform/k8s-config-connector/blob/master/docs/features/optin.md.

At that point, the "constantly re-updating" issue should be fixed.
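As a rough sketch of the last step: the opt-in is assumed here to be a per-resource annotation. The annotation key and value below are an assumption based on the opt-in doc linked above and may differ between releases, so please confirm them against that doc before applying.

```yaml
# Sketch only: opting a single SQLInstance into the direct controller.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: postgres-db-1
  annotations:
    cnrm.cloud.google.com/deletion-policy: "abandon"
    cnrm.cloud.google.com/state-into-spec: "absent"
    # Assumed opt-in annotation; verify the exact key/value in docs/features/optin.md.
    alpha.cnrm.cloud.google.com/reconciler: "direct"
# spec: unchanged apart from the YAML corrections suggested earlier in this thread
```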

tsawada commented 1 week ago

@jasonvigil Thank you so much for fixing this quickly! I'll try next week and will get back to you if things went well.