GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0
886 stars 216 forks source link

Existing RedisInstances failing to update state due hard requirement on readReplicaMode having value #622

Open n1koo opened 2 years ago

n1koo commented 2 years ago

Checklist

Bug Description

After upgrading to 1.74 from 1.61 all our old Redis instances started showing 'Update call failed: cannot make changes to immutable field(s): [readReplicasMode]'

THis issue was introduced in 1.71

This is due to the instance being created before this feature was a thing. It seems its NA for old instances and cannot be changed.

maqiuyujoyce commented 2 years ago

Hi @n1koo , thank you for reporting the issue! I tried to create a RedisInstance in Config Connector 1.61.0 and then upgraded the version to 1.74.0, but didn't encounter the error you described (see my yaml snippet and steps in the end of the comment). Could you share more details about this bug following the template below?

## Additional Diagnostic Information

### Kubernetes Cluster Version

<!--
Run the following command to get the Kubernetes cluster version:

kubectl version --short
-->

### Config Connector Version

<!--
Run the following command to get the installed Config Connector version:

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
-->

### Config Connector Mode

<!--
Run the following command to get the Config Connector Mode:

kubectl get ConfigConnector "configconnector.core.cnrm.cloud.google.com" -o=jsonpath="{@.spec.mode}"
-->

## Log Output

<!--
Include any relevant log output. See
  https://cloud.google.com/config-connector/docs/troubleshooting#logging

If the issue is for a specific resource, please include those logs.
-->

## Steps to Reproduce

### Steps to reproduce the issue
<!--- Steps needed to reproduce the issue. --->

### YAML snippets
<!--- YAML snippets needed to reproduce the issue. See the following as an example. --->

```yaml
apiVersion: pubsub.cnrm.cloud.google.com/v1beta1
kind: PubSubTopic
metadata:
  labels:
    label-one: "value-one"
  name: pubsubtopic-sample

====== YAML and steps I tried ======
The YAML snippet I used:

apiVersion: redis.cnrm.cloud.google.com/v1beta1 kind: RedisInstance metadata: name: redisinstance-test namespace: [namespace] spec: displayName: Sample Redis Instance region: us-central1 tier: BASIC memorySizeGb: 1



Steps:
1. Installed Config Connector `1.61.0` in cluster mode.
2. Created a `RedisInstance` using the YAML snippet above and waited for for it to be ready.
3. Upgraded Config Connector to `1.74.0`.
4. Run `kubectl describe` to check the status of the RedisInstance and it is `UpToDate`.
n1koo commented 2 years ago

@maqiuyujoyce the problem exists with older redis instances. In our case they were atleast 270days old.

Otherwise your steps are correct, but you require an older instance to reproduce. I don't think going through the template has much value here as its not a k8s issue but just an issue how those values are in GCP backend

For example comparing a new and old instance via gcloud:

Old (as in older than 9months)

‽ gcloud redis instances describe <new-instance>  --region <region> --project <project>|grep read
(nothing matched)

In console showing Read Replicas: NA

New(as in newer than 9months)

‽ gcloud redis instances describe <new-instance>  --region <region> --project <project>|grep read
readReplicasMode: READ_REPLICAS_DISABLED
maqiuyujoyce commented 2 years ago

Hi @n1koo , thanks for sharing the context.

Have you tried to abandon the RedisInstance and acquire it with readReplicasMode explicitly set to NA?

n1koo commented 2 years ago

Sorry @maqiuyujoyce not yet. We reverted the upgrade for now. Doing that dance for our instances would be somewhat painful though as its not just a handful.

We could write that automation, but I was a little bit hoping config-connector could handle it better internally?

This change is not mentioned even as breaking change

jcanseco commented 2 years ago

Hi @n1koo, this is indeed strange behavior.

Can you get the state of your Redis instances by hitting the API directly? Use "Try It!" here: https://cloud.google.com/memorystore/docs/redis/reference/rest/v1/projects.locations.instances/get

I'm particularly interested in the readReplicasMode value returned by the API.

Also: can you verify that your RedisInstance resources don't specify a readReplicasMode after the KCC upgrade?

jcanseco commented 2 years ago

It may be worth noting that "NA" is not a valid readReplicasMode value, but this should be irrelevant as long as there is no diff between the resource configuration that KCC tries to apply and the resource state that KCC reads from the API.

As far as I can tell, we are not modifying the readReplicasMode value we read back from the API at all even if the value is technically invalid.

Unfortunately, this is also going to be a bit challenging to debug for us since we do not have "old" Redis instances to work with.

jcanseco commented 2 years ago

Ok, I suspect the API is actually returning readReplicasMode: "" for "old" Redis instances -- can you confirm this @n1koo?

n1koo commented 2 years ago

@jcanseco sorry, i dont have access to that due to corp policies but just checking it via google-cloud-redis and eg simple script

credentials, project = google.auth.default()
client = redis_v1.CloudRedisClient(credentials=credentials)
project_id = "xxx"
location = "us-east1"
memorystore_instance_name = "xxx"

parent = f"projects/{project_id}/{location}"

memorystore_instance_path = client.instance_path(
    project_id, location, memorystore_instance_name)
try:
    memorystore_instance = client.get_instance(
        request={'name': memorystore_instance_path})
    log.info(memorystore_instance)

the whole entry for read_replicas_mode was missing (not just value being empty or nil)

eg. a 2 year old redis:

display_name: "xxx"
location_id: "us-east1-d"
redis_version: "REDIS_4_0"
reserved_ip_range: "xxx"
host: "xxx"
port: 6379
current_location_id: "us-east1-d"
create_time {
  seconds: 1585590006
  nanos: 241136792
}
state: READY
tier: BASIC
memory_size_gb: 1
authorized_network: "xxx"
persistence_iam_identity: "xxx"
connect_mode: DIRECT_PEERING
transit_encryption_mode: DISABLED
maintenance_policy {
  create_time {
    seconds: 1628336720
    nanos: 787160000
  }
  update_time {
    seconds: 1628336720
    nanos: 787160000
  }
  weekly_maintenance_window {
    day: SATURDAY
    start_time {
      hours: 5
    }
    duration {
      seconds: 3600
    }
  }
}
nodes {
  id: "node-0"
  zone: "us-east1-d"
}

vs a recent'ish redis (2months old)

display_name: "xxx"
location_id: "us-east1-c"
redis_version: "REDIS_4_0"
reserved_ip_range: "xxx"
host: "xxx"
port: 6379
current_location_id: "us-east1-c"
create_time {
  seconds: 1643979721
  nanos: 613829957
}
state: READY
tier: BASIC
memory_size_gb: 1
authorized_network: "xxx"
persistence_iam_identity: "xxx"
connect_mode: DIRECT_PEERING
transit_encryption_mode: DISABLED
maintenance_policy {
  create_time {
    seconds: 1643979932
    nanos: 571797000
  }
  update_time {
    seconds: 1643979932
    nanos: 571797000
  }
  weekly_maintenance_window {
    day: SATURDAY
    start_time {
      hours: 5
    }
    duration {
      seconds: 3600
    }
  }
}
nodes {
  id: "node-0"
  zone: "us-east1-c"
}
read_replicas_mode: READ_REPLICAS_DISABLED
jcanseco commented 2 years ago

Hi @n1koo. Gotcha, I'll check in with the Redis team internally, but if you can, it would really help us if you can confirm what value for readReplicasMode is returned by the API. Checking what a client library returns is insufficient unfortunately since they don't necessarily fully reflect what the API actually returns.

jcanseco commented 2 years ago

Hi @n1koo, can you please report this to GCP support so that the Redis team can take a look?

I chatted with the Redis team internally and it seems that old Redis instances should still also have a value for readReplicasMode (READ_REPLICAS_DISABLED).

If you are seeing that your instances are not reporting a value for readReplicasMode (even if it is through google-cloud-redis), I think that might be worth reporting since that appears to be unintended behavior.

jcanseco commented 2 years ago

Nevermind! It seems Terraform users have also reported facing this issue: https://github.com/hashicorp/terraform-provider-google/issues/11410

And Terraform team is able to reproduce the issue as they have an old Redis instance that predates the read replicas feature. Let me work with the Terraform and Redis teams internally to see what we can do.

n1koo commented 2 years ago

Awesome thank you! I also finally got the access so heres the raw api response;

{
  "name": "projects/xxx/locations/xxx/instances/xxx",
  "displayName": "xx",
  "labels": {
    //xx
  },
  "locationId": "us-east1-b",
  "redisVersion": "REDIS_4_0",
  "reservedIpRange": "xxx",
  "host": "xxx",
  "port": 6379,
  "currentLocationId": "us-east1-b",
  "createTime": "2020-04-22T19:51:04.021313435Z",
  "state": "READY",
  "tier": "BASIC",
  "memorySizeGb": 8,
  "authorizedNetwork": "projects/xxx/global/networks/xxx",
  "persistenceIamIdentity": "serviceAccount:xxx-compute@developer.gserviceaccount.com",
  "connectMode": "DIRECT_PEERING",
  "transitEncryptionMode": "DISABLED",
  "maintenancePolicy": {
    "createTime": "2021-08-07T11:01:09.836707Z",
    "updateTime": "2021-08-07T11:01:09.836707Z",
    "weeklyMaintenanceWindow": [
      {
        "day": "SATURDAY",
        "startTime": {
          "hours": 5
        },
        "duration": "3600s"
      }
    ]
  },
  "nodes": [
    {
      "id": "node-0",
      "zone": "us-east1-b"
    }
  ]
}

Seems this is already patched in terraform-provider so 🤞 fix will land soon to your dependency

jcanseco commented 2 years ago

Yep that is correct. We'll post an update once this has been fixed in KCC :)

Veres72 commented 2 years ago

We faced with the same problem. JFYI: Our work around without rolling back config connector

  1. Add annotation cnrm.cloud.google.com/deletion-policy: abandon to redisinstnace k8s object
  2. Delete redisinstnace kubernetes object
  3. Add spec.readReplicasMode: "" to yaml
  4. Deploy redisinstnace

Not sure that it's the best solution and how it will be behave in future after bug will be resolved.

jcanseco commented 2 years ago

@Veres72 thank you for sharing! Yes that workaround makes sense and is what we recommend for now.

We are looking to release a new Config Connector version with a fix which we intend to also roll out to GKE 1.23+ clusters using the KCC add-on.

We are also asking the Redis team to implement a server-side fix to accommodate users who may not be able to upgrade for whatever reason, but we have not received updates from them yet.