GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

Issues with ComputeInstance network interface immutability and thrashing #293

Open Scorpiion opened 3 years ago

Scorpiion commented 3 years ago

Describe the bug

We have two problems with Config Connector right now:

Status:
  Conditions:
    Last Transition Time:  2020-09-23T03:26:45Z
    Message:               Update call failed: the desired mutation for the following field(s) is invalid: [networkInterface.0.NetworkIp bootDisk.0.InitializeParams.0.Image]
    Reason:                UpdateFailed
    Status:                False
    Type:                  Ready
  Cpu Platform:            Intel Skylake
  Current Status:          RUNNING
  Instance Id:             xxxxxxxxxxx
  Label Fingerprint:       xxxxxxxxxxx
  Metadata Fingerprint:    xxxxxxxxxxx
  Self Link:               https://www.googleapis.com/compute/v1/projects/xxxxxxxxxxx/zones/europe-north1-b/instances/xxxxxxxxxxx
  Tags Fingerprint:        WP69uztoeGw=
Events:
  Type     Reason        Age                    From                        Message
  ----     ------        ----                   ----                        -------
  Warning  UpdateFailed  37m (x643 over 15d)    computeinstance-controller  Update call failed: the desired mutation for the following field(s) is invalid: [bootDisk.0.InitializeParams.0.Image networkInterface.0.NetworkIp]
  Warning  UpdateFailed  3m47s (x675 over 15d)  computeinstance-controller  Update call failed: the desired mutation for the following field(s) is invalid: [networkInterface.0.NetworkIp bootDisk.0.InitializeParams.0.Image]

The VM was set up correctly when the YAML was initially applied, but then Config Connector started giving these errors, and now we are in a state where we can't do updates. This is quite a big blocker right now (some things I have manually updated in the UI and "backported" to the YAML, or the other way around, but it's an accident waiting to happen). I have had this for about a month now, thinking that a GKE add-on update would come soon that might resolve it, but so far no luck. Initially we used the "stable" GKE release channel, but I moved us to the "regular" channel so we could use the GKE add-on (previously we had a script-based automation of the Config Connector install/update). Now we seem to be "stuck" with Config Connector version 1.15.1, released on 2020-03-19 (https://github.com/GoogleCloudPlatform/k8s-config-connector/releases/tag/1.5.1), which is more than 7 months ago. And to be honest, with the high development pace of Config Connector that is ages...

I think one of the problems above, networkInterface.0.NetworkIp, was resolved in 1.6.1 as described here. But even if it was fixed back in April, I don't seem to be able to get that update through the add-on.

As for the second error above about the boot disk: when I tested earlier, it seemed to happen only when using cos-cloud/cos-81-lts as the bootDisk.initializeParams.sourceImageRef.external value; when I used the Debian family of images the error did not appear, I think (it was some time ago that I tested, but I'm 95% sure it did not appear with Debian). So I've been thinking that this may also have been resolved in newer versions of Config Connector, but I'm not completely sure.

ConfigConnector Version

# GKE, regular channel
Master nodes: 1.17.9-gke.1504
Compute nodes: 1.17.9-gke.1504

# GKE Addon, regular channel
1.15.1

To Reproduce

Install GKE with the regular channel and the Config Connector add-on. Check that the version is 1.15.1. I don't think reproducing the error I got makes sense time-wise at the moment; I think it's more important to update the Config Connector add-on so I can check with a later version.

Questions

  1. Should the GKE regular channel be on 1.15.1, or has my cluster fallen behind on updates?
  2. If yes to the above, can you consider looking over the release schedule? When Config Connector is more stable / less frequently updated, the current schedule might make sense, but right now there are far too many "must have" features/fixes in new releases for the add-on to be this far behind. I work as a consultant and think Config Connector is a good solution for our client, but it gets hard to "sell" Config Connector with these kinds of problems (the main alternative would be Terraform, but I'd prefer not to switch). Putting the whole cluster on the rapid channel is not a good solution, as it would put their production-critical services on an unnecessarily aggressive/risky Kubernetes upgrade path.
  3. Is the GKE version <-> Config Connector add-on release schedule documented somewhere? I have not found any information on this, and it would have helped when debugging.
  4. Do you know of any workarounds for me until this is resolved? I have tried to delete the Config Connector object (with cnrm.cloud.google.com/deletion-policy: abandon; a sketch of the manifest I used is below this list) and then recreate it, but that gives the same error and no updates of, for example, the metadata user-data value that I want to update.
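
For reference, this is roughly how I set the deletion policy before deleting and re-applying the object (a minimal sketch; the instance name and namespace are placeholders, and the rest of the spec is omitted):

apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeInstance
metadata:
  annotations:
    # Tell Config Connector to leave the underlying VM in place when this
    # Kubernetes object is deleted, so it can be re-applied afterwards.
    cnrm.cloud.google.com/deletion-policy: abandon
  name: example-vm        # placeholder
  namespace: PROJECT_B    # placeholder
spec:
  # ...rest of the ComputeInstance spec unchanged...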

If I have missed some information above please just ask and I will try to provide it ASAP.

kibbles-n-bytes commented 3 years ago

Hey @Scorpiion, creating a new GKE cluster on the regular channel today installs version 1.23.0 of Config Connector. Add-on updates are triggered by every master node upgrade, which on the regular channel should certainly have happened between when the cluster looks to have been created and now.

Could you share the gcloud container clusters describe output for that particular cluster? And in addition, could you share the output of the following call:

kubectl get po -n configconnector-operator-system configconnector-operator-0 -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/operator-version}'

We do unfortunately have a time lag between our standalone release and our add-on release on the order of ~3 weeks, which we are looking to reduce in the future. However, the lag you're describing is way too long, so this must be some sort of issue either with your cluster configuration or internally on the Google API.

You described two problems in this thread, though; the title is about the lag being too long, but the VM problem seems separate. Could you confirm on a separate cluster on 1.27.0 if the issue still remains? Or, let's wait until we understand your add-on upgrade issue, and then see if it is still an issue at that point in time.

kibbles-n-bytes commented 3 years ago

To your questions #3 and #4:

  3. We currently, unfortunately, don't document the exact version relationship between GKE masters and Config Connector versions, but we intend to add that information. We also hope that in H1 2021 the add-on release will track closely enough with the standalone release that the exact mapping will no longer be as crucial.
  4. We'll need to investigate the VM issue separately in order to understand exactly what's going on, which will require a reproduction config. For now, the only thing we can offer is to abandon the resource so that you don't get the errors in the API server or the unnecessary read calls against GCP.

And sorry for the friction. We really appreciate your investment in Config Connector and want to ensure we are a good fit for your (and your clients') use cases.

Scorpiion commented 3 years ago

Hi @kibbles-n-bytes, thanks for the quick reply. Maybe I should also add that I have two clusters (dev/prod) that are both in the same state with the old Config Connector version. Both of these clusters had a manual install of Config Connector earlier (I had an earlier issue related to this that can be seen here: #287).

Here is the output from gcloud container clusters describe:

addonsConfig:
  configConnectorConfig:
    enabled: true
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig: {}
autoscaling: {}
binaryAuthorization: {}
clusterIpv4Cidr: 172.24.0.0/18
createTime: '2020-04-16T12:56:39+00:00'
currentMasterVersion: 1.17.9-gke.1504
currentNodeCount: 6
currentNodeVersion: 1.17.9-gke.1504
databaseEncryption:
  state: DECRYPTED
defaultMaxPodsConstraint:
  maxPodsPerNode: '110'
endpoint: 35.228.133.114
initialClusterVersion: 1.14.10-gke.27
initialNodeCount: 1
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-b/instanceGroupManagers/gke-PROJECT_NAME-default-pool-33323955-grp
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-c/instanceGroupManagers/gke-PROJECT_NAME-default-pool-1d45531e-grp
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-a/instanceGroupManagers/gke-PROJECT_NAME-default-pool-30bad168-grp
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-b/instanceGroupManagers/gke-PROJECT_NAME--shared-gvisor-no-22d9f3e7-grp
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-c/instanceGroupManagers/gke-PROJECT_NAME--shared-gvisor-no-e9ccd380-grp
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-a/instanceGroupManagers/gke-PROJECT_NAME--shared-gvisor-no-826aeaf0-grp
ipAllocationPolicy:
  clusterIpv4Cidr: 172.24.0.0/18
  clusterIpv4CidrBlock: 172.24.0.0/18
  clusterSecondaryRangeName: vnet-172-24-0-0-18-pod-range
  servicesIpv4Cidr: 172.24.192.0/20
  servicesIpv4CidrBlock: 172.24.192.0/20
  servicesSecondaryRangeName: vnet-172-24-192-0-20-service-range
  useIpAliases: true
labelFingerprint: 9cc782fd
legacyAbac: {}
location: europe-north1
locations:
- europe-north1-b
- europe-north1-c
- europe-north1-a
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
  resourceVersion: e3b0c442
masterAuth:
  clusterCaCertificate: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
masterAuthorizedNetworksConfig:
  cidrBlocks:
  - cidrBlock: xxxxxxxxxx
    displayName: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  - cidrBlock: xxxxxxxxxx
    displayName: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  - cidrBlock: xxxxxxxxxx
    displayName: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  - cidrBlock: xxxxxxxxxx
    displayName: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  enabled: true
monitoringService: monitoring.googleapis.com/kubernetes
name: CLUSTER_NAME
network: NETWORK_NAME
networkConfig:
  defaultSnatStatus: {}
  enableIntraNodeVisibility: true
  network: projects/NETWORK_NAME/global/networks/NETWORK_NAME
  subnetwork: projects/NETWORK_NAME/regions/europe-north1/subnetworks/vnet-172-24-240-0-26-kubenet
networkPolicy:
  enabled: true
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-2
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
  workloadMetadataConfig:
    mode: GKE_METADATA
nodePools:
- config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-2
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 1
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-b/instanceGroupManagers/gke-PROJECT_NAME-default-pool-33323955-grp
  - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-c/instanceGroupManagers/gke-PROJECT_NAME-default-pool-1d45531e-grp
  - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-a/instanceGroupManagers/gke-PROJECT_NAME-default-pool-30bad168-grp
  locations:
  - europe-north1-b
  - europe-north1-c
  - europe-north1-a
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '110'
  name: default-pool
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/PROJECT_ID/locations/europe-north1/clusters/CLUSTER_NAME/nodePools/default-pool
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
  version: 1.17.9-gke.1504
- config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    labels:
      sandbox.gke.io/runtime: gvisor
    machineType: e2-standard-4
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    sandboxConfig:
      type: GVISOR
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
    taints:
    - effect: NO_SCHEDULE
      key: sandbox.gke.io/runtime
      value: gvisor
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 1
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-b/instanceGroupManagers/gke-PROJECT_NAME--shared-gvisor-no-22d9f3e7-grp
  - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-c/instanceGroupManagers/gke-PROJECT_NAME--shared-gvisor-no-e9ccd380-grp
  - https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-north1-a/instanceGroupManagers/gke-PROJECT_NAME--shared-gvisor-no-826aeaf0-grp
  locations:
  - europe-north1-b
  - europe-north1-c
  - europe-north1-a
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '110'
  name: shared-gvisor-node-pool-1
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/PROJECT_ID/locations/europe-north1/clusters/CLUSTER_NAME/nodePools/shared-gvisor-node-pool-1
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
  version: 1.17.9-gke.1504
privateClusterConfig:
  enablePrivateNodes: true
  masterIpv4CidrBlock: 172.24.240.192/28
  peeringName: gke-nda9853dcb96357d4233-8a67-dd27-peer
  privateEndpoint: 172.24.240.194
  publicEndpoint: 35.228.133.114
releaseChannel:
  channel: REGULAR
resourceLabels:
  cnrm-lease-expiration: '1603438961'
  cnrm-lease-holder-id: btl7v7gqo9cmjt3dh2s0
  managed-by-cnrm: 'true'
selfLink: https://container.googleapis.com/v1/projects/PROJECT_ID/locations/europe-north1/clusters/CLUSTER_NAME
servicesIpv4Cidr: 172.24.192.0/20
shieldedNodes: {}
status: RUNNING
subnetwork: vnet-172-24-240-0-26-kubenet
workloadIdentityConfig:
  workloadPool: PROJECT_ID.svc.id.goog
zone: europe-north1

The version that command returns was also mentioned in my initial post, but I'll restate it here:

kubectl get po -n configconnector-operator-system configconnector-operator-0 -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/operator-version}'

1.15.1

Regarding the triggering of updates: are they triggered both by manual upgrades (clicking the upgrade button in the UI) and by automated upgrades? I have upgraded manually a bit because of announced Kubernetes vulnerabilities (this one specifically: https://cloud.google.com/kubernetes-engine/docs/security-bulletins#gcp-2020-012).

Scorpiion commented 3 years ago

Hi again @kibbles-n-bytes, I started to reproduce this on a brand new cluster, and I still get 1.15.1 as the installed config connector version.

Steps to reproduce:

  1. Create new empty project (optional)
  2. Create new cluster using defaults, select Regular channel and enable workload identity and config connector
  3. Wait...
  4. Connect to cluster
  5. Run
    kubectl get po -n configconnector-operator-system configconnector-operator-0 -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/operator-version}'

I still get 1.15.1 when doing this, so maybe this is the core problem here... I'll hold off on trying to recreate the VM config until we have solved this Config Connector version issue.

Here is also the full output of gcloud container clusters describe for this new test cluster:

gcloud container clusters describe --project=config-connector-debug-1 --zone=europe-north1-a cluster-1
addonsConfig:
  configConnectorConfig:
    enabled: true
  dnsCacheConfig: {}
  horizontalPodAutoscaling: {}
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
authenticatorGroupsConfig: {}
autoscaling: {}
clusterIpv4Cidr: 10.0.0.0/14
createTime: '2020-10-23T07:38:20+00:00'
currentMasterVersion: 1.17.9-gke.1504
currentNodeCount: 3
currentNodeVersion: 1.17.9-gke.1504
databaseEncryption:
  state: DECRYPTED
defaultMaxPodsConstraint:
  maxPodsPerNode: '110'
endpoint: 35.228.1.253
initialClusterVersion: 1.17.9-gke.1504
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/config-connector-debug-1/zones/europe-north1-a/instanceGroupManagers/gke-cluster-1-default-pool-15c97236-grp
ipAllocationPolicy:
  clusterIpv4Cidr: 10.0.0.0/14
  clusterIpv4CidrBlock: 10.0.0.0/14
  clusterSecondaryRangeName: gke-cluster-1-pods-2f860077
  servicesIpv4Cidr: 10.4.0.0/20
  servicesIpv4CidrBlock: 10.4.0.0/20
  servicesSecondaryRangeName: gke-cluster-1-services-2f860077
  useIpAliases: true
labelFingerprint: a9dc16a7
legacyAbac: {}
location: europe-north1-a
locations:
- europe-north1-a
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
  resourceVersion: e3b0c442
masterAuth:
  clusterCaCertificate: XXXXXXXXXXXXXXXXXXXXXXXXXXX
masterAuthorizedNetworksConfig: {}
monitoringService: monitoring.googleapis.com/kubernetes
name: cluster-1
network: default
networkConfig:
  network: projects/config-connector-debug-1/global/networks/default
  subnetwork: projects/config-connector-debug-1/regions/europe-north1/subnetworks/default
networkPolicy: {}
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: e2-medium
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
  workloadMetadataConfig:
    mode: GKE_METADATA
nodePools:
- autoscaling: {}
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: e2-medium
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 3
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/config-connector-debug-1/zones/europe-north1-a/instanceGroupManagers/gke-cluster-1-default-pool-15c97236-grp
  locations:
  - europe-north1-a
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '110'
  name: default-pool
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/config-connector-debug-1/zones/europe-north1-a/clusters/cluster-1/nodePools/default-pool
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
  version: 1.17.9-gke.1504
releaseChannel:
  channel: REGULAR
selfLink: https://container.googleapis.com/v1/projects/config-connector-debug-1/zones/europe-north1-a/clusters/cluster-1
servicesIpv4Cidr: 10.4.0.0/20
shieldedNodes: {}
status: RUNNING
subnetwork: default
workloadIdentityConfig:
  workloadPool: config-connector-debug-1.svc.id.goog
zone: europe-north1-a

kibbles-n-bytes commented 3 years ago

Hey @Scorpiion, I checked our GKE <-> Config Connector version association, and the Kubernetes version you are currently at, 1.17.9-gke.1504, is in fact intended to be on Config Connector 1.15.0. However, that is because this GKE master version is quite out of date at this point. I attempted to emulate your environment but am only able to get 1.17.12-gke.1504 as the default version for the regular release channel, which is on Config Connector 1.23.0 (the intended regular channel version). I checked our telemetry and can confirm that yours seems to be the only cluster that has Config Connector at 1.15.0 through the add-on; 1.19.0 is the next lowest number.

This doesn't seem to be a Config Connector issue; it is a more general question about the default GKE master version on the regular channel in your environment. As a mitigation, can you attempt to manually trigger a master upgrade to 1.17.12-gke.1504, and then follow up with GKE support as to why your master version is so outdated?
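
Something along these lines should trigger the master upgrade (a sketch only; substitute your actual cluster name and --region/--zone):

gcloud container clusters upgrade CLUSTER_NAME --master --cluster-version 1.17.12-gke.1504 --region europe-north1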

Scorpiion commented 3 years ago

Hi @kibbles-n-bytes, and sorry for the slow reply; I was sick last week but am back at work now.

After your comment here, a new GKE version became available on the regular channel for me. In my post above I had created two brand-new clusters and they both got the old GKE master version. Now I have a newer GKE master and hence also a newer Config Connector add-on, so that part is solved. I'm now on 1.23.0.

Now on to my issues: they did not go away. I have, however, managed to edit the YAML files so that I now have no errors, as a workaround. I still think these are bugs worth fixing. I'll repeat the core error message here:

Update call failed: the desired mutation for the following field(s) is invalid: [networkInterface.0.NetworkIp bootDisk.0.InitializeParams.0.Image]

Issue number 1, networkInterface.0.NetworkIp

When I created the VM I used an external reference to an internal IP that I had created/reserved. That worked and the VM got the correct IP, but then it stops working (the external reference doesn't work after creation). If I instead replace it with the hardcoded IP afterwards, the error goes away.

So this worked on creation (leaving out other fields):

    networkInterface:
      networkIp: https://www.googleapis.com/compute/v1/projects/xxxxxxx/regions/europe-north1/addresses/xxxxx-internal-ip

But after VM creation it stops working and gives the error above; when I replace it like this, the error goes away:

    networkInterface:
      networkIp: 172.22.0.2

I think this is a bug with the external reference for networkIp: it does not resolve the external value into an IP; it seems like it compares the external URL with the actual IP, or something along those lines.
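
For completeness, the relevant part of my ComputeInstance spec after the workaround looks roughly like this (an abbreviated sketch in the same simplified form as the snippets above; the subnetworkRef value is a placeholder and other fields are left out):

    networkInterface:
      subnetworkRef:
        external: https://compute.googleapis.com/compute/v1/projects/PROJECT_A/regions/europe-north1/subnetworks/vnet-172-22-0-0-22-xxxxxx
      # Hardcoded IP instead of the address self-link; this is what currently
      # makes the "desired mutation ... is invalid" error go away after creation.
      networkIp: 172.22.0.2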

Issue number 2, bootDisk.0.InitializeParams.0.Image

When I created the VM I used the LTS version of Google's Container-Optimized OS. I referred to it like this (leaving out other fields):

    bootDisk:
      initializeParams:
        sourceImageRef:
          external: cos-cloud/cos-81-lts

It worked; it created the VM with the correct image. However, on updates it fails with the error above.

If I replace the value after VM creation with cos-cloud/cos-81-12871-1196-0, the error goes away (I got that value from looking in the GCP console). I don't remember if I deleted the Config Connector object in between or just updated the value; I might have deleted it in between (with cnrm.cloud.google.com/deletion-policy: abandon, so the VM stayed).

The error goes away if I replace cos-cloud/cos-81-lts with cos-cloud/cos-81-12871-1196-0:

    bootDisk:
      initializeParams:
        sourceImageRef:
          external: cos-cloud/cos-81-12871-1196-0

I'm thinking this is also a bug to fix: it should either not work at all with external: cos-cloud/cos-81-lts, or it should work fully. Or what do you think? =)

Scorpiion commented 3 years ago

Update: the VM is now in a restart loop...

The networkIp workaround did not actually work. It works for one cycle, then Config Connector thinks something has changed and restarts the whole VM, and it basically keeps doing that forever. Luckily I tried this on a development server first. This is what I get from the activity logs in the GCP console:

This pattern repeats about every 10 minutes, continuously:

Completed: Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Completed: Set machine type on VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set machine type on VM app-name-mysql
Set machine type on VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set machine type on VM app-name-mysql
Completed: Stop VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com stopped VM app-name-mysql
Stop VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com stopped VM app-name-mysql
Completed: Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Completed: Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Update bucket svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com updated pro-mehi-project-name-mysql-backups

Completed: Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Completed: Set machine type on VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set machine type on VM app-name-mysql
Set machine type on VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set machine type on VM app-name-mysql
Completed: Stop VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com stopped VM app-name-mysql
Stop VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com stopped VM app-name-mysql
Completed: Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Completed: Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Completed: beta.compute.instances.setLabels svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com has executed beta.compute.instances.setLabels on app-name-mysql
beta.compute.instances.setLabels svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com has executed beta.compute.instances.setLabels on app-name-mysql
beta.compute.addresses.setLabels svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com has executed beta.compute.addresses.setLabels on app-name-db-internal-ip
beta.compute.addresses.setLabels svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com has executed beta.compute.addresses.setLabels on app-name-db-external-ip
Set labels of disk svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set labels of disk mysql-data

Completed: Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Completed: Set machine type on VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set machine type on VM app-name-mysql
Set machine type on VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com set machine type on VM app-name-mysql
Completed: Stop VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com stopped VM app-name-mysql
Stop VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com stopped VM app-name-mysql
Completed: Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Completed: Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Update bucket svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com updated pro-mehi-project-name-mysql-backups
beta.compute.addresses.setLabels svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com has executed beta.compute.addresses.setLabels on app-name-db-internal-ip

Completed: Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
Start VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com started VM app-name-mysql
....

These rows caught my attention:

Completed: Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Add access config to VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com added access config to VM app-name-mysql
Completed: Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql
Delete access config from VM svc-cnrm-project-name@PROJECT_ID.iam.gserviceaccount.com deleted access config from VM app-name-mysql

I think it is the same as this CLI command: https://cloud.google.com/sdk/gcloud/reference/compute/instances/delete-access-config

So it's about deleting and adding network access config, which sounds related to the networkIp setting that I changed in the workaround above...

If I try to change back to the external URL format, I get the same error again. This config:

    networkIp: https://www.googleapis.com/compute/v1/projects/xxxxxxxx/regions/europe-north1/addresses/xxxxx-internal-ip

gives this error:

  status:
    conditions:
    - lastTransitionTime: "2020-11-02T19:28:57Z"
      message: 'Update call failed: the desired mutation for the following field(s)
        is invalid: [networkInterface.0.NetworkIp]'
      reason: UpdateFailed
      status: "False"
      type: Ready

Something that might be related to this is my network setup: it is a shared VPC, where the subnet lives in the host project (PROJECT_A) while the VM and the reserved address live in the service project (PROJECT_B).

The internal IPs that I have problems with look like this (gcloud JSON output; note the references to both PROJECT_A and PROJECT_B):

  {
    "address": "172.22.0.2",
    "addressType": "INTERNAL",
    "creationTimestamp": "2020-09-22T19:19:30.555-07:00",
    "description": "Static internal ip",
    "id": "xxxxxxxxx",
    "kind": "compute#address",
    "name": "xxxxxxxxx-internal-ip",
    "networkTier": "PREMIUM",
    "purpose": "GCE_ENDPOINT",
    "region": "https://www.googleapis.com/compute/v1/projects/PROJECT_B/regions/europe-north1",
    "selfLink": "https://www.googleapis.com/compute/v1/projects/PROJECT_B/regions/europe-north1/addresses/xxxxxxxxx-internal-ip",
    "status": "IN_USE",
    "subnetwork": "https://www.googleapis.com/compute/v1/projects/PROJECT_A/regions/europe-north1/subnetworks/vnet-172-22-0-0-22-xxxxxxxxx",
    "users": [
      "https://www.googleapis.com/compute/v1/projects/PROJECT_B/zones/europe-north1-b/instances/xxxxxxxxx"
    ]
  }

The Config Connector YAML for the internal IP:

apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeAddress
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
  name: xxxxxxxxxxx-internal-ip
  namespace: PROJECT_B
spec:
  address: 172.22.0.2
  addressType: INTERNAL
  description: Static internal ip
  ipVersion: IPV4
  location: europe-north1
  networkRef:
    external: https://compute.googleapis.com/compute/v1/projects/PROJECT_A/global/networks/mehivpc
  subnetworkRef:
    external: https://compute.googleapis.com/compute/v1/projects/PROJECT_A/regions/europe-north1/subnetworks/vnet-172-22-0-0-22-xxxxxx

And yeah, the Config Connector logs say nothing helpful; they just keep reporting regular reconciles, and everything looks good there:

# kubectl logs -f -n cnrm-system cnrm-controller-manager-xxxxxxxxxxxxxx-0  manager

{"severity":"info","logger":"computedisk-controller","msg":"starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"mysql-data"}}
{"severity":"info","logger":"computedisk-controller","msg":"creating/updating underlying resource","resource":{"namespace":"NAMESPACE_NAME","name":"mysql-data"}}
{"severity":"info","logger":"computedisk-controller","msg":"successfully finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"mysql-data"}}
{"severity":"info","logger":"computeaddress-controller","msg":"starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"db-external-ip"}}
{"severity":"info","logger":"computeaddress-controller","msg":"creating/updating underlying resource","resource":{"namespace":"NAMESPACE_NAME","name":"db-external-ip"}}
{"severity":"info","logger":"computeaddress-controller","msg":"successfully finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"db-external-ip"}}
{"severity":"info","logger":"computeaddress-controller","msg":"starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"db-internal-ip"}}
{"severity":"info","logger":"computeaddress-controller","msg":"creating/updating underlying resource","resource":{"namespace":"NAMESPACE_NAME","name":"db-internal-ip"}}
{"severity":"info","logger":"computeaddress-controller","msg":"successfully finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"db-internal-ip"}}
{"severity":"info","logger":"computeinstance-controller","msg":"starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"app-mysql"}}
{"severity":"info","logger":"computeinstance-controller","msg":"creating/updating underlying resource","resource":{"namespace":"NAMESPACE_NAME","name":"app-mysql"}}
{"severity":"info","logger":"computeinstance-controller","msg":"successfully finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"app-mysql"}}
{"severity":"info","logger":"storagebucket-controller","msg":"starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"PROJECT_NAME-mysql-backups"}}
{"severity":"info","logger":"storagebucket-controller","msg":"creating/updating underlying resource","resource":{"namespace":"NAMESPACE_NAME","name":"PROJECT_NAME-mysql-backups"}}
{"severity":"info","logger":"storagebucket-controller","msg":"successfully finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"PROJECT_NAME-mysql-backups"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-secretmanager-secret-accessor"}}
{"severity":"info","logger":"tfiamclient","msg":"underlying resource is already up to date","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-secretmanager-secret-accessor"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-secretmanager-secret-accessor"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-logging-log-writer"}}
{"severity":"info","logger":"tfiamclient","msg":"underlying resource is already up to date","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-logging-log-writer"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-logging-log-writer"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-monitoring-metric-writer"}}
{"severity":"info","logger":"tfiamclient","msg":"underlying resource is already up to date","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-monitoring-metric-writer"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-monitoring-metric-writer"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-secretmanager-viewer"}}
{"severity":"info","logger":"tfiamclient","msg":"underlying resource is already up to date","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-secretmanager-viewer"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-secretmanager-viewer"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Starting reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-storage-object-admin"}}
{"severity":"info","logger":"tfiamclient","msg":"underlying resource is already up to date","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-storage-object-admin"}}
{"severity":"info","logger":"iampolicymember-controller","msg":"Finished reconcile","resource":{"namespace":"NAMESPACE_NAME","name":"gsa-mysql-storage-object-admin"}}

Scorpiion commented 3 years ago

I can add that I have also tried putting the internal IP inside the shared VPC host project (I don't want to do it that way because of IAM rules/access etc.), but I can confirm that it does not work either, so I guess I did it correctly by having the internal IP inside the client project (the subnet is part of the host project).

This is the output when trying to have the internal ip in the host project:

status:
  conditions:
    - lastTransitionTime: "2020-11-03T13:40:08Z"
      message: 'Update call failed: error applying desired state: Error creating instance:
      googleapi: Error 400: Invalid value for field ''resource.networkInterfaces[0].networkIP'':
      ''https://compute.googleapis.com/compute/v1/projects/xxxxxxxx/regions/europe-north1/addresses/tmp-test-vm-internal-ip-2''.
      IP address ''projects/xxxxxxxx/regions/europe-north1/addresses/tmp-test-vm-internal-ip-2''
      (172.22.0.22) is reserved by another project., invalid'
      reason: UpdateFailed
      status: "False"
      type: Ready

I can also confirm that I have reproduced this same error on a new VM.

Scorpiion commented 3 years ago

Hi @kibbles-n-bytes, is there anything I can do to help progress this issue? It blocks some of our work, and it would be very helpful if we could find a solution or workaround other than to stop using Config Connector.

toumorokoshi commented 3 years ago

Hey @Scorpiion, thanks for the incredibly detailed debug information!

@kibbles-n-bytes is on vacation for a bit, but I'm catching up and will get you a reply by tomorrow at the latest.

At this point your description is clear to me. I'm doing some work to see if I can repro the situation on our side (starting with the NetworkIP external reference not working as expected); we don't want you to have to work around config-connector either.

toumorokoshi commented 3 years ago

Random quick question: are you using Config Connector in namespaced mode? It's a relatively new feature so I presume not, but just checking.

toumorokoshi commented 3 years ago

Hi! Quick update: I can repro scenario 1. We're having a discussion internally about the right resolution there.

Scorpiion commented 3 years ago

Hi @toumorokoshi, thanks for filling in and I hope I did not disturb @kibbles-n-bytes on his vacation.

We do use Config Connector in namespaced mode; I would say we are a very early adopter of Config Connector and have used it since early this year.

Great to hear that you were able to reproduce scenario 1. Thanks for moving this along, and let me know if there is anything I can do to help! (I'm also open to doing a hangout/Google Meet session if it would help.)

toumorokoshi commented 3 years ago

Hi! To start with, I wanted to provide some more information on the situation:

Repro

I can replicate two out of the three issues at play here:

Notes:

Mitigation

Fix

networkIP

Unfortunately, fixing networkIP to not error on a selflink is non-trivial. I'm looking into some ways to get this fixed, but the best choice for now is to hard-code the value. I can't give a good ETA on the fix.

I'm actually curious how you came to use a selflink: our documentation states that this must be an IP address; the fact that a selflink is accepted at creation is an implementation detail, not a feature.

I presume the main reason you did this is that we don't support a ComputeAddress resourceRef for networkIP. That's a bit easier, so I'm looking into that first. I can come back with an answer in the next few days on whether this is a possibility.
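
To make the question concrete, such a reference might hypothetically look something like this (this field does not exist today; the name and shape are made up purely to illustrate the idea):

    networkInterface:
      # HYPOTHETICAL field -- not supported today. The controller would resolve
      # the referenced ComputeAddress to its reserved IP at reconcile time.
      networkIpRef:
        name: xxxxx-internal-ip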

sourceImageRef

This one is also a bit tricky, for reasons similar to networkIP, so I can't give a good ETA. I would recommend using the fully qualified image for now, rather than the family.
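
For example, pinning to a specific image rather than the family alias, as you already found works (I believe the longer projects/cos-cloud/global/images/... path form should also be accepted, but treat that as an assumption):

    bootDisk:
      initializeParams:
        sourceImageRef:
          # Pinned image instead of the "cos-81-lts" family alias.
          external: cos-cloud/cos-81-12871-1196-0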

Final notes / questions

I apologize, since I'm sure this isn't the answer you wanted, but I am exploring options. I'll update this issue if anything is doable for the networkIP / sourceImageRef external errors. In the meantime, can you confirm or deny that a ComputeAddress reference would be helpful if it existed as an option for networkIP?