crossplane-contrib / provider-gcp

Crossplane GCP provider
Apache License 2.0
124 stars 101 forks source link

CloudSQLInstance keeps modifying the instance #164

Closed timstoop closed 4 years ago

timstoop commented 4 years ago

What happened?

After creating a CloudSQLInstance, the system keeps trying to make modifications to it. This can be seen in the console itself (the instances are perpetually in a state of being modified) and in the state.atProvider.settingsVersion, which keeps incrementing every few seconds. Sadly, there's nothing logged in stack-gcp or the other crossplane Pods that give any indication what it's trying to do.

How can we reproduce it?

This is the resource we created:

apiVersion: database.gcp.crossplane.io/v1beta1
kind: CloudSQLInstance
metadata:
  annotations:
    crossplane.io/external-name: infra-xplane-test-jkgk7
  creationTimestamp: "2020-02-07T14:03:51Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generateName: infra-xplane-test-
  generation: 2
  name: infra-xplane-test-jkgk7
  resourceVersion: "35099427"
  selfLink: /apis/database.gcp.crossplane.io/v1beta1/cloudsqlinstances/infra-xplane-test-jkgk7
  uid: 0201d2c2-3418-4d55-8f7f-18efd83d8e4a
spec:
  claimRef:
    apiVersion: database.crossplane.io/v1alpha1
    kind: PostgreSQLInstance
    name: xplane-test
    namespace: infra
    uid: 8f12b8aa-1691-486b-9694-4a9c854e1a6b
  classRef:
    apiVersion: database.gcp.crossplane.io/v1beta1
    kind: CloudSQLInstanceClass
    name: postgres-9.6
    uid: fa73b0ff-9f9b-4786-84b2-9f89153ceb4a
  forProvider:
    databaseVersion: POSTGRES_9_6
    gceZone: europe-west1-d
    instanceType: CLOUD_SQL_INSTANCE
    region: europe-west1
    settings:
      activationPolicy: ALWAYS
      backupConfiguration:
        binaryLogEnabled: false
        enabled: false
        startTime: "09:00"
      dataDiskSizeGb: 10
      dataDiskType: PD_SSD
      ipConfiguration:
        privateNetwork: projects/redacted/global/networks/default
      locationPreference:
        zone: europe-west1-d
      pricingPlan: PER_USE
      replicationType: SYNCHRONOUS
      storageAutoResize: true
      storageAutoResizeLimit: 500
      tier: db-g1-small
  providerRef:
    name: redacted-crossplane-provider
  reclaimPolicy: Delete
  writeConnectionSecretToRef:
    name: 8f12b8aa-1691-486b-9694-4a9c854e1a6b
    namespace: infra
status:
  atProvider:
    backendType: SECOND_GEN
    connectionName: redacted:europe-west1:infra-xplane-test-jkgk7
    gceZone: europe-west1-d
    ipAddresses:
    - ipAddress: 10.6.1.11
      type: PRIVATE
    project: redacted
    selfLink: https://www.googleapis.com/sql/v1beta4/projects/redacted/instances/infra-xplane-test-jkgk7
    serviceAccountEmailAddress: redacted@gcp-sa-cloud-sql.iam.gserviceaccount.com
    settingsVersion: 58
    state: RUNNABLE
  bindingPhase: Bound
  conditions:
  - lastTransitionTime: "2020-02-07T14:03:53Z"
    reason: Successfully resolved resource references to other resources
    status: "True"
    type: ReferencesResolved
  - lastTransitionTime: "2020-02-07T14:08:35Z"
    reason: Resource is available for use
    status: "True"
    type: Ready
  - lastTransitionTime: "2020-02-07T14:03:55Z"
    reason: Successfully reconciled resource
    status: "True"
    type: Synced

No settings were changed after the creation. Here's a screenshot from the console: Console screenshot

What environment did it happen in?

Crossplane version: 0.7.0 Stack GCP version: 0.5.0 Kubernetes version: 1.15.9

Let me know if you need any additional information!

hasheddan commented 4 years ago

@timstoop thanks for providing such a detailed description! I have self-assigned this issue and will make sure to follow-up here 👍

vasartori commented 4 years ago

@hasheddan To add more information: On stackdriver logging, a lot of this errors are raised:

{
 insertId: "lnj9zhe12c1s"  
 logName: "projects/tksvas6-service-0-45995/logs/cloudaudit.googleapis.com%2Factivity"  
 protoPayload: {
  @type: "type.googleapis.com/google.cloud.audit.AuditLog"   
  authenticationInfo: {
   principalEmail: "crossplane@tksvas6-45995.iam.gserviceaccount.com"    
  }
  authorizationInfo: [
   0: {
    granted: true     
    permission: "cloudsql.instances.update"     
    resource: "instances/tks-system-banco2-gfzlc"     
    resourceAttributes: {
    }
   }
  ]
  methodName: "cloudsql.instances.update"   
  request: {
   @type: "type.googleapis.com/google.cloud.sql.v1beta4.SqlInstancesPatchRequest"    
   body: {
    databaseVersion: "POSTGRES_11"     
    gceZone: "us-east1-b"     
    instanceType: "CLOUD_SQL_INSTANCE"     
    name: "tks-system-banco2-gfzlc"     
    region: "us-east1"     
    settings: {
     activationPolicy: "ALWAYS"      
     backupConfiguration: {
      startTime: "04:00"       
     }
     dataDiskSizeGb: "20"      
     dataDiskType: "PD_SSD"      
     databaseFlags: [
      0: {
       name: "max_connections"        
       value: "500"        
      }
     ]
     ipConfiguration: {
      ipv4Enabled: false       
      privateNetwork: "projects/tksvas6-45995/global/networks/tksvas6"       
     }
     locationPreference: {
      zone: "us-east1-b"       
     }
     pricingPlan: "PER_USE"      
     replicationType: "SYNCHRONOUS"      
     storageAutoResize: true      
     tier: "db-custom-1-3840"      
    }
   }
   instance: "tks-system-banco2-gfzlc"    
   project: "tksvas6-service-0-45995"    
  }
  requestMetadata: {
   callerIp: "gce-internal-ip"    
   destinationAttributes: {
   }
   requestAttributes: {
    auth: {
    }
    time: "2020-02-07T14:50:04.090Z"     
   }
  }
  resourceName: "instances/tks-system-banco2-gfzlc"   
  response: {
  }
  serviceName: "cloudsql.googleapis.com"   
  status: {
   code: 2    
   message: "UNKNOWN"    
  }
 }
 receiveTimestamp: "2020-02-07T14:50:04.398595480Z"  
 resource: {
  labels: {
   database_id: "tksvas6-service-0-45995:tks-system-banco2-gfzlc"    
   project_id: "tksvas6-service-0-45995"    
   region: "us-east1"    
  }
  type: "cloudsql_database"   
 }
 severity: "ERROR"  
 timestamp: "2020-02-07T14:50:04.086Z"  
}

Hope this information helps.

hasheddan commented 4 years ago

I have been working through this issue, which is being caused by our late initialization behavior, and I have discovered that the problem is not how we are doing late initialization, our functionality there actually makes sense. If we removed the current functionality we would end up setting fields in the spec of our managed resources to "" and other zero values. The problem is rather that we are using the same late initialization to do comparison, which is a fundamentally different operation.

In the first case, we are saying here is this mostly populated spec. If we did not provide a value in the spec and we see that it is set on the cloud provider, go ahead and update it here. The second case is a very different. We are saying here is a guaranteed empty spec. Please fill it with the values from the cloud provider.

The issue is that we are not respecting both Crossplane and the cloud provider's serialization habits. Even if a cloud provider has omitempty on a field that they return, we get the zero value rather than the missing field. However, if we then marshalled the struct from the cloud provider to JSON bytes, we get the exact representation that they intended for us to have. Now currently, we are trying to translate everything into "Crossplane-language" for comparison, but that doesn't make much sense. We know how we want to translate Crossplane to the cloud provider, but we don't know as much about how to translate the cloud provider to Crossplane. Furthermore, it creates extra work to try and do both translations when we really only need one. Therefore, I feel as though it makes much more sense to translate Crossplane to the cloud provider, then compare.

This is relatively easy with something like CloudSQL, where there is only a single patch operation:

Old:

func IsUpToDate(in *v1beta1.CloudSQLInstanceParameters, currentState sqladmin.DatabaseInstance) bool {
    currentParams := &v1beta1.CloudSQLInstanceParameters{}
    LateInitializeSpec(currentParams, currentState)
    return cmp.Equal(in, currentParams, cmpopts.IgnoreInterfaces(struct{ resource.AttributeReferencer }{}))
}

New (this needs to be cleaned up but you get the idea):

func IsUpToDate(in *v1beta1.CloudSQLInstanceParameters, currentState sqladmin.DatabaseInstance) bool {
    database := GenerateDatabaseInstance(*in, "test-sql")
    b, _ := json.Marshal(database)
    b2, _ := json.Marshal(&currentState)
    return cmp.Equal(b, b2)
}

You can see that we are now generating the GCP representation of CloudSQLInstance, then comparing, rather than trying to generate a Crossplane representation for comparison. By marshalling the structs to JSON, we take advantage of GCP's omitempty preferences, so we are allowing it to declare which zero values should be omitted.

The top negative here is that more complicated APIs (like GKECluster) require you to determine where the diff occurs such that the correct update method can be called. This is more challenging (but not impossible) when marshalling to JSON then comparing.

I have tested this out with the issues described above, and this appears to be functioning properly.

muvaf commented 4 years ago

@hasheddan Great summary, thank you!

One thing about json comparison is that by nature JSON is unordered. So, the marshal operation will probably give you the results in a specific order but it won't guarantee that, especially when the diff is large. So, we need to use an ordering option on cmp or something like Equate..(we have something similar somewhere).

muvaf commented 4 years ago

cmp.Equal will treat the marshaled JSONs as []byte and it doesn't seem like it has a special mechanism to actually see that they are JSON. I'm almost sure that there is no way to provide a reliable Option for that type's comparison, since it's pretty much raw data and this would lead some false positives when the ordering of the keys change while their name or value stay the same.

Also our spec is a subset of the provider's struct and the generated struct instance will not have, for example, the fields that we put under status. Like our generated provider object won't have creationTimestamp where as actual state object will and that will cause a false positive during comparison. So, this may not work in different cases.

What we could do is that we'd copy the retrieved provider struct instance and make GenerateDatabaseInstance function take it as parameter and work on that already filled object field by field. Then compare the resulting object with the fetched object. If there is any difference, then we'll see. However, this comes with a caveat that when we use GenerateDatabaseInstance to actually generate an object to be used in Patch or Insert, developer should call it via an empty DatabaseInstance object so that we don't try to send a status field like creationTimestamp in those requests, which could cause the request to fail because those are [Output Only].

New:

func IsUpToDate(in *v1beta1.CloudSQLInstanceParameters, currentState sqladmin.DatabaseInstance) bool {
        desired := &currentState.DeepCopy() // or some other method to get a deep copy
    GenerateDatabaseInstance(in, desired)
    return cmp.Equal(desired, currentState, cmpopts.EquateSliceOrderings)
}

// In other places
obj := &sqladmin.DatabaseInstance{}
GenerateDatabaseInstance(spec, obj)
if err := gcp.Insert(...); err != nil {return err} ...
hasheddan commented 4 years ago

@muvaf thanks for diving into this more! I think what you are proposing here is a drastically better solution. I'll do some experiments here and open up a preliminary PR for your review :)

b4nst commented 3 years ago

Why is this issue closed? I still can see this behavior happening

negz commented 3 years ago

@b4nst we believed we diagnosed and fixed the issue in https://github.com/crossplane/provider-gcp/pull/166, and have not had any recent reports over the last year to my knowledge. Please raise a new issue for the issue you're seeing, including examples of any XRs, Compositions, and the CloudSQLInstance you're using.