Closed timstoop closed 4 years ago
@timstoop thanks for providing such a detailed description! I have self-assigned this issue and will make sure to follow-up here 👍
@hasheddan To add more information: On stackdriver logging, a lot of this errors are raised:
{
insertId: "lnj9zhe12c1s"
logName: "projects/tksvas6-service-0-45995/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload: {
@type: "type.googleapis.com/google.cloud.audit.AuditLog"
authenticationInfo: {
principalEmail: "crossplane@tksvas6-45995.iam.gserviceaccount.com"
}
authorizationInfo: [
0: {
granted: true
permission: "cloudsql.instances.update"
resource: "instances/tks-system-banco2-gfzlc"
resourceAttributes: {
}
}
]
methodName: "cloudsql.instances.update"
request: {
@type: "type.googleapis.com/google.cloud.sql.v1beta4.SqlInstancesPatchRequest"
body: {
databaseVersion: "POSTGRES_11"
gceZone: "us-east1-b"
instanceType: "CLOUD_SQL_INSTANCE"
name: "tks-system-banco2-gfzlc"
region: "us-east1"
settings: {
activationPolicy: "ALWAYS"
backupConfiguration: {
startTime: "04:00"
}
dataDiskSizeGb: "20"
dataDiskType: "PD_SSD"
databaseFlags: [
0: {
name: "max_connections"
value: "500"
}
]
ipConfiguration: {
ipv4Enabled: false
privateNetwork: "projects/tksvas6-45995/global/networks/tksvas6"
}
locationPreference: {
zone: "us-east1-b"
}
pricingPlan: "PER_USE"
replicationType: "SYNCHRONOUS"
storageAutoResize: true
tier: "db-custom-1-3840"
}
}
instance: "tks-system-banco2-gfzlc"
project: "tksvas6-service-0-45995"
}
requestMetadata: {
callerIp: "gce-internal-ip"
destinationAttributes: {
}
requestAttributes: {
auth: {
}
time: "2020-02-07T14:50:04.090Z"
}
}
resourceName: "instances/tks-system-banco2-gfzlc"
response: {
}
serviceName: "cloudsql.googleapis.com"
status: {
code: 2
message: "UNKNOWN"
}
}
receiveTimestamp: "2020-02-07T14:50:04.398595480Z"
resource: {
labels: {
database_id: "tksvas6-service-0-45995:tks-system-banco2-gfzlc"
project_id: "tksvas6-service-0-45995"
region: "us-east1"
}
type: "cloudsql_database"
}
severity: "ERROR"
timestamp: "2020-02-07T14:50:04.086Z"
}
Hope this information helps.
I have been working through this issue, which is being caused by our late initialization behavior, and I have discovered that the problem is not how we are doing late initialization, our functionality there actually makes sense. If we removed the current functionality we would end up setting fields in the spec
of our managed resources to ""
and other zero values. The problem is rather that we are using the same late initialization to do comparison, which is a fundamentally different operation.
In the first case, we are saying here is this mostly populated spec
. If we did not provide a value in the spec and we see that it is set on the cloud provider, go ahead and update it here.
The second case is a very different. We are saying here is a guaranteed empty spec
. Please fill it with the values from the cloud provider.
The issue is that we are not respecting both Crossplane and the cloud provider's serialization habits. Even if a cloud provider has omitempty
on a field that they return, we get the zero value rather than the missing field. However, if we then marshalled the struct from the cloud provider to JSON bytes, we get the exact representation that they intended for us to have. Now currently, we are trying to translate everything into "Crossplane-language" for comparison, but that doesn't make much sense. We know how we want to translate Crossplane to the cloud provider, but we don't know as much about how to translate the cloud provider to Crossplane. Furthermore, it creates extra work to try and do both translations when we really only need one. Therefore, I feel as though it makes much more sense to translate Crossplane to the cloud provider, then compare.
This is relatively easy with something like CloudSQL, where there is only a single patch operation:
Old:
func IsUpToDate(in *v1beta1.CloudSQLInstanceParameters, currentState sqladmin.DatabaseInstance) bool {
currentParams := &v1beta1.CloudSQLInstanceParameters{}
LateInitializeSpec(currentParams, currentState)
return cmp.Equal(in, currentParams, cmpopts.IgnoreInterfaces(struct{ resource.AttributeReferencer }{}))
}
New (this needs to be cleaned up but you get the idea):
func IsUpToDate(in *v1beta1.CloudSQLInstanceParameters, currentState sqladmin.DatabaseInstance) bool {
database := GenerateDatabaseInstance(*in, "test-sql")
b, _ := json.Marshal(database)
b2, _ := json.Marshal(¤tState)
return cmp.Equal(b, b2)
}
You can see that we are now generating the GCP representation of CloudSQLInstance, then comparing, rather than trying to generate a Crossplane representation for comparison. By marshalling the structs to JSON, we take advantage of GCP's omitempty
preferences, so we are allowing it to declare which zero values should be omitted.
The top negative here is that more complicated APIs (like GKECluster
) require you to determine where the diff occurs such that the correct update method can be called. This is more challenging (but not impossible) when marshalling to JSON then comparing.
I have tested this out with the issues described above, and this appears to be functioning properly.
@hasheddan Great summary, thank you!
One thing about json comparison is that by nature JSON is unordered. So, the marshal operation will probably give you the results in a specific order but it won't guarantee that, especially when the diff is large. So, we need to use an ordering option on cmp
or something like Equate..
(we have something similar somewhere).
cmp.Equal
will treat the marshaled JSONs as []byte
and it doesn't seem like it has a special mechanism to actually see that they are JSON. I'm almost sure that there is no way to provide a reliable Option
for that type's comparison, since it's pretty much raw data and this would lead some false positives when the ordering of the keys change while their name or value stay the same.
Also our spec
is a subset of the provider's struct and the generated struct instance will not have, for example, the fields that we put under status
. Like our generated provider object won't have creationTimestamp
where as actual state object will and that will cause a false positive during comparison. So, this may not work in different cases.
What we could do is that we'd copy the retrieved provider struct instance and make GenerateDatabaseInstance
function take it as parameter and work on that already filled object field by field. Then compare the resulting object with the fetched object. If there is any difference, then we'll see. However, this comes with a caveat that when we use GenerateDatabaseInstance
to actually generate an object to be used in Patch
or Insert
, developer should call it via an empty DatabaseInstance
object so that we don't try to send a status field like creationTimestamp
in those requests, which could cause the request to fail because those are [Output Only]
.
New:
func IsUpToDate(in *v1beta1.CloudSQLInstanceParameters, currentState sqladmin.DatabaseInstance) bool {
desired := ¤tState.DeepCopy() // or some other method to get a deep copy
GenerateDatabaseInstance(in, desired)
return cmp.Equal(desired, currentState, cmpopts.EquateSliceOrderings)
}
// In other places
obj := &sqladmin.DatabaseInstance{}
GenerateDatabaseInstance(spec, obj)
if err := gcp.Insert(...); err != nil {return err} ...
@muvaf thanks for diving into this more! I think what you are proposing here is a drastically better solution. I'll do some experiments here and open up a preliminary PR for your review :)
Why is this issue closed? I still can see this behavior happening
@b4nst we believed we diagnosed and fixed the issue in https://github.com/crossplane/provider-gcp/pull/166, and have not had any recent reports over the last year to my knowledge. Please raise a new issue for the issue you're seeing, including examples of any XRs, Compositions, and the CloudSQLInstance you're using.
What happened?
After creating a CloudSQLInstance, the system keeps trying to make modifications to it. This can be seen in the console itself (the instances are perpetually in a state of being modified) and in the
state.atProvider.settingsVersion
, which keeps incrementing every few seconds. Sadly, there's nothing logged in stack-gcp or the other crossplane Pods that give any indication what it's trying to do.How can we reproduce it?
This is the resource we created:
No settings were changed after the creation. Here's a screenshot from the console:
What environment did it happen in?
Crossplane version: 0.7.0 Stack GCP version: 0.5.0 Kubernetes version: 1.15.9
Let me know if you need any additional information!