GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0
895 stars 229 forks source link

SQL entities updated every 10/20 minutes for no apparent reason #153

Closed marcoderama closed 4 years ago

marcoderama commented 4 years ago

I have a SQLInstance with one SQLDatabase and one SQLUser. The SQLUser password is specified via a k8s Secret. All is good. If I just let the cluster sit idle, the cnrm controller reconciles the entities every 10 minutes or so. I would expect the reconciliation to detect that nothing has changed and move on. Instead, the SQLUser is "updated" somehow. (This is confirmed by watching the state toggle briefly between UpToDate and Updating, and by log messages in the controller.) About every 20 mins the same happens to the SQLInstance.

If I look at the Operations page for the SQLInstance in the GCP console, I see repeated sequences like the following. The first corresponds with the SQLInstance "change".:

Apr 23, 2020, 1:33:51 PM | Update user | Password changed
Apr 23, 2020, 1:23:40 PM | Update user | Password changed
Apr 23, 2020, 1:21:30 PM | Update | Update finished

At the very least, it's disconcerting to see those "Password changed" messages. (There are pages of them for long-lived instances.) My operator is normally watching these entities as well. When the SQLUser goes to "Updating", my controller ends up setting the status of my controlled object to "NotReady". This is annoying and may be causing us other issues.

I've confirmed with multiple kubectl get X -w windows that nothing other than the status appears to change for the SQLInstance or SQLUser. Nothing changes for the k8s password Secret.

Is this expected behavior? And just to be clear, I've observed this on a quiet dev cluster without my operator running. It also happens on a GKE cluster.

This is using Config Connector 1.6.0 or 1.7.1 on a dev kind cluster and on GKE clusters.

caieo commented 4 years ago

Hi @marcoderama,

Just to make sure I'm understanding your question correctly, you're wondering if the cnrm controller should be toggling between ready status (ready status = the status that's in the ready condition, status.conditions.status) "True" and ready status "False" as SQLUser goes between "Updating"/"UpToDate"? This is expected behavior because when the controller is in the middle of updating a resource, then the resource's ready status should reflect that it is not ready. The ready status of resources managed by Config Connector will be either "True"/"False" depending on whether or not the resource is "UpToDate". Let me know if this answers your question/have follow-up questions.

marcoderama commented 4 years ago

Not really. My question is why is the controller updating the SQLUser every 10 minutes (and, it seems the SQLInstance every 20). The password didn't really change as far as I can tell. I can understand the reconciler looking in periodically, just not why it's making a change to the objects in this case.

It's a problem because while they're not "UpToDate", my controllers that are also watching them get triggered and have to change their state to "NotReady". I naively expect that once the SQLUser object is setup, it shouldn't have to be updated again, let alone every 10 minutes.

marcoderama commented 4 years ago

Is there any update on this?

spew commented 4 years ago

Hi @marcoderama, have you been able to reproduce this issue when NOT using k8s secrets? Or does it only occur when sourcing the password value from a secret?

marcoderama commented 4 years ago

Sorry. I haven't tried that. Never looked back once we got things working with secrets. :-)

I'll see if I can set up a test...

marcoderama commented 4 years ago

@spew, I can confirm that I get the same behavior when using a vanilla password instead of the k8s secret.

spew commented 4 years ago

Great, thank you for reporting and doing the extra analysis, we will investigate this one.

marcoderama commented 4 years ago

Just in case it might be useful, here's what I have for SQLInstance, SQLDatabase and SQLUser. I've anonymized and decrufted. As for the SQLInstance, I think the only things I actually specify in my yaml are region, tier, and privateNetworkRef.

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: resource
    cnrm.cloud.google.com/project-id: mygcpproject
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 79
  name: theinstance
  namespace: thenamespace
  ownerReferences: [my-controller]
spec:
  databaseVersion: POSTGRES_11
  region: us-east1
  settings:
    activationPolicy: ALWAYS
    backupConfiguration:
      startTime: "15:00"
    diskAutoresize: true
    diskSize: 10
    diskType: PD_SSD
    ipConfiguration:
      privateNetworkRef:
        external: projects/mygcpproject/global/networks/dev-network
    locationPreference:
      zone: us-east1-b
    pricingPlan: PER_USE
    replicationType: SYNCHRONOUS
    tier: db-custom-1-3840
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLDatabase
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: mygcpproject
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 15
  name: thedatabase
  namespace: thenamespace
  ownerReferences: [my-controller]
spec:
  charset: UTF8
  collation: en_US.UTF8
  instanceRef:
    name: theinstance
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: mygcpproject
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 150
  name: theuser
  namespace: thenamespace
  ownerReferences: [my-controller]
spec:
  instanceRef:
    name: theinstance
  password:
    value: changeMeSomehow
spew commented 4 years ago

Thanks, that is super useful.

dinvlad commented 4 years ago

We're seeing this issue as well, and also with password passed through a secret.

marcoderama commented 4 years ago

I see a similar complaint in https://github.com/GoogleCloudPlatform/k8s-config-connector/issues/164, which appears to have been effectively closed as "won't fix".

My problem with that "solution" is that it appears that while the lease is being renewed, the resource reports a status of NotReady. This is a concern for any controller that is watching that resource.

Can the lease be updated in such a way that it doesn't toggle the Ready status of the resource? (Maybe if it was stored in Status instead of a label?)

If not, is there a workaround? Would setting the cnrm.cloud.google.com/management-conflict-prevention-policy label to none make this go away? (Assuming that's appropriate for one's setup.)

AlexBulankou commented 4 years ago

@marcoderama , @dinvlad , thanks for the update. We continue to investigate the issue on our side and understand the issues in the underlying Google SQL API.

Can you please include include more details on how this is impacting your scenario - what logs are being flooded and what audit scenarios it can potentially be affecting.

Additionally, if are working with Google support or Google customer engineer, please ask them reach out to Config Connector team so we can handle this through the support channel and provide more context.

marcoderama commented 4 years ago

@AlexBulankou, for me, there are two issues.

1: If I look in the GCP console at the Operations page for a SQLInstance that's managed by Config Connector, I see repeated sequences like the following.:

Apr 23, 2020, 1:33:51 PM | Update user | Password changed
Apr 23, 2020, 1:23:40 PM | Update user | Password changed
Apr 23, 2020, 1:21:30 PM | Update | Update finished

For one of my instances I looked at, there are over 26000 lines of this stuff now. At the very least, this clutter makes it difficult to track down any actually interesting entries. On top of that, the "Password changed" message is disconcerting and, imho, wrong. I certainly didn't change any password so I spent a lot of time tracking this down to see if things were breaking because credentials had changed. (Others might wonder if there was a security issue.)

2: I have my own operator that creates and manages several object types, including SQLInstances, SQLDatabases, and SQLUsers. The status of my CR is a function of the status of its child objects. Every 10 minutes though, the SQLUser toggles between "Ready" and "NotReady", thus my CR goes to "NotReady" every 10 minutes as well. My system is therefore "down" briefly every 10 minutes. Users of my CR don't understand why things don't just work smoothly.

The latter issue is the more important one. Trying to figure out why my system was NotReady every 10 minutes is what led me down this path. Given that everything is actually fine with the SQLUser (and SQLInstance) (and that no password ever changed afaict), this seems like broken behavior to me.

AlexBulankou commented 4 years ago

Thanks @marcoderama - we are working on the resolution now and will continue to provide regular updates until this is resolved. If you are working with Google support or Google customer engineer, please ask them reach out to Config Connector team so we can provide additional context there.

sryabkov commented 4 years ago

We are suffering from the same issue. Every ~10 min, the SQL instance managed by KCC gets "updated" even though nothing changes:

We are using Config Connector 1.11

image

caieo commented 4 years ago

Hi @sryabkov, it appears something in SQL is triggering an update even though you're not trying to update anything. The current workaround to stop triggering this update is to set "management-conflict-prevention-policy" to none (you can also set it by directly annotating the namespace). Here is a link to our public documentation on this annotation.

sryabkov commented 4 years ago

Thank you, @caieo. Setting metadata.annotations.cnrm.cloud.google.com/management-conflict-prevention-policy: none on SQLInstance did not help. Updates every ~10 min continue after I did this.

jcanseco commented 4 years ago

Hi all, we put out a fix in 1.19.0 to prevent SQLUser from constantly updating despite there being no changes. Can you please try it out and see if it works for you? We apologize for the time this has taken; this has turned out to be a significantly more complex issue than initially expected.

jcanseco commented 4 years ago

@sryabkov, I apologize that we have not responded to your reply. Would you be able to share your SQLInstance YAML; it would help us determine what is potentially causing the constant updates.

sryabkov commented 4 years ago

@jcanseco, no worries, it looks like upgrade to Config Connector 1.17 (from 1.13) might have fixed the issue. I can see that updates to SQL instances stopped happening right after we upgraded CC on August 11th.

jcanseco commented 4 years ago

Gotcha, thanks @sryabkov. I'm very glad to hear that your issue was resolved in the end. For any issues in the future, please never hesitate to ask for follow-ups anytime.

marcoderama commented 4 years ago

Hmmm. I've just tested this with Config Connector 1.20.0 and my SQLInstance is still updating every 20 minutes. Anything I should be looking at?

jcanseco commented 4 years ago

Hi @marcoderama, can you confirm that your operations logs show "Update finished", but not "Password changed"?

The fix we put out in 1.19.0 prevents SQLUser from being constantly updated due to the password, but it sounds like your SQLInstance is being constantly updated due to the conflict prevention mechanism. You'll want to set the conflict prevention annotation to none for your SQLInstance like so:

metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: "none"

For reference, this page talks more about KCC's conflict prevention mechanism.

dinvlad commented 4 years ago

We're on 1.19.1 (as GKE plugin) and are still getting this too - do you mean that cnrm.cloud.google.com/management-conflict-prevention-policy should be added in all cases specifically for SQLInstance from now on?

marcoderama commented 4 years ago

I can confirm that I'm getting "Update finished", and not "Password changed", in the logs now.

I can also confirm that setting the conflict-prevention-policy annotation to none on the SQLInstance makes even that go away.

That said, I've read the doco and don't quite understand... It seems to me that the conflict prevention policy is required in the case that one creates

two Config Connector resources in different namespaces that manage the same Google Cloud resource

I didn't (knowingly) do that, so can you help me understand why the annotation is needed? I can live with adding it. I'd just like to understand why it's needed.

sryabkov commented 4 years ago

@jcanseco, no worries, it looks like upgrade to Config Connector 1.17 (from 1.13) might have fixed the issue. I can see that updates to SQL instances stopped happening right after we upgraded CC on August 11th.

@jcanseco The issue was NOT resolved

With Config Connector 1.17, the instances were failing to update, and I didn't notice that. The updates were failing seemingly due to a bug in the handling of spec.settings.ipConfiguration.privateNetworkRef.external. We had it set to default, and the error message in the status field seemed to indicate that a selfLink was expected. In Config Connector 1.20, that bug seems to have been fixed because Config Connector is happy with a simple network name again. However, with Config Connector 1.20 our PG instances are once again being updated every 10 min. So, I can confirm @marcoderama's observations.

dinvlad commented 4 years ago

@marcoderama Interestingly, our Password changed messages disappeared today after I manually upgraded the GKE clusters (due to an unrelated CVE), which are now on 1.19.1 CC GKE plugin.

Interestingly, the new Update finished messages are also popping up every 15-20 minutes or so, so I'm curious what's going on here..

jcanseco commented 4 years ago

I didn't (knowingly) do that, so can you help me understand why the annotation is needed? I can live with adding it. I'd just like to understand why it's needed.

@marcoderama, @dinvlad: setting cnrm.cloud.google.com/management-conflict-prevention-policy to resource tells KCC that it needs to acquire a "lease" on the GCP resource before being able to update it. This is useful when a GCP resource is being managed from more than one KCC instance (e.g. if you have multiple K8s clusters with KCC installed managing the same resource, or if you're in namespaced-mode and you have multiple namespaces managing the same resource). Without the lease, you can end up in a scenario where you have multiple KCC instances "fighting" over the same resource (e.g. updating it back and forth between different configurations).

Management conflict prevention is enabled by default for resources that support it to avoid the "fighting" scenario. If you're only managing the resource from one KCC instance, then there is no issue in setting the annotation to none.

If you need support for management conflict prevention without the K8s resource entering a NotReady condition, please let us know and we can prioritize investigating the feasibility of doing that. However, there is likely no way to avoid logging Update finished on GCP-side since KCC needs to update the GCP resource's labels to refresh the lease.

However, with Config Connector 1.20 our PG instances are once again being updated every 10 min. So, I can confirm @marcoderama's observations.

@sryabkov: gotcha. To clarify, your observations are exactly the same as @marcoderama's right? (i.e. no "Password changed" logs, just "Update finished")?

marcoderama commented 4 years ago

Management conflict prevention is enabled by default for resources that support it to avoid the "fighting" scenario. If you're only managing the resource from one KCC instance, then there is no issue in setting the annotation to none.

That explains things for me. Thanks. I had the default semantics backwards in my head for some reason.

dinvlad commented 4 years ago

Yes, thanks for the explanation @jcanseco !

jcanseco commented 4 years ago

No problem, @marcoderama and @dinvlad. We're always happy to help. Please let us know if you have any other issues :)

sryabkov commented 4 years ago

@jcanseco

@sryabkov: gotcha. To clarify, your observations are exactly the same as @marcoderama's right? (i.e. no "Password changed" logs, just "Update finished")?

Just to clarify, I was talking about the SQLInstance resource, and here is what I am observing:

image

jcanseco commented 4 years ago

@sryabkov: thanks, that looks right to me. The operations logs for a SQL instance shows whenever the instance is updated and whenever a user's password is changed. The fact that you're not seeing "Password changed" logs means you either (1) don't have SQL users for this instance (unlikely), or (2) you do have SQL users and that the fix we put out in 1.19.0 to prevent SQL users from going through constant password updates works.

The fact that you're seeing "Update finished" constantly means that your SQL instance is going through constant updates -- a separate issue from the one fixed in 1.19.0. I suspect the reason your instance is constantly updating is the same as in this comment. The fix should also be the same provided you're ok with disabling management conflict prevention for your SQL instances, which I elaborate on in another comment.

dinvlad commented 4 years ago

@jcanseco I'm pretty sure we don't have any conflicts, but based on your other comment

However, there is likely no way to avoid logging Update finished on GCP-side since KCC needs to update the GCP resource's labels to refresh the lease.

my understanding is that this is just normal behavior, irregardless of if we had conflicts or not. Is that correct?

jcanseco commented 4 years ago

my understanding is that this is just normal behavior, irregardless of if we had conflicts or not. Is that correct?

@dinvlad, are you asking if the SQL operations logs will show "Update finished" any time the instance is updated, regardless how many KCC instances are updating the SQL instance and regardless if KCC's management conflict prevention mechanism is enabled/disabled? If so, then the answer is yes.

sryabkov commented 4 years ago

@jcanseco Is it accurate that setting metadata.annotations.cnrm.cloud.google.com/management-conflict-prevention-policy: none stops resource labels, which are used to manage KCC leases, from being created/updated, which in turn stops the instances from being constantly updated?

jcanseco commented 4 years ago

@sryabkov: yes that is correct

dinvlad commented 4 years ago

OK, that makes sense @jcanseco

marcoderama commented 4 years ago

I'm okay if you close this one now.

jcanseco commented 4 years ago

Thanks!