Hi @marcoderama,
Just to make sure I'm understanding your question correctly: you're wondering whether the cnrm controller should be toggling between ready status "True" and ready status "False" (where "ready status" means the status in the Ready condition, i.e. status.conditions.status) as the SQLUser goes between "Updating"/"UpToDate"? This is expected behavior: when the controller is in the middle of updating a resource, the resource's ready status should reflect that it is not ready. The ready status of resources managed by Config Connector will be either "True" or "False" depending on whether or not the resource is "UpToDate". Let me know if this answers your question or if you have follow-up questions.
Not really. My question is why the controller is updating the SQLUser every 10 minutes (and, it seems, the SQLInstance every 20). The password didn't really change as far as I can tell. I can understand the reconciler looking in periodically, just not why it's making a change to the objects in this case.
It's a problem because while they're not "UpToDate", my controllers that are also watching them get triggered and have to change their state to "NotReady". I naively expect that once the SQLUser object is set up, it shouldn't have to be updated again, let alone every 10 minutes.
Is there any update on this?
Hi @marcoderama, have you been able to reproduce this issue when NOT using k8s secrets? Or does it only occur when sourcing the password value from a secret?
Sorry. I haven't tried that. Never looked back once we got things working with secrets. :-)
I'll see if I can set up a test...
@spew, I can confirm that I get the same behavior when using a vanilla password instead of the k8s secret.
Great, thank you for reporting and doing the extra analysis, we will investigate this one.
Just in case it might be useful, here's what I have for SQLInstance, SQLDatabase and SQLUser. I've anonymized and decrufted. As for the SQLInstance, I think the only things I actually specify in my yaml are region, tier, and privateNetworkRef.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: resource
    cnrm.cloud.google.com/project-id: mygcpproject
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 79
  name: theinstance
  namespace: thenamespace
  ownerReferences: [my-controller]
spec:
  databaseVersion: POSTGRES_11
  region: us-east1
  settings:
    activationPolicy: ALWAYS
    backupConfiguration:
      startTime: "15:00"
    diskAutoresize: true
    diskSize: 10
    diskType: PD_SSD
    ipConfiguration:
      privateNetworkRef:
        external: projects/mygcpproject/global/networks/dev-network
    locationPreference:
      zone: us-east1-b
    pricingPlan: PER_USE
    replicationType: SYNCHRONOUS
    tier: db-custom-1-3840
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLDatabase
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: mygcpproject
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 15
  name: thedatabase
  namespace: thenamespace
  ownerReferences: [my-controller]
spec:
  charset: UTF8
  collation: en_US.UTF8
  instanceRef:
    name: theinstance
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: mygcpproject
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 150
  name: theuser
  namespace: thenamespace
  ownerReferences: [my-controller]
spec:
  instanceRef:
    name: theinstance
  password:
    value: changeMeSomehow
Thanks, that is super useful.
We're seeing this issue as well, and also with password passed through a secret.
I see a similar complaint in https://github.com/GoogleCloudPlatform/k8s-config-connector/issues/164, which appears to have been effectively closed as "won't fix".
My problem with that "solution" is that it appears that while the lease is being renewed, the resource reports a status of NotReady. This is a concern for any controller that is watching that resource.
Can the lease be updated in such a way that it doesn't toggle the Ready status of the resource? (Maybe if it was stored in Status instead of a label?)
If not, is there a workaround? Would setting the cnrm.cloud.google.com/management-conflict-prevention-policy annotation to none make this go away? (Assuming that's appropriate for one's setup.)
@marcoderama, @dinvlad, thanks for the update. We continue to investigate the issue on our side and to understand the issues in the underlying Google SQL API.
Can you please include more details on how this is impacting your scenario - what logs are being flooded and what audit scenarios it could potentially be affecting?
Additionally, if you are working with Google support or a Google customer engineer, please ask them to reach out to the Config Connector team so we can handle this through the support channel and provide more context.
@AlexBulankou, for me, there are two issues.
1: If I look in the GCP console at the Operations page for a SQLInstance that's managed by Config Connector, I see repeated sequences like the following:
Apr 23, 2020, 1:33:51 PM | Update user | Password changed
Apr 23, 2020, 1:23:40 PM | Update user | Password changed
Apr 23, 2020, 1:21:30 PM | Update | Update finished
For one of the instances I looked at, there are over 26,000 lines of this stuff now. At the very least, this clutter makes it difficult to track down any actually interesting entries. On top of that, the "Password changed" message is disconcerting and, imho, wrong. I certainly didn't change any password, so I spent a lot of time tracking this down to see if things were breaking because credentials had changed. (Others might wonder if there was a security issue.)
2: I have my own operator that creates and manages several object types, including SQLInstances, SQLDatabases, and SQLUsers. The status of my CR is a function of the status of its child objects. Every 10 minutes though, the SQLUser toggles between "Ready" and "NotReady", thus my CR goes to "NotReady" every 10 minutes as well. My system is therefore "down" briefly every 10 minutes. Users of my CR don't understand why things don't just work smoothly.
The latter issue is the more important one. Trying to figure out why my system was NotReady every 10 minutes is what led me down this path. Given that everything is actually fine with the SQLUser and SQLInstance (and that no password ever changed, afaict), this seems like broken behavior to me.
Thanks @marcoderama - we are working on the resolution now and will continue to provide regular updates until this is resolved. If you are working with Google support or a Google customer engineer, please ask them to reach out to the Config Connector team so we can provide additional context there.
We are suffering from the same issue. Every ~10 min, the SQL instance managed by KCC gets "updated" even though nothing changes.
We are using Config Connector 1.11
Hi @sryabkov, it appears something in SQL is triggering an update even though you're not trying to update anything. The current workaround to stop triggering this update is to set "management-conflict-prevention-policy" to none (you can also set it by directly annotating the namespace). Here is a link to our public documentation on this annotation.
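(As a sketch of the namespace-level option mentioned above, assuming the namespace name thenamespace from the earlier YAML; per the comment above, annotating the namespace has the same effect for Config Connector resources in that namespace.)
apiVersion: v1
kind: Namespace
metadata:
  name: thenamespace   # namespace name reused from the earlier example, purely illustrative
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none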
Thank you, @caieo. Setting metadata.annotations.cnrm.cloud.google.com/management-conflict-prevention-policy: none on SQLInstance did not help. Updates every ~10 min continue after I did this.
Hi all, we put out a fix in 1.19.0 to prevent SQLUser from constantly updating despite there being no changes. Can you please try it out and see if it works for you? We apologize for the time this has taken; this has turned out to be a significantly more complex issue than initially expected.
@sryabkov, I apologize that we have not responded to your reply. Would you be able to share your SQLInstance YAML? It would help us determine what is potentially causing the constant updates.
@jcanseco, no worries, it looks like the upgrade to Config Connector 1.17 (from 1.13) might have fixed the issue. I can see that updates to SQL instances stopped happening right after we upgraded CC on August 11th.
Gotcha, thanks @sryabkov. I'm very glad to hear that your issue was resolved in the end. For any issues in the future, please don't hesitate to reach out anytime.
Hmmm. I've just tested this with Config Connector 1.20.0 and my SQLInstance is still updating every 20 minutes. Anything I should be looking at?
Hi @marcoderama, can you confirm that your operations logs show "Update finished", but not "Password changed"?
The fix we put out in 1.19.0 prevents SQLUser from being constantly updated due to the password, but it sounds like your SQLInstance is being constantly updated due to the conflict prevention mechanism. You'll want to set the conflict prevention annotation to none for your SQLInstance like so:
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: "none"
For reference, this page talks more about KCC's conflict prevention mechanism.
We're on 1.19.1 (as GKE plugin) and are still getting this too - do you mean that cnrm.cloud.google.com/management-conflict-prevention-policy should be added in all cases specifically for SQLInstance from now on?
I can confirm that I'm getting "Update finished", and not "Password changed", in the logs now.
I can also confirm that setting the conflict-prevention-policy annotation to none on the SQLInstance makes even that go away.
That said, I've read the docs and don't quite understand... It seems to me that the conflict prevention policy is required in the case that one creates "two Config Connector resources in different namespaces that manage the same Google Cloud resource".
I didn't (knowingly) do that, so can you help me understand why the annotation is needed? I can live with adding it. I'd just like to understand why it's needed.
@jcanseco, no worries, it looks like the upgrade to Config Connector 1.17 (from 1.13) might have fixed the issue. I can see that updates to SQL instances stopped happening right after we upgraded CC on August 11th.
@jcanseco The issue was NOT resolved
With Config Connector 1.17, the instances were failing to update, and I didn't notice that. The updates were failing seemingly due to a bug in the handling of spec.settings.ipConfiguration.privateNetworkRef.external. We had it set to default, and the error message in the status field seemed to indicate that a selfLink was expected. In Config Connector 1.20, that bug seems to have been fixed because Config Connector is happy with a simple network name again. However, with Config Connector 1.20 our PG instances are once again being updated every 10 min. So, I can confirm @marcoderama's observations.
@marcoderama Interestingly, our Password changed messages disappeared today after I manually upgraded the GKE clusters (due to an unrelated CVE), which are now on the 1.19.1 CC GKE plugin.
Interestingly, the new Update finished messages are also popping up every 15-20 minutes or so, so I'm curious what's going on here...
I didn't (knowingly) do that, so can you help me understand why the annotation is needed? I can live with adding it. I'd just like to understand why it's needed.
@marcoderama, @dinvlad: setting cnrm.cloud.google.com/management-conflict-prevention-policy to resource tells KCC that it needs to acquire a "lease" on the GCP resource before being able to update it. This is useful when a GCP resource is being managed from more than one KCC instance (e.g. if you have multiple K8s clusters with KCC installed managing the same resource, or if you're in namespaced-mode and you have multiple namespaces managing the same resource). Without the lease, you can end up in a scenario where you have multiple KCC instances "fighting" over the same resource (e.g. updating it back and forth between different configurations).
Management conflict prevention is enabled by default for resources that support it to avoid the "fighting" scenario. If you're only managing the resource from one KCC instance, then there is no issue in setting the annotation to none.
If you need support for management conflict prevention without the K8s resource entering a NotReady condition, please let us know and we can prioritize investigating the feasibility of doing that. However, there is likely no way to avoid logging Update finished on GCP-side since KCC needs to update the GCP resource's labels to refresh the lease.
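(To make the lease refresh concrete: the renewal rewrites labels on the underlying Google Cloud resource, which is why GCP records an update operation each time. The sketch below shows the general shape only; the label keys and values are assumptions for illustration, not taken from this thread.)
# Illustrative only - the kind of labels KCC maintains on the GCP resource to hold the lease.
# The exact label keys and value formats here are assumptions.
labels:
  cnrm-lease-holder-id: example-kcc-instance-id   # hypothetical ID of the KCC instance holding the lease
  cnrm-lease-expiration: "1598382000"             # hypothetical expiry timestamp, rewritten on each refresh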
However, with Config Connector 1.20 our PG instances are once again being updated every 10 min. So, I can confirm @marcoderama's observations.
@sryabkov: gotcha. To clarify, your observations are exactly the same as @marcoderama's, right? (i.e. no "Password changed" logs, just "Update finished")
Management conflict prevention is enabled by default for resources that support it to avoid the "fighting" scenario. If you're only managing the resource from one KCC instance, then there is no issue in setting the annotation to none.
That explains things for me. Thanks. I had the default semantics backwards in my head for some reason.
Yes, thanks for the explanation @jcanseco !
No problem, @marcoderama and @dinvlad. We're always happy to help. Please let us know if you have any other issues :)
@jcanseco
@sryabkov: gotcha. To clarify, your observations are exactly the same as @marcoderama's, right? (i.e. no "Password changed" logs, just "Update finished")
Just to clarify, I was talking about the SQLInstance resource, and here is what I am observing:
@sryabkov: thanks, that looks right to me. The operations logs for a SQL instance show whenever the instance is updated and whenever a user's password is changed. The fact that you're not seeing "Password changed" logs means either (1) you don't have SQL users for this instance (unlikely), or (2) you do have SQL users and the fix we put out in 1.19.0 to prevent SQL users from going through constant password updates is working.
The fact that you're seeing "Update finished" constantly means that your SQL instance is going through constant updates -- a separate issue from the one fixed in 1.19.0. I suspect the reason your instance is constantly updating is the same as in this comment. The fix should also be the same provided you're ok with disabling management conflict prevention for your SQL instances, which I elaborate on in another comment.
@jcanseco I'm pretty sure we don't have any conflicts, but based on your other comment
However, there is likely no way to avoid logging Update finished on GCP-side since KCC needs to update the GCP resource's labels to refresh the lease.
my understanding is that this is just normal behavior, regardless of whether we had conflicts or not. Is that correct?
my understanding is that this is just normal behavior, regardless of whether we had conflicts or not. Is that correct?
@dinvlad, are you asking if the SQL operations logs will show "Update finished" any time the instance is updated, regardless of how many KCC instances are updating the SQL instance and regardless of whether KCC's management conflict prevention mechanism is enabled or disabled? If so, then the answer is yes.
@jcanseco Is it accurate that setting metadata.annotations.cnrm.cloud.google.com/management-conflict-prevention-policy: none stops resource labels, which are used to manage KCC leases, from being created/updated, which in turn stops the instances from being constantly updated?
@sryabkov: yes, that is correct.
OK, that makes sense @jcanseco
I'm okay if you close this one now.
Thanks!
I have a SQLInstance with one SQLDatabase and one SQLUser. The SQLUser password is specified via a k8s Secret. All is good. If I just let the cluster sit idle, the cnrm controller reconciles the entities every 10 minutes or so. I would expect the reconciliation to detect that nothing has changed and move on. Instead, the SQLUser is "updated" somehow. (This is confirmed by watching the state toggle briefly between UpToDate and Updating, and by log messages in the controller.) About every 20 mins the same happens to the SQLInstance. If I look at the Operations page for the SQLInstance in the GCP console, I see repeated sequences like the following; the first corresponds with the SQLInstance "change":
At the very least, it's disconcerting to see those "Password changed" messages. (There are pages of them for long-lived instances.) My operator is normally watching these entities as well. When the SQLUser goes to "Updating", my controller ends up setting the status of my controlled object to "NotReady". This is annoying and may be causing us other issues.
I've confirmed with multiple kubectl get X -w windows that nothing other than the status appears to change for the SQLInstance or SQLUser. Nothing changes for the k8s password Secret. Is this expected behavior? And just to be clear, I've observed this on a quiet dev cluster without my operator running. It also happens on a GKE cluster.
This is using Config Connector 1.6.0 or 1.7.1 on a dev kind cluster and on GKE clusters.