Open Jonpez2 opened 4 years ago
Hey @Jonpez2 , as you noted, there is some strange behavior on the underlying IAM API.
In the case of Config Connector, I am unable to reproduce this specific issue. I verified just now that, if both the service account and its IAM permission are represented as CC YAML, deleting both gives the expected behavior of deleting the service account and the permission, regardless of the order in which you issue the delete requests to the cluster (or which one CC happens to delete first, if both deletes are issued simultaneously). Similarly, recreating both from the same YAML gives the expected behavior of attaching the permission to the new service account; if the policy member happens to be created first, it will fail to attach the permission until the new service account exists.
Would you be able to give sample YAML and reproduction steps?
It's intermittent, and I can't figure out what exactly caused it. I had an issue opened with the GCP folk about it, so there's a support person in there who was able to narrow down times and events very quickly to tell me what was going wrong. Is there a way I can connect you with them? That might be the most efficient approach...
We're currently running some tests to see if we can catch it intermittently; we'll update with our findings. Could you tell support to reach out to the cnrm-oncall?
Will do
I believe support reached out and you guys are digging - anything to report?
Thanks for the time!
Hi @Jonpez2, after some testing, we successfully reproduced the issue. It is intermittent, so it is unlikely to happen most of the time. To avoid hitting it entirely, you can create the resources in the following order:
- Create a GSA (Google service account) using Config Connector, and make sure it is Ready.
- Create an IAMPolicyMember for the GSA.
Meanwhile, we plan to look into a fix in Config Connector. We will let you know if there is any update.
Yeah manually working around it will work for the moment, but I'm looking forward to removing any error-prone manual steps... Thanks very much for your help!
Any luck solving this? Just don't want it to bite us in prod as we move towards fully automated deployment.
Thank you!
Hi @Jonpez2, we have plans to put out a proper fix, but it won't be out for some time. We can only recommend workarounds for now:
- Wait for the IAMServiceAccount to be ready before creating the IAMPolicyMember. If you want to do this in an automated fashion, you can use kubectl wait --for=condition=Ready IAMServiceAccount [RESOURCE_NAME].
Do you think either of these workarounds could work for your use-case?
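The wait-then-create workaround above could be scripted roughly as follows. This is only a sketch: the function name, argument order, default timeout, and manifest file names are all hypothetical, and it assumes the resources live in the current kubectl namespace. It uses the `--for=condition=Ready` form of the `kubectl wait` flag.

```shell
# Apply the service account, block until Config Connector marks it Ready,
# and only then apply the policy member.
# Arguments (hypothetical, chosen for this sketch):
#   $1  path to the IAMServiceAccount manifest
#   $2  metadata.name of the IAMServiceAccount
#   $3  path to the IAMPolicyMember manifest
#   $4  optional timeout for the wait (defaults to 120s)
apply_gsa_then_binding() {
  local gsa_manifest="$1" gsa_name="$2" binding_manifest="$3" timeout="${4:-120s}"

  # The && chain stops at the first failure. kubectl wait exits non-zero on
  # timeout, so the policy member is never applied against a service
  # account that has not become Ready.
  kubectl apply -f "$gsa_manifest" &&
    kubectl wait --for=condition=Ready "iamserviceaccount/${gsa_name}" --timeout="$timeout" &&
    kubectl apply -f "$binding_manifest"
}

# Example invocation (file and resource names are placeholders):
# apply_gsa_then_binding gsa.yaml my-gsa policy-member.yaml
```

Chaining the three commands keeps the ordering explicit, which is the part the workaround depends on.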
Hey @Jonpez2, to give some context on the scenario: in our testing, we were able to reproduce the issue in around 1 out of every 250 attempts (we've run the scenario well over 2000 times now). This is due to the underlying IAM API having a small chance of accepting a policy change for an already-deleted service account.
The changes we plan to put in place are related to changing the member:serviceAccount:[account]@[project].iam.gserviceaccount.com format in the policy to instead reference the IAMServiceAccount k8s resource explicitly, which would allow us to check the service account's state to see if the new one has been created before issuing any policy requests. As this would be a breaking change, we need to take some time in order to put the proper migration steps in place, and do not have a timeframe we can share at this moment.
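For context, a manifest pair that uses the string-based member format might look like the sketch below. Every name in it (my-gsa, my-project, my-bucket, the file names) is a placeholder. The bucket-level resourceRef is an assumption for illustration, not something taken from this thread.

```shell
# Write two illustrative Config Connector manifests to the current directory.
cat > gsa.yaml <<'EOF'
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: my-gsa
spec:
  displayName: Example service account
EOF

cat > policy-member.yaml <<'EOF'
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: my-gsa-object-admin
spec:
  # The member is a plain string, so nothing in this manifest ties it to
  # the IAMServiceAccount resource defined in gsa.yaml.
  member: serviceAccount:my-gsa@my-project.iam.gserviceaccount.com
  role: roles/storage.objectAdmin
  resourceRef:
    apiVersion: storage.cnrm.cloud.google.com/v1beta1
    kind: StorageBucket
    name: my-bucket
EOF
```

Because the member is only a string, the policy member cannot tell from its own spec whether the service account it names is the newly created one or the deleted one.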
IAM is also in the process of making a change (ETA not yet known) to better distinguish a permission attached to a deleted service account from a permission attached to an existing service account. If these changes go in, Config Connector would automatically handle this situation as expected, as the "declarative source of truth" would clearly represent an unambiguous permission on an existing GSA.
I think this is now fixed, as I haven't seen it in a very long time.
I agree
Hi! I'll talk to the team to validate this is resolved, and close out if so.
Thanks, and also thank you for the fix!
Hello!
I've validated that this issue is not yet fixed, unfortunately. That said it's a hack in the config connector to get around some propagation delay in the IAM API (deletion doesn't seem to propagate), so it could well have been fixed in the API already (an internal search didn't find anything to that end).
I'll keep this ticket open for tracking since no action has been taken on our side.
I recently got myself into a state where my service account as represented on the GCP web console did not agree with the services I was trying to access. Specifically, I used CC to create a service account, and assigned it a role roles/storage.objectAdmin. I then deleted the configconnector resources in my cluster and recreated from scratch. I had done this many times before, but in this case the service account lost the ability to download images from GCR, and thus nothing on my application cluster worked. I debugged this with GCP support, who pointed out that this is expected behaviour according to https://cloud.google.com/iam/docs/understanding-service-accounts#deleting_and_recreating_service_accounts. A manual invocation of 'gsutil iam ch ...' to add the permission fixed the problem. However a) this isn't a supportable approach, and b) it turned out that the account had lost many other roles as well.
It further turns out that manually going to the IAM page and deleting the service account, followed by creating the service account via CC, left me with a working install.
What should I do to get rid of these manual steps?