GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

IAMServiceAccount deletion/recreation ends up in a bad state #123

Open Jonpez2 opened 4 years ago

Jonpez2 commented 4 years ago

I recently got myself into a state where my service account as represented on the GCP web console did not agree with the services I was trying to access. Specifically, I used CC to create a service account, and assigned it a role roles/storage.objectAdmin. I then deleted the configconnector resources in my cluster and recreated from scratch. I had done this many times before, but in this case the service account lost the ability to download images from GCR, and thus nothing on my application cluster worked. I debugged this with GCP support, who pointed out that this is expected behaviour according to https://cloud.google.com/iam/docs/understanding-service-accounts#deleting_and_recreating_service_accounts. A manual invocation of 'gsutil iam ch ...' to add the permission fixed the problem. However a) this isn't a supportable approach, and b) it turned out that the account had lost many other roles as well.

It further turns out that manually going to the IAM page and deleting the service account, followed by creating the service account via CC, left me with a working install.

What should I do to get rid of these manual steps?

kibbles-n-bytes commented 4 years ago

Hey @Jonpez2, as you noted, there is some strange behavior on the underlying IAM API.

In the case of Config Connector, I am unable to reproduce this specific issue. I verified just now that, if both the service account and its IAM permission are represented as CC YAML, deleting both gives the expected behavior of deleting the service account and the permission, regardless of the order in which you issue the delete requests to the cluster (or which one CC happens to delete first, if both deletes are issued simultaneously). Similarly, recreating using the same YAML gives the expected behavior of attaching the permission to the new service account; if the policy member happens to be created first, it will fail to attach the permission until the new service account exists.

Would you be able to give sample YAML and reproduction steps?

Jonpez2 commented 4 years ago

It's intermittent, and I can't figure out what exactly caused it. I had an issue opened with the GCP folk about it, so there's a support person in there who was able to narrow down times and events very quickly to tell me what was going wrong. Is there a way I can connect you with them? That might be the most efficient approach...

kibbles-n-bytes commented 4 years ago

We're currently running some tests to see if we can catch it intermittently; we'll update with our findings. Could you tell support to reach out to the cnrm-oncall?

Jonpez2 commented 4 years ago

Will do

Jonpez2 commented 4 years ago

I believe support reached out and you guys are digging - anything to report?

Thanks for the time!

maqiuyujoyce commented 4 years ago

Hi @Jonpez2, after some testing, we successfully reproduced the issue. It is intermittent, so it is unlikely to happen most of the time. But to avoid hitting it entirely, you can create the resources in the following order:

  1. Create a GSA (Google service account) using Config Connector, and make sure it is Ready.
  2. Create an IAMPolicyMember for the GSA.

Meanwhile, we plan to look into a fix in Config Connector. We will let you know if there is any update.

Jonpez2 commented 4 years ago

Yeah manually working around it will work for the moment, but I'm looking forward to removing any error-prone manual steps... Thanks very much for your help!

Jonpez2 commented 4 years ago

Any luck solving this? Just don't want it to bite us in prod as we move towards fully automated deployment.

Thank you!

jcanseco commented 4 years ago

Hi @Jonpez2, we have plans to put out a proper fix, but it won't be out for some time. We can only recommend workarounds for now:

  1. Wait for the IAMServiceAccount to be ready before creating the IAMPolicyMember. If you want to do this in an automated fashion, you can use kubectl wait --for=condition=Ready iamserviceaccount/[RESOURCE_NAME].
  2. Alternatively, add a timed delay between deletion and recreation of your resources.
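
For anyone scripting the first workaround, here is a minimal sketch of the ordering, not an official tool. It assumes kubectl is on the PATH and already pointed at the cluster; the function name, the manifest file names, and the 120s timeout are hypothetical. Note that kubectl spells the flag --for=condition=Ready.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def kubectl(args: Sequence[str]) -> None:
    """Run kubectl with the given arguments; raises CalledProcessError on failure."""
    subprocess.run(["kubectl", *args], check=True)


def recreate_in_order(
    sa_name: str,
    sa_manifest: str,
    policy_manifest: str,
    run: Runner = kubectl,
    timeout: str = "120s",  # assumed value, tune for your environment
) -> List[List[str]]:
    """Apply the service account, wait for Ready, then apply the policy member."""
    steps = [
        # 1. Create the IAMServiceAccount through Config Connector.
        ["apply", "-f", sa_manifest],
        # 2. Block until the resource reports the Ready condition.
        ["wait", "--for=condition=Ready",
         f"iamserviceaccount/{sa_name}", f"--timeout={timeout}"],
        # 3. Only now create the IAMPolicyMember that refers to the account.
        ["apply", "-f", policy_manifest],
    ]
    for step in steps:
        run(step)  # with the default runner, a failed step aborts the sequence
    return steps


# Example call (hypothetical names):
# recreate_in_order("my-sa", "service-account.yaml", "policy-member.yaml")
```

Because the default runner uses check=True, a wait that times out raises an exception and the policy member is never applied, which is the point of the workaround.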

Do you think either of these workarounds could work for your use-case?

kibbles-n-bytes commented 4 years ago

Hey @Jonpez2, to give some context on the scenario: in our testing, we were able to reproduce the issue in around 1 out of every 250 attempts (we've run the scenario well over 2000 times now). This is due to the underlying IAM API having a small chance of accepting a policy change for an already-deleted service account.
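
To put the 1-in-250 figure in perspective, here is a rough back-of-the-envelope sketch. It assumes each delete/recreate cycle is an independent trial with the same failure chance, which is a simplification and not something our testing established; the function name is hypothetical.

```python
def chance_of_at_least_one(attempts: int, per_attempt: float = 1 / 250) -> float:
    """Probability of hitting the issue at least once in `attempts` cycles.

    Assumes independent attempts with a fixed per-attempt probability,
    so the chance of never hitting it is (1 - p) ** attempts.
    """
    return 1.0 - (1.0 - per_attempt) ** attempts


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} cycles: {chance_of_at_least_one(n):.1%}")
```

Under that assumption, a rare per-attempt failure still becomes likely for a pipeline that recreates resources often, which is why the ordering workaround is worth applying even though a single run will usually succeed.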

The changes we plan to put in place are related to changing the member:serviceAccount:[account]@[project].iam.gserviceaccount.com format in the policy to instead reference the IAMServiceAccount k8s resource explicitly, which would allow us to check the service account's state to see if the new one has been created before issuing any policy requests. As this would be a breaking change, we need to take some time in order to put the proper migration steps in place, and do not have a timeframe we can share at this moment.

IAM is also in the process of making a change (ETA not yet known) to better distinguish a permission attached to a deleted service account from a permission attached to an existing service account. If these changes go in, Config Connector would automatically handle this situation as expected, as the "declarative source of truth" would clearly represent an unambiguous permission on an existing GSA.

Jonpez2 commented 2 years ago

I think this is now fixed, as I haven't seen it in a very long time.

derekperkins commented 2 years ago

I think this is now fixed, as I haven't seen it in a very long time.

I agree

toumorokoshi commented 2 years ago

Hi! I'll talk to the team to validate this is resolved, and close out if so.

Jonpez2 commented 2 years ago

Thanks, and also thank you for the fix!

toumorokoshi commented 2 years ago

Hello!

I've validated that this issue is not yet fixed on our side, unfortunately. That said, a fix in Config Connector would be a hack to get around some propagation delay in the IAM API (deletion doesn't seem to propagate), so it could well have been fixed in the API already, although an internal search didn't find anything to that end.

I'll keep this ticket open for tracking since no action has been taken on our side.