aws-samples / aws-iam-identity-center-extensions

This solution is intended for enterprises that need a streamlined way of managing user access to their AWS accounts. Using this solution, your identity and access management teams can extend AWS SSO functionality by automating common access management and governance use cases
MIT License
65 stars 24 forks source link

V3.1.5 #89

Closed leelalagudu closed 2 years ago

leelalagudu commented 2 years ago

Issue #, if available: Fixes #88, Fixes #87, Fixes #86, Fixes #82

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

leelalagudu commented 2 years ago

@jmejco , @tamara-h , the PR's security checks would fail as the CI build project's config file does not yet have SupportNestedOU parameter. Please update the CI repository with this config value, and re-trigger the security checks build project.

allquixotic commented 2 years ago

Fails to self-mutate between current version in master and the new version because, in the OrgMain account, on updating the CF stack env-aws-sso-extensions-for-enterprise-orgEventsProcessorStack, envprocessTargetAccountSM fails because Resource handler returned message: "'arn:aws:iam::accountid:role/env-processTargetAccountSMRole' is not authorized to create managed-rule. (Service: AWSStepFunctions; Status Code: 400; Error Code: AccessDeniedException; Request ID: ... guid; ... etc

It's really a pain when the self-mutation fails :(

allquixotic commented 2 years ago

It created drift, but I manually attached the AWS managed policy CloudWatchEventsFullAccess to the env-processTargetAccountSMRole to resolve this issue.

See https://stackoverflow.com/questions/60612853/nested-step-function-in-a-step-function-unknown-error-not-authorized-to-cr

leelalagudu commented 2 years ago

It created drift, but I manually attached the AWS managed policy CloudWatchEventsFullAccess to the env-processTargetAccountSMRole to resolve this issue.

See https://stackoverflow.com/questions/60612853/nested-step-function-in-a-step-function-unknown-error-not-authorized-to-cr

I am not sure why the self-mutate is'nt picking this up. I was aware of this issue already, and had speicifically added https://github.com/aws-samples/aws-sso-extensions-for-enterprise/blob/8647fa894da05dd3de2d37709cc9896926ce4ff8/lib/stacks/pipelineStageStacks/org-events-processor.ts#L231 to handle this access issue.

@allquixotic , you are on the latest commit for main in 3.1.4 right? And, upgraded to my branch for 3.1.5? Also, in your deployment in orgmain, can you look at env-processTargetAccountSMRole and validate if a custom policy with the permissions described in the code snippeet above has been created? I wanted to see if the policy itself is not creating?

@jmejco , can you please validate the behaviour your end and see if the self-mutate picks up the IAM poilcy as part of the change set and applies it?

allquixotic commented 2 years ago

OK... unless I'm missing something, this PR actually makes my problem with S3 permission sets worse. Now I'm getting must have required property 'Sid' and must match a schema in anyOf for every statement in every inlinePolicyDocument. These are valid permission sets that used to work even in 3.1.4.

Sid is not a required property of an IAM policy statement, either in IAM itself, or in AWS SSO permission sets' inline policies.

allquixotic commented 2 years ago

Also, I'm still getting ThrottlingExceptions on S3 links_data updates. I am pushing a lot of them at once during a migration (somewhere around 300+) so maybe my test scenario is unusual, but the only way I can get through the throttling exceptions is to sleep for several seconds between each S3 upload.

allquixotic commented 2 years ago

@allquixotic , you are on the latest commit for main in 3.1.4 right?

Previously, yes.

And, upgraded to my branch for 3.1.5?

Yes.

Also, in your deployment in orgmain, can you look at env-processTargetAccountSMRole and validate if a custom policy with the permissions described in the code snippeet above has been created? I wanted to see if the policy itself is not creating?

Those permissions do not exist in envprocessTargetAccountSMRoleDefaultPolicy... and there is no other inline or managed policy attached to env-processTargetAccountSMRole.

allquixotic commented 2 years ago

The recent fixes (current as of commit id 2958f9d) get rid of all the errors I’ve been seeing. However, when updating an existing permission set that had no inline policy and no managed policies, when I try to add AWS managed policies in the S3 permission set, it doesn’t provide any error over SNS, but it doesn’t update.

Here is my permission set data:

{
  “permissionSetName”: “foo”, “sessionDurationInMinutes”: “720”, 
  “managedPoliciesArnList”: [“arn:aws:iam::aws:policy/AdministratorAccess”, “arn:aws:iam::aws:policy/AWSSupportAccess”],
  “inlinePolicyDocument”: {},
  “relayState”: “https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#”,
  “tags”: [ { “Key”: “something”, “Value”: “someval” } ]
}

The previously existing permission set in AWS SSO has zero managed policies and no inline policy.

Also, even after the latest commit (2958f9d), I still get rate limit exceeded when pushing to S3 many (new) permission sets, permission set updates, new links data, or links data updates. I have to sleep for about 9 seconds between S3 PutObject calls to keep it from getting rate limited.

For clarification, I’m not getting rate limited by S3; it’s the Lambda functions within SSO Extensions that get rate limited by AWS SSO, far as I can tell.

leelalagudu commented 2 years ago

Hi @allquixotic ,

I had re-factored the queueing design (now taking into consideration that Lambda-SQS trigger does not really abide by concurrency limits you set up for the lambda) such that using both a batch size of 1, along with message group ID's ranging between 0 and 9 (take the last digit of the account) and the queue being FIFO, this should theoretically stay within the 10 live calls threshold at any given time.

Rather than hyptohesising this further, we will be setting up load test environment with approximately ~50 accounts and validate /fix the behaviour.

Regarding the permission set scenario you described, using the same commit ID, I first created a PS with empty inline policy and managed policies array, and then updated it to reflect this:

{ "permissionSetName": "SysAdmin-ps", "sessionDurationInMinutes": "240", "managedPoliciesArnList": ["arn:aws:iam::aws:policy/AdministratorAccess", "arn:aws:iam::aws:policy/AWSSupportAccess"], "inlinePolicyDocument": {}, "relayState": "https://eu-west-1.console.aws.amazon.com/systems-manager/managed-instances?region=eu-west-1#", "tags": [ { "Key": "versionid", "Value": "01" }, { "Key": "team", "Value": "SysAdmins" } ] } , and this change propagated successfully in my environment and I could see the change trickle down through SSO. Can you please check this once again? Only because I am unable to re-produce this locally

allquixotic commented 2 years ago

Regarding the permission set scenario you described, using the same commit ID, I first created a PS with empty inline policy and managed policies array, and then updated it to reflect this:

...

, and this change propagated successfully in my environment and I could see the change trickle down through SSO. Can you please check this once again? Only because I am unable to re-produce this locally

I tried it again, and it still does not attach the managed policies. It doesn't throw any error over SNS either. Is there some specific function I should look for logs?

leelalagudu commented 2 years ago

Regarding the permission set scenario you described, using the same commit ID, I first created a PS with empty inline policy and managed policies array, and then updated it to reflect this:

...

, and this change propagated successfully in my environment and I could see the change trickle down through SSO. Can you please check this once again? Only because I am unable to re-produce this locally

I tried it again, and it still does not attach the managed policies. It doesn't throw any error over SNS either. Is there some specific function I should look for logs?

@allquixotic , when you are provisioning the changed permission set, can you see what's happening from logs in the log group /aws/lambda/env-permissionSetTopicProcessor please? This is where the delta is calculated and processed. If there are no logs populated here when you are pushing the updated PS through S3, can you check what's happening in the log group /aws/lambda/env-psCuHandler please? This is the user interface handler, and we will know if there's an issue at this layer.

Thank you, Leela

jjleigh commented 2 years ago

We have tested the Nested OU and account assignment to orgs with over 40 accounts and that is all working. An issue that did arise was doing account assignment at the root level with 126 accounts. This fails in the step function with the error Error - Organizations.InvalidInputException Cause - You specified an invalid value for nextToken. You must get the value from the response to a previous call to the API. (Service: Organizations, Status Code: 400, Request ID: 8979fd76-2be8-4ba7-9dc5-c1c5de413487)

Root mapping is only mapping to five accounts, no error in lambda or no error email. When the step function was checked the error above was noticed.

Mapping file - root%all%aws-aes-engg-sso-ps%aws-aes-engg-sso%GROUP%ssofile

step input for failing step below. I replaced the sensitive details

{ "action": "create", "entityType": "root", "instanceArn": "<instanceArn>", "permissionSetArn": "<instanceArn>", "principalId": "<principalId>", "principalType": "GROUP", "targetType": "AWS_ACCOUNT", "topicArn": "<topicArn>", "pageSize": 5, "waitSeconds": 2, "supportNestedOU": "true", "tagKey": "", "emptyString": "", "tagValues": "", "ou_id": "", "resourceTypeFilters": "organizations:account", "listRootAccounts": { "Accounts": [ { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-03-01T12:11:33.059Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-08-09T13:14:55.607Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-08-14T19:10:37.341Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-04-28T11:13:22.576Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-03-01T12:08:52.478Z", "Name": "<name>", "Status": "ACTIVE" } ], "NextToken": "H4sIAAAAAAAAADVU3U6aQRB9F65709s+R1+gSbkwbWqiiRdtmqCiIr8qWP8QRSgggooKouzMPNJ5hc4c0mSymT17dvbMxxl+ZTaya+srqz8ynz5+yKxn11a+fF/5mf36efVb1rEM7AX6BPPow9pc32Fd6BR6A7uBzoNj98QNNoHtwy5g27ATWAVWhBVIKENL0AOoEy6hHlewFqwJO4XtwTZhVVgd2oBeQM+Jn3Gts8gBrBTX9S/sGvYG68GmlDejAI8FTGGJp3MiTn4mOIH2oW3oK7QDHUOH5LTjVJvQd+gCuhNhOWiLj+4ROQo9EZ6fkVnglSa3rUjskPrzMFd+DPV+vUiRdVpskHx5gIyQRpAbyC2kC0mQAaSH1IfccWs8XeITbhckTyFtbueQGeSFuIMCeWf+BNmB1CB/kCqQBiQP2YRUIbuQAqQF2YJsR6Qc5B5ShxxBLkk4J9kraFSIJ/xKk6dOO4mj1EaaIt1FC2mIdByy0zXxPnFD6sTqbUonJKVHpAnXBZlzEg6RhI2047mUWHASGlxYlD0ItckbOUJqUGSd0aLOKyYXbHAfUqTsCtcyb5WQikg15mcRfte/iT3AxuFkHfGnv4M9/nfCgm4ZwG5pGKGT3dId2mxG1zWYu+dfYSMmXrBGc5ajVKx5aIXeuw0jhbGfWaTNmhXmbtFdGqbAF7usP2aS41AMeDqhnU75LsckDOzT0WW1ARu5ht4HogJ9gL7RliNoIu6gT3EPOuPao+ZxyFO/PoyIWZhwLoa0utfPU/bS/45U6fwy3X4V3elyQJYDnqPtd5lscfar/B8o8WtskVaDbmd+/wMn6ghzcgQAAA==" } }

leelalagudu commented 2 years ago

Hi @jjleigh ,

We're unable to reproduce the root issue that you are seeing with the step functions. We've tried the below use cases in 2 of our dev environments and are unable to reproduce the issue. @jmejco had done a stellar job of setting up an AWS org with ~ 60 accounts , so that we could stress test the solution and document processing teams for average loads. These tests were done as part of this validation

As this is specific to your environment, could you please validate the following:

leelalagudu commented 2 years ago

Hi @allquixotic , we are in the process of doing more stress tests on the solution, but so far we've validated ~ 300 account assignment create/delete operations and ensured that the throttle tuning works with 0 failures.

If it's feasible for you , could you try the same bulk load flow you were trying earlier with the latest commit ID please? Only if that's feasible for you.

Thank you, Leela

jjleigh commented 2 years ago

Hi @jjleigh ,

We're unable to reproduce the root issue that you are seeing with the step functions. We've tried the below use cases in 2 of our dev environments and are unable to reproduce the issue. @jmejco had done a stellar job of setting up an AWS org with ~ 60 accounts , so that we could stress test the solution and document processing teams for average loads. These tests were done as part of this validation

  • He triggered 300 create account assignments in parallel using root scope and did not see the exception you're seeing
  • He then triggered 300 delete acccount assignments in parallel using root scope and did not see the exception you're seeing
  • On a much smaller environment ~ 6 accounts , I used the same permission set name and group name that you're using and triggered root scope assignment with pageSize reduced to 1, and the step function looped through 6 times using nextToken parameter successfully.

As this is specific to your environment, could you please validate the following:

  • Re-run a root scope account assignment and see if you're seeing the same exception. If you are , can you take the nextToken value from the execution payload and run a CLI command with that to see if that works. For ex aws organizations list-accounts --page-size 5 --next-token <your token from execution payload> ?

We tried this and got a message saying: Cannot specify --no-paginate along with pagination arguements: --page-size

so we tried without the page size flag and got the same error from the step function that I shared above.

jmejco commented 2 years ago

@jjleigh @leelalagudu

The correct argument to provide the NextToken in the CLI is --starting-token.

allquixotic commented 2 years ago

Hi, with the latest commits, I uploaded one item to links_data per second and received no rate limit issues. This is much improved from previous commits.

However, I still can reproduce two issues that may be outside the scope of this PR:

  1. When deleting a permission set attached to an account, I get an SNS error email saying "Cannot delete permission set as there are existing links that reference the permission set". Why can't it remove those links before deleting it?

  2. When uploading a new copy of the deleted permission set (deleted from S3, not from SSO) with AWS managed policies, it doesn't add the AWS managed policies.

allquixotic commented 2 years ago

There's also a race condition, where, while waiting for accounts to be deprovisioned (which can take minutes if you have dozens of accounts provisioned for the same permission set), trying to delete the permission set won't work.

So to do something like clear the links_data and permission_set data in S3 for a given permission set, you first have to delete the links_data, wait, then once it's deprovisioned, delete the permission_set file.

leelalagudu commented 2 years ago

Hi @allquixotic ,

  1. When deleting a permission set from S3 that has already been provisioned to accounts through the solution, the S3 file would be deleted as it's a user action, however an error notification is sent out similar to what you've received, about linked account assignment.
  2. When uploading the same permission set copy to S3, the solution verifies there's no delta and does not do any changes.
  3. When uploading a modified permission set to S3 (executed after step 1) , change being added managed policies, the solution calculates this delta , updates the permission set, and re-provisions the updated permission set to all the provisioned accounts.

Thank you, Leela

allquixotic commented 2 years ago

I'm OK with most of what you wrote and agree, but I tried something completely new:

First, I deleted my permission set (after removing all the account links -- I had to do 9 of them by hand, because for some reason, deleting all the links data relative to this permset didn't delete all the account links, and none of them were established by hand)

Once confirming the permission set did not exist at all in AWS SSO, I re-uploaded the .json into permission_sets, and the account links into links_data. The link was root%all%PermSet-ps%GroupNameHere%GROUP%ssofile.

Then, confusingly, I neither received an SNS error notification about this, nor did the permission set get created.

So, something is failing at a low level with this particular permission set, so that it just won't create it, and won't tell me why it can't via the usual error email.

Here is the permission set verbatim, except that something in the permissionSetName is replaced with 16 lowercase and uppercase letters that are specific to our business case.

{
  "permissionSetName": "something",
  "sessionDurationInMinutes": "720",
  "managedPoliciesArnList": [
    "arn:aws:iam::aws:policy/AdministratorAccess",
    "arn:aws:iam::aws:policy/AWSSupportAccess"
  ],
  "inlinePolicyDocument": {},
  "relayState": "https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#",
  "tags": [
    {
      "Key": "versionid", 
      "Value": "01"
    }
  ]
}

I'll note that SSO Extensions does create and provision many other permission sets successfully, but this specific one just doesn't work for some reason. I'll check the log groups you suggested earlier when I have a chance.

leelalagudu commented 2 years ago

Thank you for the updates @allquixotic . When you do get a chance, please have a look at those log files and see what's happening with this specific permission set. If the issue persists, please can you create a new issue so that this could be tracked separately.

For now, the PR would be approved and merged in the next few days as all the target stories for the PR are now handled. Do let us know if you still see issues that are in the scope of this PR i.e. nested OU support (both for provisioning & self sustaining flows), handle throttling for any organizations load, updated permission set schema definitions to align with standard AWS SSO admin API.

Thank you, Leela

allquixotic commented 2 years ago

Thanks, I agree, this PR seems to resolve the issues it claims to. No need to hold the PR for other unrelated issues. Just reporting the results of my testing.

Tell you what; to make it easier to forget about the messages in this PR comments, I'll open a new issue with my findings about that bad permission set that isn't being created or updated.