aws-samples / aws-iam-identity-center-extensions

This solution is intended for enterprises that need a streamlined way of managing user access to their AWS accounts. Using this solution, your identity and access management teams can extend AWS SSO functionality by automating common access management and governance use cases.
MIT License

Permission sets failing to create with no SNS error #90

Open allquixotic opened 2 years ago

allquixotic commented 2 years ago

I wrote up an example of this failure in the PR notes at https://github.com/aws-samples/aws-sso-extensions-for-enterprise/pull/89

This is now happening for multiple permission sets, and majorly impacting our production rollout / migration from Directory Service.

When the permission set .json already exists in S3 but the permission set does not exist in AWS SSO, I get the following from env-permissionSetTopicProcessor:

status: InProgress, statusMessage: PermissionSet topic processor delete operation in progress

then, status: Completed, statusMessage: PermissionSet delete operation - no reference found, so not deleting again

Afterward, if I try to re-upload it, I get:

status: InProgress, statusMessage: PermissionSet topic processor update operation in progress

then, status: InProgress, statusMessage: PermissionSet update operation - Calculating delta

then, status: Completed, statusMessage: PermissionSet update operation - no delta determined, completing update operation

How can there be no "delta" when the permission set does not exist in SSO?

Simplest permission set that will reproduce this problem:

{
  "permissionSetName": "foo",
  "sessionDurationInMinutes": "720",
  "managedPoliciesArnList": ["arn:aws:iam::aws:policy/AdministratorAccess", "arn:aws:iam::aws:policy/AWSSupportAccess"],
  "inlinePolicyDocument": {},
  "relayState": "https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#",
  "tags": [ { "Key": "something", "Value": "someval" } ]
}

I feel like I'm stuck without a way to proceed, because permission sets just get silently ignored by SSO Extensions for reasons it won't tell me. All I can figure is that SSO Extensions holds some stale internal state that is out of sync with what is "live" in AWS SSO.

If this is the case, SSO Extensions needs to frequently and regularly sync its current picture of the live AWS SSO state with the actual state, to account for possible changes via the SCIM API and AWS API. This permission set may have been one that I deleted by hand, but SSO Extensions should not assume that any of its cache of the AWS SSO data is current and accurate.

leelalagudu commented 2 years ago

Hi @allquixotic, ack on the issue. Let me try to reproduce it with the flow you described, so that we can understand where the missing or faulty logic is. In a few days I will either update you with my analysis of where the issue might be, or reach out to you with questions about reproducing the bug.

With regard to the solution assuming its cache is the source of truth, this is the current design of the solution. However, we're acutely aware of the pitfalls we would run into with this assumption (out-of-band changes being made directly on AWS SSO, a privileged admin messing up the solution cache, etc.), and had already started working on a long-term fix for this.

We intend to release a "nightlyRun" feature with the solution, where the solution would run a discovery job on a nightly basis and determine whether the state on AWS SSO (permission sets, account assignments) deviates from the solution cache. Based on the config option you choose, we either automatically remediate the deviation (using the solution cache as source of truth) or send notifications on all the deviations so that solution admins can manually remediate them as they see fit.

As you can see, for the "nightlyRun" feature we make the assumption that the solution cache is the source of truth and ensure that AWS SSO reflects it. In a way, this is our attempt to streamline access management in AWS SSO through the solution. Of course, we are providing all of this in a configurable way, so you could either choose not to have this "nightlyRun" feature at all or have it run in a "notification" mode only.
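The discovery step described above can be sketched roughly as follows. This is only an illustration in Python, not the solution's actual code: the names `find_deviations` and `plan_actions`, the mode strings, and the dictionary shapes are all hypothetical. In practice the inputs would be read from the solution's DynamoDB tables and from the AWS SSO API.

```python
from typing import Dict, List

# Hypothetical shape: permission set name -> attributes. Not the solution's real schema.
PermissionSets = Dict[str, dict]


def find_deviations(cache: PermissionSets, live: PermissionSets) -> List[dict]:
    """List every permission set where the solution cache and live SSO disagree."""
    deviations = []
    for name in sorted(set(cache) | set(live)):
        if name not in live:
            deviations.append({"name": name, "kind": "missing_in_sso"})
        elif name not in cache:
            deviations.append({"name": name, "kind": "unknown_to_cache"})
        elif cache[name] != live[name]:
            deviations.append({"name": name, "kind": "attributes_differ"})
    return deviations


def plan_actions(deviations: List[dict], mode: str) -> List[str]:
    """Turn deviations into actions for the chosen mode (mode names are assumed)."""
    if mode == "notify":
        # Notification mode only reports; nothing in SSO is changed.
        return [f"notify admins: {d['name']} is {d['kind']}" for d in deviations]
    if mode == "remediate":
        # Assumption: the cache is the source of truth, so SSO is changed to match it.
        fixes = {
            "missing_in_sso": "create in SSO from cache",
            "unknown_to_cache": "delete from SSO",
            "attributes_differ": "update in SSO from cache",
        }
        return [f"{fixes[d['kind']]}: {d['name']}" for d in deviations]
    raise ValueError(f"unknown mode: {mode}")
```

Keeping detection separate from action planning means the same deviation list can drive either mode, which matches the configurable behaviour described above.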

We were almost ready with this feature, but ran into some CDK specific compatibility issues and @vpegg is working on fixing these compatibility issues.

We would like to ask if the description of the "nightlyRun" feature above fits your requirements? If you have any feedback on how this could be improved/extended, please do let us know. We're more than happy to include any reasonable changes into the feature as we are still in development phase.

Thank you, Leela

allquixotic commented 2 years ago

The "nightlyRun" feature above may help, but it would be good to have three options for discrepancies, much like a Git merge conflict;

I hit an issue before where deleting permission sets from S3 had no effect in AWS SSO (after giving sufficient time for them to be deleted, and after making sure all the links data were purged first), so I'm not sure if this is related. For those permission sets, I had to delete them manually using the AWS management console.

From my understanding, I've done everything I possibly can to force the AWS SSO Extensions source of truth to be updated to realize that it needs to put this permission set into AWS SSO:

If this is not sufficient, can you provide a list of tasks that must be performed to get a permission set back into AWS SSO via SSO Ex if the permission set was manually deleted?

leelalagudu commented 2 years ago

Hi @allquixotic, for the current issue, can you try this and let me know what you are seeing, please? To clarify, this is purely from a debug perspective and to help you with this specific permission set issue, and is not intended to be the way you deal with orphaned permission sets in the long term.

Once you've deleted any links data and the permission set file from S3, can you go to the target account and region and look in DynamoDB? You will notice a table called env-permissionSetArn. I suspect you have an entry there for the problematic permission set with an ARN value that is no longer valid. Can you delete this entry, along with the entry for the permission set from the env-permissionSet table, and retry the flow for the permission set, please?

allquixotic commented 2 years ago

So oddly enough, there is no partition in the env-permissionSetArnTable for the permission set I'm looking for. It does exist in env-permissionSetTable!

allquixotic commented 2 years ago

OK. I deleted the partition for the offending permission set from env-permissionSetTable, re-uploaded the file, and it imported.

Not only that, but the managed and inline policy specified in the JSON is correctly propagating to AWS SSO.

So the bug is when:

allquixotic commented 2 years ago

To completely fix all my problems of the general form of "I upload a links_data or permission_sets file to S3, but nothing happens", I had to also clear out the env-provisionedLinksTable and then re-upload all my links_data.

I also wrote a script (in Python, so you can't reuse it for your implementation, sorry) that compares what's in SSO live; what's in S3; and what's in Dynamo -- and if a permission set exists in Dynamo and S3 but not in SSO live, I tell it to delete all records relevant to that permission set from Dynamo, then I re-upload the permission set to S3.
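The comparison at the heart of that script reduces to set arithmetic. A minimal sketch, not the original script: the function name and its inputs are hypothetical, and gathering the three lists of names (from live SSO, from the S3 bucket, and from the DynamoDB table) is deliberately left out.

```python
from typing import Iterable, Set


def find_orphaned_permission_sets(
    in_sso: Iterable[str],
    in_s3: Iterable[str],
    in_dynamo: Iterable[str],
) -> Set[str]:
    """Return names that exist in both DynamoDB and S3 but not in live SSO.

    These are the candidates whose DynamoDB records would be deleted
    before re-uploading the permission set file to S3.
    """
    return (set(in_dynamo) & set(in_s3)) - set(in_sso)


if __name__ == "__main__":
    # Made-up names, for illustration only.
    orphans = find_orphaned_permission_sets(
        in_sso=["admins"],
        in_s3=["admins", "foo"],
        in_dynamo=["admins", "foo", "old"],
    )
    print(sorted(orphans))
```

A name that is only in DynamoDB (like `old` here) is not reported, since without the S3 file there is nothing to re-upload.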

leelalagudu commented 2 years ago

Thank you for the update, Sean. I will try to reproduce this and identify the root cause. You shouldn't need to do cleanups or write auxiliary scripts for this. This is really good debug info for us, so hopefully once the fix is in, you won't run into this orphaned permission set issue.

allquixotic commented 2 years ago

OK... if I completely clear out both links tables in Dynamo, clear out links_data in S3, then upload links_data in S3, SSO Ex creates the appropriate entries in the links tables in Dynamo, but it doesn't actually provision the permission set live. The permission set itself exists in both permission set tables, and the ARN in DynamoDB matches the ARN in live. Yet the permission set still says Not Provisioned.

In SQS, I have the env-linkManagerQueue.fifo and the env-linkManagerDLQ.fifo. The DLQ has 141 messages, 0 in flight. The queue has 1714 messages with 8 in flight.

When I did my links_data wipe, I didn't de-provision all the permission sets, so I will have a bunch of permission sets that are provisioned in SSO live, but don't show as provisioned in the Dynamo backend. Do you think this bug is happening because the code is trying again and again to provision links that are already provisioned, and not handling the new links that aren't provisioned?

Should I clear out both queues then try again?

allquixotic commented 2 years ago

OK, I did as I said above -- cleared out both links tables entirely; and purged both the DLQ and the FIFO linkManagerQueue. I then deleted and re-uploaded all the links_data in S3.

I ended up with:

This looks much healthier. My big concern before was that the number of messages in both queues was remaining static over many minutes (I waited over 15 minutes to see if the number of messages would change; it didn't).

A few minutes later: Now I'm seeing a decent rate of reduction in the number of messages in the FIFO. About 10 per 30 seconds, give or take.

About an hour later: The DLQ has one item in it for some reason, but eventually the FIFO processed all the pending links messages, which resulted in all of the permission sets, except one, getting provisioned correctly. Not sure what happened with the one. I have a links_data linking it to a specific account and group that I verified exists.

So when encountering this issue with links data, sometimes it looks like the queue gets stuck and you have to purge the queue also.
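One way to tell a stuck queue from a slow one is to sample the queue depth over time and check whether it moves. The sketch below is an assumption-laden illustration, not part of SSO Extensions: the function names are hypothetical, the 900-second default mirrors the 15-minute wait mentioned above, and the samples are expected to be (timestamp in seconds, message count) pairs, for example taken from the SQS `ApproximateNumberOfMessages` queue attribute.

```python
from typing import List, Optional, Tuple

# Each sample is (timestamp_seconds, messages_in_queue).
Sample = Tuple[float, int]


def drain_rate(samples: List[Sample]) -> Optional[float]:
    """Messages removed per second between the first and last sample."""
    if len(samples) < 2:
        return None  # not enough data to estimate a rate
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (n0 - n1) / (t1 - t0)


def looks_stuck(samples: List[Sample], min_window_s: float = 900.0) -> bool:
    """True if the queue is non-empty and its depth never changed over the window."""
    if len(samples) < 2:
        return False
    window = samples[-1][0] - samples[0][0]
    depths = {n for _, n in samples}
    return window >= min_window_s and len(depths) == 1 and samples[-1][1] > 0
```

A queue that reports the same non-zero depth for the whole window would be flagged, while one draining at roughly 10 messages per 30 seconds would not. Purging remains a manual decision; this only helps decide when to look.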