aws-samples / aws-iam-identity-center-extensions

This solution is intended for enterprises that need a streamlined way of managing user access to their AWS accounts. Using this solution, your identity and access management teams can extend AWS SSO functionality by automating common access management and governance use cases.
MIT License

Unhandled exceptions when upgrading to 3.1.7 #105

Closed · allquixotic closed this issue 1 year ago

allquixotic commented 2 years ago

When upgrading from 3.1.5 to 3.1.7 (by pushing the latest, unmodified code to my CodeCommit repository and adding the new required parameters to env.yaml), I now receive the following sequence of errors in env-permissionSetTopicProcessor:

Initiating permission set CRUD logic
Resolved instanceArn as arn:aws:sso:::instance/ssoins-(numbers and letters redacted)
Determined permission set operation is of type update
Determined that permission set exists
For requestID: (GUID), exception when exception name -> Unhandled exception occurred. Exception message is -> {}. Related data for the exception -> (my permission set name)

It does this for roughly 2/3 of my permission sets. No idea why. Another variant goes like this without error:

Initiating permission set CRUD logic
Resolved instanceArn as arn:aws:sso:::instance/ssoins-(numbers and letters redacted)
Determined permission set operation is of type update
Determined that permission set exists
calculating delta for permissionSet update operation
No delta determined for permissionSet update operation, completing update operation

I don't know how to debug this any further. This is with FunctionLogMode: "Debug".

For the record, our deployment used to have discrepancies between DynamoDB, S3, and live SSO for a lot of permission sets, but we completely wiped out DynamoDB and the AWS SSO permission sets and re-imported them from scratch. So now there really shouldn't be any permission sets that are inconsistent between DynamoDB, S3, and SSO, except for the ones that apply to the OrgMain account, which we "hand jammed" in SSO without creating them in S3.

Also, none of the permission sets that errored are for all accounts or for the OrgMain account -- they are just meant for individual accounts. These permission sets worked just fine on 3.1.5.

leelalagudu commented 2 years ago

Hi @allquixotic, ACK on the issue. Can you confirm whether this is true for any of the problematic permission sets: does either their old state or new state have no managed policies attached?

allquixotic commented 2 years ago

I figured out some more information:

The update to env-aws-sso-extensions-for-enterprise-ssoImportArtefactsPart2Stack failed when I upgraded from 3.1.5 to 3.1.7:

envparentSMResource was UPDATE_FAILED because

Received response status [FAILED] from custom resource. Message returned: error
Logs: /aws/lambda/env-updateCustomResourceHandler
    at Runtime.E [as handler] (/var/task/index.js:1:2656)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
(RequestId ... etc)

This is part of the CodePipeline stage env-aws-sso-extensions-for-enterprise-ssoImportArtefactsPart2Stack.Prepare

Tracing to that log file, we have:

{
  "handler": "updateCustomResource",
  "logMode": "Exception",
  ...
  "status": "FailedWithError",
  "statusMessage": "Custom resource update - stateMachine with execution arn: arn:aws:states:us-east-1:(one of our member accounts):execution:env-imnportCurrentConfigSM:(a GUID) failed with error. See error details in (member account) account, us-east-1 region"
}

So with the CodePipeline failing, it's not too surprising that the solution doesn't work right! There's one step in the pipeline that didn't even execute because this one failed.

allquixotic commented 2 years ago

Also: I confirmed that the permissionSetTopicProcessor failure only happens when the permission set does not have a managedPoliciesArnList key at all. Permission sets that seem to work have a managedPoliciesArnList array in their definition.

allquixotic commented 2 years ago

Further info: just defining an empty managedPoliciesArnList array does not solve the issue. I have to include at least one AWS managed policy ARN in the array for the update to be accepted.
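
For reference, the only shape the processor accepts for me looks roughly like the following. This is an illustrative sketch: apart from managedPoliciesArnList, the field names and values here are from memory and may not match the solution's actual S3 schema exactly, and the policy ARN is just an arbitrary AWS managed policy.

{
  "permissionSetName": "example-individual-account-ps",
  "sessionDurationInMinutes": "60",
  "managedPoliciesArnList": [
    "arn:aws:iam::aws:policy/ReadOnlyAccess"
  ],
  "inlinePolicyDocument": {}
}

Both "managedPoliciesArnList": [] and omitting the key entirely reproduce the unhandled exception.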

leelalagudu commented 2 years ago

Hi @allquixotic, thank you for the very detailed hypothesis. I think I know where the issue is: https://github.com/aws-samples/aws-iam-identity-center-extensions/blob/main/lib/lambda-functions/application-handlers/src/permissionSetTopicProcessor.ts#L522 does not handle the scenario where the permission set (old or new version) might have an empty managedPoliciesArnList array, which is causing this unhandled exception.
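
Roughly, the delta calculation needs a guard along the lines of the sketch below before it dereferences the managed policy arrays. This is illustrative only, with hypothetical types and names, not the actual code at that line:

// Illustrative sketch only: hypothetical types/names, not the repository's actual code.
// The point is that managedPoliciesArnList may be absent or empty in either version,
// so the diff must default it to [] instead of assuming a populated array.
interface PermissionSetObject {
  permissionSetName: string;
  managedPoliciesArnList?: Array<string>; // may be missing entirely in the S3 definition
}

function managedPolicyDelta(
  oldPermissionSet: PermissionSetObject,
  newPermissionSet: PermissionSetObject
): { add: Array<string>; remove: Array<string> } {
  const oldArns = oldPermissionSet.managedPoliciesArnList ?? [];
  const newArns = newPermissionSet.managedPoliciesArnList ?? [];
  return {
    add: newArns.filter((arn) => !oldArns.includes(arn)),
    remove: oldArns.filter((arn) => !newArns.includes(arn)),
  };
}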

I will push the fix for this once the name updates PR is approved.

The updateCustomResource handler issue, however, is a bit unclear to me. For the moment I am assuming it is a regression coming out of the current bug, but let us get to it once the fix for this one is in place.

Do you need any other data/info at this time?

allquixotic commented 2 years ago

I've given my team workaround instructions for the moment: pick a "strawman" (redundant) managed policy to assign, which clears the errors in the permissionSetTopicProcessor and lets us do permission set updates. It's not ideal, but we are up and running with our day-to-day work right now.

That gives us time to figure out a proper fix to this problem. Also, please note that I can still reproduce this problem if the managedPoliciesArnList key is entirely omitted from the permission set definition in S3.

I think the updateCustomResource handler problem is possibly related to this, but it may not be. Seeing the pipeline fail to complete after a Git push reads to me more like a regression in the conflict resolution between DynamoDB, S3, and "live" SSO.

It may turn out that the only proper way to address my updateCustomResource problem is to develop and merge the code that eliminates the DynamoDB layer entirely and uses the live AWS SSO state as the single source of truth on the service side (#94).

I'm not looking forward to it, but one possible solution I might have to enact is to:

- clear out the DynamoDB tables,
- clear out the SQS queues,
- clear out AWS SSO itself (delete all principals, then de-provision and delete all permission sets),
- download all the S3 permission set and link data to my system and then clear out the bucket,
- re-deploy the pipeline (which should succeed since there will be no state to work on),
- re-upload all the permission set and link data to S3, and
- re-import all the principals.

Gross, and it will certainly lead to an outage, so I'll have to do it on a Sunday morning or something.
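
For the "de-provision and then delete all permission sets" part, I'd probably script it against the SSO Admin API. A rough, untested sketch with the AWS SDK for JavaScript v3 (pagination and waiting on the asynchronous deletion statuses are left out, so treat it as an outline rather than something to run as-is):

// Rough, untested sketch: wipe all IAM Identity Center (AWS SSO) permission sets.
// Pagination is omitted, and DeleteAccountAssignment is asynchronous, so a real
// script should poll DescribeAccountAssignmentDeletionStatus before deleting
// each permission set.
import {
  SSOAdminClient,
  ListInstancesCommand,
  ListPermissionSetsCommand,
  ListAccountsForProvisionedPermissionSetCommand,
  ListAccountAssignmentsCommand,
  DeleteAccountAssignmentCommand,
  DeletePermissionSetCommand,
} from "@aws-sdk/client-sso-admin";

const client = new SSOAdminClient({ region: "us-east-1" });

async function wipePermissionSets(): Promise<void> {
  // Resolve the SSO instance ARN
  const { Instances } = await client.send(new ListInstancesCommand({}));
  const instanceArn = Instances?.[0]?.InstanceArn;
  if (!instanceArn) throw new Error("No IAM Identity Center instance found");

  const { PermissionSets } = await client.send(
    new ListPermissionSetsCommand({ InstanceArn: instanceArn })
  );

  for (const permissionSetArn of PermissionSets ?? []) {
    // Every account this permission set is provisioned to
    const { AccountIds } = await client.send(
      new ListAccountsForProvisionedPermissionSetCommand({
        InstanceArn: instanceArn,
        PermissionSetArn: permissionSetArn,
      })
    );
    for (const accountId of AccountIds ?? []) {
      // Remove every principal assignment for this permission set in this account
      const { AccountAssignments } = await client.send(
        new ListAccountAssignmentsCommand({
          InstanceArn: instanceArn,
          AccountId: accountId,
          PermissionSetArn: permissionSetArn,
        })
      );
      for (const assignment of AccountAssignments ?? []) {
        await client.send(
          new DeleteAccountAssignmentCommand({
            InstanceArn: instanceArn,
            PermissionSetArn: permissionSetArn,
            TargetType: "AWS_ACCOUNT",
            TargetId: accountId,
            PrincipalType: assignment.PrincipalType,
            PrincipalId: assignment.PrincipalId,
          })
        );
      }
    }
    // Once assignments are gone (and their deletions have completed), drop the permission set
    await client.send(
      new DeletePermissionSetCommand({
        InstanceArn: instanceArn,
        PermissionSetArn: permissionSetArn,
      })
    );
  }
}

wipePermissionSets().catch(console.error);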

leelalagudu commented 1 year ago

Hi @allquixotic, can you navigate to your OrgMain account and look at the last run of env-importCurrentConfigSM to see where the state machine failed? The symptom you're seeing occurs when the current-configuration import state machine is triggered but has not completed successfully, hence the ask to validate what and where the state machine is failing.

allquixotic commented 1 year ago

I looked through some of the state machine logs but couldn't find any distinct errors. I saw a generic error in importCurrentConfigSM that some downstream state machine had failed, but that one was looping over everything and spitting out tens of thousands of lines of logs, and I couldn't find the root cause.

I just ended up turning off the import feature and that resolved my pipeline issues.