benkehoe / aws-sso-util

Smooth out the rough edges of AWS SSO (temporarily, until AWS makes it better).
Apache License 2.0
953 stars 72 forks

On AWS Organizations with many accounts populate fails. #112

Open arnvid opened 1 year ago

arnvid commented 1 year ago

We are seeing the error TooManyRequestsException when calling the ListAccountRoles operation for our AWS Organization.

Command line used: aws-sso-util configure populate -r eu-west-1 --force-refresh -u https://d-xxxxxxxxxx.awsapps.com/start

Logging in to https://d-xxxxxxxxx.awsapps.com/start
Login with IAM Identity Center required.
Attempting to open the authorization page in your default browser.
If the browser does not open or you wish to use a different device to authorize this request, open the following URL:

https://device.sso.eu-west-1.amazonaws.com/

Then enter the code:

XXXX-XXXX

Gathering accounts and roles
Traceback (most recent call last):
  File "/Users/arnvid/.local/bin/aws-sso-util", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/aws_sso_util/populate_profiles.py", line 342, in populate_profiles
    response = client.list_account_roles(**list_role_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arnvid/.local/pipx/venvs/aws-sso-util/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.TooManyRequestsException: An error occurred (TooManyRequestsException) when calling the ListAccountRoles operation (reached max retries: 4): HTTP 429 Unknown Code

iainelder commented 1 year ago

@arnvid How often does it happen to you?

The biggest identity center instance I work with just now gives me about 200 roles.

Here aws-sso-util sometimes gives me the same error. It almost always works when I retry the command.

iainelder commented 1 year ago

As far as I can tell aws-sso-util doesn't do anything that should obviously exceed a rate limit.

It instantiates the SSO client with implicit default retry handling.

https://github.com/benkehoe/aws-sso-util/blob/9290b8436d673be1b85f22c1e0e37ef332200d8b/cli/src/aws_sso_util/populate_profiles.py#L305-L309

It starts a loop to call ListAccountRoles.

https://github.com/benkehoe/aws-sso-util/blob/9290b8436d673be1b85f22c1e0e37ef332200d8b/cli/src/aws_sso_util/populate_profiles.py#L341-L342

It continues the loop until there are no more result pages.

https://github.com/benkehoe/aws-sso-util/blob/9290b8436d673be1b85f22c1e0e37ef332200d8b/cli/src/aws_sso_util/populate_profiles.py#L361-L365
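For readers who don't want to follow the links, the loop described above has roughly this shape. This is a minimal sketch, not the code from populate_profiles.py: `list_all_account_roles` and `FakeSSOClient` are hypothetical names, and the fake client only stands in for boto3's SSO client so the example runs offline. The `accessToken`, `accountId`, `nextToken` and `roleList` keys are the ones the ListAccountRoles API uses.

```python
def list_all_account_roles(client, access_token, account_id):
    """Collect every role for one account by following nextToken."""
    args = {"accessToken": access_token, "accountId": account_id}
    roles = []
    while True:
        # One ListAccountRoles request per page; each request counts
        # against whatever rate limit the service applies.
        response = client.list_account_roles(**args)
        roles.extend(response["roleList"])
        next_token = response.get("nextToken")
        if not next_token:
            return roles
        args["nextToken"] = next_token


class FakeSSOClient:
    """Hypothetical stand-in for the SSO client; serves canned pages."""

    def __init__(self, pages):
        self.pages = pages
        self.calls = 0

    def list_account_roles(self, **kwargs):
        # The fake uses the page index as its pagination token.
        index = int(kwargs.get("nextToken", 0))
        self.calls += 1
        page = {"roleList": self.pages[index]}
        if index + 1 < len(self.pages):
            page["nextToken"] = str(index + 1)
        return page
```

Note that the loop runs once per account, so an organization with hundreds of accounts makes at least that many requests back to back.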

The Identity Center documentation says its APIs have a collective throttle maximum of 20 transactions per second. I'm unsure what that means in practice. Does that limit apply to all users of the ListAccountRoles API? It seems like a low limit.

iainelder commented 1 year ago

You may be able to avoid the throttling errors by setting environment variables to control the SDK retry behavior.

I'd try something like this:

export AWS_RETRY_MODE=standard AWS_MAX_ATTEMPTS=100

The standard retry mode classes HTTP status code 429 as a transient error and so would automatically retry.
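To illustrate what "automatically retry" means here, this is a rough sketch of retrying a throttled call with capped exponential backoff and jitter. It is not botocore's implementation. `ThrottledError` and `call_with_backoff` are hypothetical names, and the 20-second cap is an assumption based on my reading of the boto3 retry documentation for standard mode.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for botocore's TooManyRequestsException."""


def call_with_backoff(func, max_attempts=100, base=1.0, cap=20.0, sleep=time.sleep):
    """Call func, retrying on ThrottledError with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # Wait a random time up to min(cap, base * 2^(attempt - 1)).
            delay = random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
            sleep(delay)
```

With a high attempt limit the command gets slower instead of failing, which is the trade-off the environment variables above make.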

arnvid commented 1 year ago

> @arnvid How often does it happen to you?
>
> The biggest identity center instance I work with just now gives me about 200 roles.
>
> Here aws-sso-util sometimes gives me the same error. It almost always works when I retry the command.

It happens every time on our production SSO. About 396 profiles before adding PIM'd roles.

arnvid commented 1 year ago

With these added I can get through:

➜ ~ export AWS_RETRY_MODE=standard
➜ ~ export AWS_MAX_ATTEMPTS=100

Gathering accounts and roles
Writing 399 profiles to /Users/arnvid/.aws/config

iainelder commented 1 year ago

Thanks for confirming that those environment variables allow you to write the profiles.

And thanks for sharing info about the number of roles you have. My guess is that it's more likely to happen with a longer list of roles.

I think the next step would be to set up a lab environment with a variable number of roles between 100 and 1000 and see whether it's more likely at the bigger end of the scale.

If someone can reproduce the throttling error in a lab environment then maybe they could adjust the paging behavior to work without needing the user to set any environment variables.

iainelder commented 11 months ago

Someone reported the same problem in https://github.com/benkehoe/aws-sso-util/issues/97#issuecomment-1772118296.

Earlier I proposed this solution:

> If someone can reproduce the throttling error in a lab environment then maybe they could adjust the paging behavior to work without needing the user to set any environment variables.

Nice idea, but it sounds like a lot of work to compensate for bad API behavior on the AWS side.

I propose we configure the client that calls ListAccountRoles with the same retry behavior that the environment variables produce, so that no one has to think about this.
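One low-effort way to get there, sketched under assumptions: set the same two variables as defaults inside the process before the SSO client is created, so a user's own values still win. `apply_retry_defaults` and `RETRY_DEFAULTS` are hypothetical names, not part of aws-sso-util. The other route would be to pass a botocore `Config(retries={"mode": "standard", "total_max_attempts": 100})` to the client, since `total_max_attempts` is the setting that corresponds to AWS_MAX_ATTEMPTS.

```python
import os

# Same values as the workaround in this thread; both are assumptions
# about what a sensible default would be, not measured optima.
RETRY_DEFAULTS = {"AWS_RETRY_MODE": "standard", "AWS_MAX_ATTEMPTS": "100"}


def apply_retry_defaults(environ=os.environ):
    """Set the retry variables unless the user already chose values."""
    for name, value in RETRY_DEFAULTS.items():
        environ.setdefault(name, value)
    return environ
```

This would need to run before the session or client is created, because the SDK reads these settings at that point.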

benkehoe commented 10 months ago

I'm nearing the end of my time off, and I plan on fully re-engaging with all of my projects, but realistically it means nothing is going to be addressed until early next year.

benkehoe commented 4 months ago

Finally back to this. I think the right way to solve this is to spread out the calls to match the right API rate limit. Do we know what that is?

iainelder commented 4 months ago

The Identity Center Quotas page says only this about rates:

IAM Identity Center APIs have a collective throttle maximum of 20 transactions per second (TPS). The CreateAccountAssignment has a maximum rate of 10 outstanding async calls. These quotas cannot be changed.

It's not clear to me whether 20 TPS applies only to the instance APIs or also to the Portal APIs.

Even if it does apply to the Portal, a single client's limit can be a lot less than 20 TPS.

It's not clear to me what "collective throttle maximum" means. Is it one quota for all clients of one Identity Center instance?

Any way to get a clarification from the Identity Center service team on this?

iainelder commented 4 months ago

> a single client's limit can be a lot less than 20 TPS.

I haven't measured that. It was just a guess and it may be wrong.

For what it's worth, Granted uses a rate limit of 20 TPS to call ListAccountRoles.

// Setting the rate limit to 20 since IAM Identity Center APIs have a throttle maximum of 20 transactions per second (TPS) (https://docs.aws.amazon.com/singlesignon/latest/userguide/limits.html)
rl := uberratelimit.New(20)

I can't read Go well enough to understand how it handles throttling errors.
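For comparison, spreading out the calls in Python could look something like this. It is a minimal sketch of a pacing limiter, not a port of Granted's code or of the Go ratelimit library it uses. `RateLimiter` and `take` are hypothetical names, and 20 per second is only the documented figure quoted above, which may not be the limit a single Portal client actually gets.

```python
import time


class RateLimiter:
    """Space calls at least 1/rate seconds apart."""

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # no call has been made yet

    def take(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Usage would be to create `limiter = RateLimiter(20)` once and call `limiter.take()` immediately before each `list_account_roles` request. A limiter like this only reduces how often throttling happens; it doesn't replace retry handling if the real quota turns out to be lower.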