duo-labs / cloudmapper

CloudMapper helps you analyze your Amazon Web Services (AWS) environments.
BSD 3-Clause "New" or "Revised" License

AWS API Limits causing Throttling and RequestLimitExceeded errors #598

Closed jleopold28 closed 4 years ago

jleopold28 commented 4 years ago

I am running cloudmapper in a large AWS account that is constantly running close to the API limit. I am getting errors with the collect command that I believe would be fixed by increasing the boto retry limit. Is there a way I can increase the default retries of 4 with the current cloudmapper setup?

We ran into a similar issue when running awslimitchecker and fixed it by increasing the number of retries. Here is the change: https://github.com/jantman/awslimitchecker/pull/445/files#diff-ae1d60720c71355da3fddb7fa8bac222R101

I am willing to cut a PR with a change that would increase the botocore maxretries.

Here is some sanitized output of my issue:

Summary: 68060 APIs called. 66 errors
Failures:
  rds.describe_db_snapshot_attributes({'DBSnapshotIdentifier': 'XXX'}): An error occurred (Throttling) when calling the DescribeDBSnapshotAttributes operation (reached max retries: 4): Rate exceeded
  autoscaling.describe_policies({}): An error occurred (Throttling) when calling the DescribePolicies operation (reached max retries: 4): CloudWatchAlarm Rate exceeded
  cloudformation.get_template({'StackName': 'XXX'}): An error occurred (Throttling) when calling the GetTemplate operation (reached max retries: 4): Rate exceeded
  cloudformation.get_template({'StackName': 'XXX'}): An error occurred (Throttling) when calling the GetTemplate operation (reached max retries: 4): Rate exceeded
  cloudformation.get_template({'StackName': 'XXX'}): An error occurred (Throttling) when calling the GetTemplate operation (reached max retries: 4): Rate exceeded
  cloudformation.get_template({'StackName': 'XXX'}): An error occurred (Throttling) when calling the GetTemplate operation (reached max retries: 4): Rate exceeded
  cloudformation.describe_stack_resources({'StackName': 'XXX'}): An error occurred (Throttling) when calling the DescribeStackResources operation (reached max retries: 4): Rate exceeded
  cloudformation.describe_stack_resources({'StackName': 'XXX'}): An error occurred (Throttling) when calling the DescribeStackResources operation (reached max retries: 4): Rate exceeded
  cloudformation.describe_stack_resources({'StackName': 'XXX'}): An error occurred (Throttling) when calling the DescribeStackResources operation (reached max retries: 4): Rate exceeded
  cloudformation.describe_stack_resources({'StackName': 'XXX'}): An error occurred (Throttling) when calling the DescribeStackResources operation (reached max retries: 4): Rate exceeded
  cloudformation.describe_stack_resources({'StackName': 'XXX'}): An error occurred (Throttling) when calling the DescribeStackResources operation (reached max retries: 4): Rate exceeded
  ec2.describe_snapshot_attribute({'SnapshotId': 'XXX', 'Attribute': 'createVolumePermission'}): An error occurred (RequestLimitExceeded) when calling the DescribeSnapshotAttribute operation (reached max retries: 4): Request limit exceeded.
  ec2.describe_snapshot_attribute({'SnapshotId': 'XXX', 'Attribute': 'createVolumePermission'}): An error occurred (RequestLimitExceeded) when calling the DescribeSnapshotAttribute operation (reached max retries: 4): Request limit exceeded.
  ec2.describe_snapshot_attribute({'SnapshotId': 'XXX', 'Attribute': 'createVolumePermission'}): An error occurred (RequestLimitExceeded) when calling the DescribeSnapshotAttribute operation (reached max retries: 4): Request limit exceeded.
  ec2.describe_snapshot_attribute({'SnapshotId': 'XXX', 'Attribute': 'createVolumePermission'}): An error occurred (RequestLimitExceeded) when calling the DescribeSnapshotAttribute operation (reached max retries: 4): Request limit exceeded.
  ecs.describe_tasks({'cluster': 'arn:aws:ecs:XXX', 'tasks': ['XXX']}): An error occurred (ThrottlingException) when calling the DescribeTasks operation (reached max retries: 4): Rate exceeded
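
To see which services account for the throttling, the failure lines above can be tallied by service and error code. This is only a rough sketch: the regular expression is inferred from the sanitized output shown here, and `tally_failures` is a hypothetical helper, not part of cloudmapper.

```python
import re
from collections import Counter

# Pattern inferred from the sanitized output above (an assumption, not a
# documented cloudmapper format): "<service>.<method>(<args>): An error
# occurred (<ErrorCode>) when calling ..."
FAILURE_RE = re.compile(
    r"^\s*(?P<service>\w+)\.(?P<method>\w+)\(.*?\): "
    r"An error occurred \((?P<code>\w+)\) when calling"
)


def tally_failures(lines):
    """Count failure lines per (service, error code); skip non-matching lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line)
        if match:
            counts[(match["service"], match["code"])] += 1
    return counts


sample = [
    "Failures:",
    "  cloudformation.get_template({'StackName': 'XXX'}): An error occurred (Throttling) when calling the GetTemplate operation (reached max retries: 4): Rate exceeded",
    "  cloudformation.describe_stack_resources({'StackName': 'XXX'}): An error occurred (Throttling) when calling the DescribeStackResources operation (reached max retries: 4): Rate exceeded",
    "  ec2.describe_snapshot_attribute({'SnapshotId': 'XXX', 'Attribute': 'createVolumePermission'}): An error occurred (RequestLimitExceeded) when calling the DescribeSnapshotAttribute operation (reached max retries: 4): Request limit exceeded.",
    "  ecs.describe_tasks({'cluster': 'arn:aws:ecs:XXX', 'tasks': ['XXX']}): An error occurred (ThrottlingException) when calling the DescribeTasks operation (reached max retries: 4): Rate exceeded",
]

for (service, code), count in sorted(tally_failures(sample).items()):
    print(f"{service:15} {code:22} {count}")
```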

0xdabbad00 commented 4 years ago

My gut reaction is that if you're running into this much rate limiting, you should treat the cause of the problem rather than the symptom that CloudMapper is exposing. That is, it seems to indicate that something else is running wild in your environment and needs to be better tuned, or that you should talk with AWS about raising some of your limits. I've not heard of this being a problem for other people.

jantman commented 4 years ago

@0xdabbad00

Scott,

I work with @jleopold28 (full disclosure, I'm also the maintainer of awslimitchecker) and have a bit more detail on our issues. We've worked with Enterprise Support, as well as some of the service teams, on these throttling issues multiple times. They do admit that it's a relatively rare problem, but we've had most of our API rate limits increased to the maximum. The one particular account and region where we're seeing this has over 1,000 Elastic Beanstalk environments and 3,500 CloudFormation Stacks. The vast majority of the API queries in this account are made by Beanstalk itself, not any third-party tooling (Beanstalk environments themselves - health checking, etc. - still count against API rate limiting).

Looking at https://github.com/boto/botocore/pull/1260, where botocore exposed max_retries configuration to users via the Config object, and at the issues and other PRs linked from it, there's quite a bit of evidence that other people also experience API rate limiting... though perhaps no other cloudmapper users do, or no other users have found enough value in cloudmapper to strongly desire a fix.
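
For reference, a minimal sketch of what that looks like in practice. In botocore the setting is passed as `retries={"max_attempts": N}` on `botocore.config.Config`. The helper names, the value of 10, and the service and region in the usage comment are assumptions for illustration only; the boto3 import is deferred so the settings helper can be exercised without AWS libraries installed.

```python
def retry_config_kwargs(max_attempts=10):
    """Build keyword arguments for botocore.config.Config.

    The default of 10 is an arbitrary illustrative value, not a recommendation.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return {"retries": {"max_attempts": max_attempts}}


def make_client(service_name, region_name, max_attempts=10):
    """Create a boto3 client whose botocore retry limit is raised."""
    # Imported lazily; requires boto3 and botocore to be installed.
    import boto3
    from botocore.config import Config

    config = Config(**retry_config_kwargs(max_attempts))
    return boto3.client(service_name, region_name=region_name, config=config)


# Hypothetical usage (needs AWS credentials):
#   cfn = make_client("cloudformation", "us-east-1", max_attempts=10)
#   cfn.describe_stack_resources(StackName="XXX")
print(retry_config_kwargs(10))
```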

We'd be more than happy to open a pull request with a simple fix for this, likely via environment variables, but wanted to see if you have any feelings on implementation. If not, we'll end up just running off of a fork with this one fix. My hope is that a fix could be incorporated upstream in case anyone else attempts to run cloudmapper in accounts that can't tolerate rapid listing of resources without hitting rate limits.

Thanks, Jason

0xdabbad00 commented 4 years ago

Thank you for the explanation, @jantman, and thank you for pointing out that botocore has exposed the max_retries config. I'll merge in a PR if you send one to expose a config option of some sort for this. I'm surprised boto doesn't just pull one from the environment, but given that it doesn't, I think adding it as a command-line option to collect would make sense in this part of the code: https://github.com/duo-labs/cloudmapper/blob/ecc8e0153a6366d04faecaa4897982943764568e/commands/collect.py#L476
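
A rough sketch of the command-line approach, under stated assumptions: the `--max-attempts` flag name, the helper functions, and the parser layout are hypothetical and are not taken from cloudmapper's `collect.py`. The default of 4 mirrors the "reached max retries: 4" in the reported output.

```python
import argparse

DEFAULT_MAX_ATTEMPTS = 4  # matches "reached max retries: 4" in the report


def positive_int(value):
    """argparse type that accepts only integers >= 1."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def build_parser():
    # Hypothetical parser; the real collect command defines its own arguments.
    parser = argparse.ArgumentParser(prog="collect")
    parser.add_argument(
        "--max-attempts",
        type=positive_int,
        default=DEFAULT_MAX_ATTEMPTS,
        help="Override the botocore retry limit (default: %(default)s)",
    )
    return parser


def retry_settings(args):
    """Translate parsed arguments into botocore's `retries` dictionary."""
    return {"max_attempts": args.max_attempts}


def make_client(service_name, region_name, args):
    """Create a client using the retry limit from the command line."""
    # Imported lazily; requires boto3 and botocore to be installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        service_name,
        region_name=region_name,
        config=Config(retries=retry_settings(args)),
    )


args = build_parser().parse_args(["--max-attempts", "10"])
print(retry_settings(args))
```

Keeping the default equal to the existing behavior means users who never pass the flag see no change, while throttled accounts can opt in to more retries.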

jantman commented 4 years ago

Thanks so much, @0xdabbad00. We'll get to work on a PR for that.

jleopold28 commented 4 years ago

@0xdabbad00 I have opened a PR to support boto max attempts. https://github.com/duo-labs/cloudmapper/pull/614

0xdabbad00 commented 4 years ago

Thanks! I've merged it. I should be cutting a new release this week as I need to update the CDK for the nightly auditor.