aws-solutions / workload-discovery-on-aws

Workload Discovery on AWS is a solution to visualize AWS Cloud workloads. With it you can build, customize, and share architecture diagrams of your workloads based on live data from AWS. The solution maintains an inventory of the AWS resources across your accounts and regions, mapping their relationships and displaying them in the user interface.
https://aws.amazon.com/solutions/implementations/workload-discovery-on-aws/
Apache License 2.0

Rate Limited Exceeded in ORGANIZATIONS mode #478

Closed davemorrow-telus closed 11 months ago

davemorrow-telus commented 1 year ago

Describe the bug I've successfully deployed the CFN stacks using AWS Organizations settings.

To Reproduce No Resources. No accounts. Nothing. stack_params WD

svozza commented 1 year ago

For an org with >100k resources the discovery process might be running out of memory as the default is only 2MB. To check if that is the case, follow these steps:

  1. Sign in to the Amazon Elastic Container Service console.
  2. Select the cluster named workload-discovery-cluster.
  3. Choose the Tasks tab.
  4. Select the Stopped button in the Desired task status panel.
  5. In the Last Status column check for the error message OutOfMemoryError: Container killed due to memory usage

You can increase the memory using the Memory CFN parameter.
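The console check above can also be scripted. The following is a minimal sketch, not part of the solution itself: it assumes boto3 is installed with credentials for the account running Workload Discovery, and the function names `find_oom_tasks` and `stopped_tasks` are hypothetical. It does not paginate, so it only sees the first page of stopped tasks.

```python
from typing import Iterable

# Text that ECS reports when a container is killed for exceeding its memory.
OOM_MARKER = "OutOfMemoryError"


def find_oom_tasks(tasks: Iterable[dict]) -> list[str]:
    """Return ARNs of tasks whose stop reasons mention an out-of-memory kill.

    `tasks` is expected to have the shape of the "tasks" list returned by
    the ECS DescribeTasks API.
    """
    oom_arns = []
    for task in tasks:
        # Check both the task-level and the container-level reasons.
        reasons = [task.get("stoppedReason") or ""]
        reasons += [c.get("reason") or "" for c in task.get("containers", [])]
        if any(OOM_MARKER in reason for reason in reasons):
            oom_arns.append(task["taskArn"])
    return oom_arns


def stopped_tasks(cluster: str = "workload-discovery-cluster") -> list[dict]:
    """Fetch stopped tasks for the cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so find_oom_tasks works without it

    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
    if not arns:
        return []
    return ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
```

Calling `find_oom_tasks(stopped_tasks())` would list any recently stopped discovery tasks that were killed for memory usage. ECS only keeps stopped tasks visible for a limited time, so an empty result does not rule out an earlier failure.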

It's also quite likely you'll need to increase the DB size too. I would recommend using Neptune Serverless as the Neptune DB instance class for the initial load. Once the initial load is done, you can right-size the database based on the DB's CPU usage. There's some info here on how to do that: https://aws-solutions.github.io/workload-discovery-on-aws/workload-discovery-on-aws/2.0/debugging-the-discovery-component.html.

There could also be other issues with the discovery process. Instructions on how to get those logs are in the section of the documentation beginning with the phrase To retrieve the logs for the discovery component:.

davemorrow-telus commented 1 year ago

Thanks ! This helped at least point me to the error.

The Task is erroring with a "Rate Exceeded" error.

svozza commented 1 year ago

Yes, this is an issue with rate limiting from the AWS Organizations ListAccounts API. I am currently working on a patch to fix this issue but it will be a few weeks until it is released. I could give you a workaround, but to use it you would have to download the source from GitHub, rebuild the container with the code I've given you, and push it to ECR. After that you would create a new task definition and tell the scheduled task to use it.

davemorrow-telus commented 1 year ago

Don't worry about it. If it's a known issue I can simply wait for the newer version with the fix in place.

Thanks for helping out! I will keep my eyes out for the newer version.

davemorrow-telus commented 1 year ago

Question: Can the Org Unit ID be an OU or does it have to be the root ID?

My thinking here is that if we could limit discovery to a smaller OU, I might not hit the rate limiting.

svozza commented 1 year ago

Good stuff. As soon as the release is out, I will ping you in this issue. I'm also going to change the title of this issue to reflect the bug you've found, in case others come searching.

svozza commented 1 year ago

> Question: Can the Org Unit ID be an OU or does it have to be the root ID?
>
> My thinking here is that if we could limit discovery to a smaller OU, I might not hit the rate limiting.

No, it doesn't have to be the root OU, it can be any. The only issue is that you will see errors in the logs complaining about the IAM role not being deployed in the accounts in the parent OUs. This is because ListAccounts returns all the accounts in the Org, not just the ones from the non-root OU you've selected. The errors won't interfere with how the discovery process works, they'll just make the logs noisy.
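To tell which of those log errors are expected, it can help to know exactly which accounts sit under the chosen OU. The sketch below is an illustration only and is not how the discovery process works (as noted above, it relies on ListAccounts). It walks an OU tree with the Organizations ListAccountsForParent and ListOrganizationalUnitsForParent APIs; the function name `accounts_under_ou` is hypothetical and `org_client` is assumed to be a boto3 `organizations` client.

```python
def accounts_under_ou(org_client, ou_id: str) -> list[str]:
    """Return IDs of active accounts in `ou_id` and all of its descendant OUs."""
    account_ids = []
    pending = [ou_id]  # OUs (or a root) still to be visited
    accounts_pages = org_client.get_paginator("list_accounts_for_parent")
    ou_pages = org_client.get_paginator("list_organizational_units_for_parent")
    while pending:
        parent = pending.pop()
        # Accounts that are direct children of this parent.
        for page in accounts_pages.paginate(ParentId=parent):
            account_ids += [
                account["Id"]
                for account in page["Accounts"]
                if account["Status"] == "ACTIVE"
            ]
        # Child OUs are queued so nested accounts are found too.
        for page in ou_pages.paginate(ParentId=parent):
            pending += [ou["Id"] for ou in page["OrganizationalUnits"]]
    return account_ids
```

With credentials for the management or a delegated administrator account, `accounts_under_ou(boto3.client("organizations"), "<your OU ID>")` would give the set of accounts where the role is expected; role errors for any other account would be the noise described above. These calls count against Organizations API quotas as well, so this is best run once by hand rather than in a loop.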

davemorrow-telus commented 1 year ago

OK. I will give that a try and simply direct it at the OU I am most interested in for now. Thanks again for your help!

svozza commented 1 year ago

No worries!

davemorrow-telus commented 1 year ago

Updated to 2.1.2 but still seeing rate limit exceeded in workload discovery task

{"msg":"Rate exceeded","stack":"ThrottlingException: Rate exceeded\n at throwDefaultError (/code/node_modules/@aws-sdk/smithy-client/dist-cjs/default-error-handler.js:8:22)\n at /code/node_modules/@aws-sdk/smithy-client/dist-cjs/default-error-handler.js:18:39\n at de_SelectAggregateResourceConfigCommandError (/code/node_modules/@aws-sdk/client-config-service/dist-cjs/protocols/Aws_json1_1.js:3873:20)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async /code/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24\n at async /code/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20\n at async /code/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46\n at async /code/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26\n at async makePagedClientRequest (/code/node_modules/@aws-sdk/client-config-service/dist-cjs/pagination/SelectAggregateResourceConfigPaginator.js:7:12)\n at async paginateSelectAggregateResourceConfig (/code/node_modules/@aws-sdk/client-config-service/dist-cjs/pagination/SelectAggregateResourceConfigPaginator.js:17:20)","level":"error","message":"Error in Discovery process.","timestamp":"2023-11-14T22:18:50.276Z"}
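For long log entries like the one above, a small script can pull out which API call was throttled. This is a rough sketch that assumes the log line is a single JSON object with a `stack` field, and that the stack contains a frame named `de_<Operation>CommandError` as it does in this trace; the function name `summarize_error` is hypothetical.

```python
import json
import re


def summarize_error(log_line: str) -> dict:
    """Extract the error name, AWS operation and timestamp from a JSON log line."""
    record = json.loads(log_line)
    stack = record.get("stack", "")
    # The first line of the stack has the form "<ErrorName>: <message>".
    error_name = stack.split(":", 1)[0] if stack else None
    # In the trace above, the failing operation appears in a frame named
    # de_<Operation>CommandError.
    match = re.search(r"de_(\w+?)CommandError", stack)
    return {
        "error": error_name,
        "operation": match.group(1) if match else None,
        "timestamp": record.get("timestamp"),
    }
```

Applied to the entry above, this points at a `ThrottlingException` raised by the `SelectAggregateResourceConfig` operation of AWS Config rather than by the Organizations API.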

svozza commented 1 year ago

Hmmm, that's very weird. I set the throttling limit for that SelectAggregateResourceConfig API to 8/sec, which is below the TPS limit afaik. Do you know if there are any other services/processes in that account and region using that API?
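For readers unfamiliar with client-side throttling, the idea of capping calls at 8 per second can be sketched as follows. This is not the solution's actual code (the discovery component is a Node.js application, and its implementation is not shown in this thread); the `Throttle` class and its parameters are hypothetical, and the clock and sleep functions are injectable only to make the behaviour easy to check.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls so that at most `rate_per_sec` happen per second."""

    def __init__(
        self,
        rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / rate_per_sec  # minimum gap between calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

A caller would create `throttle = Throttle(8)` and invoke `throttle.wait()` before each request. A limiter like this only governs the process it runs in: if another service in the same account and region calls the same API, the combined rate can still exceed the service quota, which is why the question about other callers matters.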

davemorrow-telus commented 1 year ago

@svozza None that I am aware of. It's literally a brand new account.

davemorrow-telus commented 1 year ago

So in looking at the full log from the container task I notice also that it appears to have not deployed the Role to the management account (WD is deployed in a delegated account). This error appears right before the throttling error.

November 15, 2023 at 07:15 (UTC-5:00) | {"message":"Access denied assuming role: arn:aws:iam::AAAAAAAAAAAAA:role/WorkloadDiscoveryRole-BBBBBBBBBBB. This is the management account, ensure the global resources template has been deployed to the account.","level":"error","timestamp":"2023-11-15T12:15:55.988Z"}

November 15, 2023 at 07:15 (UTC-5:00) | {"level":"info","durationMs":6545,"message":"Time to get accounts","timestamp":"2023-11-15T12:15:55.989Z"}

November 15, 2023 at 07:15 (UTC-5:00) | {"message":"All active accounts from organization unit r-abcd retrieved, 102 retrieved.","level":"info","timestamp":"2023-11-15T12:15:55.754Z"}

svozza commented 1 year ago

That error won't be the cause. Due to the way StackSets in Organizations works, it doesn't deploy stacks to the management account. If you go into the WD UI, you will see a dialog box (in the Accounts page) with a link to the template and instructions to deploy it to the management account. For roughly how many minutes has the discovery process run when it fails?

davemorrow-telus commented 1 year ago

From the "Initializing" message to the "Rate Exceeded" is only about 3 mins according to the logs

svozza commented 1 year ago

Could you do me a favour and email me (my address is in my profile) the account number of the account in question? I'm going to reach out to the Config team to see what's going on here. I wonder whether throttling rates are lower for new accounts or something.

svozza commented 11 months ago

This issue has been fixed in v2.1.3 that was released yesterday.