Closed davemorrow-telus closed 11 months ago
For an org with >100k resources, the discovery process might be running out of memory, as the default is only 2 GB. To check whether that is the case, follow these steps:
OutOfMemoryError: Container killed due to memory usage
You can increase the memory using the Memory CFN parameter.
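As a rough sketch of how that parameter change could be scripted, the helper below builds the Parameters list for a CloudFormation stack update that overrides only Memory and keeps every other parameter at its previous value. Only the Memory parameter name comes from this thread; the "Cpu" name, the value 4096 and the helper itself are illustrative assumptions, so check your own stack for the real parameter names and allowed values.

```python
def build_update_parameters(existing_keys, overrides):
    """Return a CloudFormation Parameters list that changes only `overrides`."""
    unknown = set(overrides) - set(existing_keys)
    if unknown:
        raise KeyError(f"Parameters not defined on the stack: {sorted(unknown)}")
    params = []
    for key in existing_keys:
        if key in overrides:
            # CloudFormation expects parameter values as strings.
            params.append({"ParameterKey": key, "ParameterValue": str(overrides[key])})
        else:
            # Leave every other parameter exactly as it is on the stack today.
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params


# Hypothetical parameter names; only "Memory" is mentioned in this thread.
params = build_update_parameters(["Cpu", "Memory"], {"Memory": 4096})
print(params)
```

A list in this shape can be passed as the Parameters argument of boto3's CloudFormation update_stack call together with UsePreviousTemplate=True, or the same change can be made by hand in the CloudFormation console.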
It's also quite likely you'll need to increase the DB size too. I would recommend using Neptune Serverless as the Neptune DB instance class for the initial load. Once the initial load is done, you can right-size the database based on the DB's CPU usage. There's some info here on how to do that: https://aws-solutions.github.io/workload-discovery-on-aws/workload-discovery-on-aws/2.0/debugging-the-discovery-component.html.
There could also be other issues with the discovery process. Instructions on how to get those logs are in the section of the documentation beginning with the phrase "To retrieve the logs for the discovery component:".
Thanks! This helped at least point me to the error.
The Task is erroring with a "Rate Exceeded" error.
Yes, this is an issue with rate limiting from the AWS Organizations ListAccounts API. I am currently working on a patch to fix this issue, but it will be a few weeks until it is released. I could give you a workaround, but to use it you would have to download the source from GitHub, rebuild the container with the code I've given you and push it to ECR. After that, you would create a new task definition and tell the scheduled task to use it.
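For readers wondering what handling this kind of throttling generally looks like, a common approach is to retry the call with exponential backoff and jitter. The sketch below is a generic illustration under that assumption and is not the patch or workaround mentioned above; ThrottlingError, call_with_backoff and flaky_list_accounts are hypothetical names, and the fake function stands in for a real ListAccounts call.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a 'Rate exceeded' error returned by an AWS API."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on ThrottlingError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Wait a random time up to base_delay * 2^attempt before retrying.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Hypothetical stand-in for ListAccounts that is throttled twice, then succeeds.
calls = {"count": 0}


def flaky_list_accounts():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottlingError("Rate exceeded")
    return ["111111111111", "222222222222"]


delays = []  # Record the delays instead of actually sleeping.
accounts = call_with_backoff(flaky_list_accounts, sleep=delays.append)
print(accounts, len(delays))
```

Injecting the sleep function keeps the example fast to run and makes the chosen delays easy to inspect.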
Don't worry about it. If it's a known issue, I can simply wait for the newer version with the fix in place.
Thanks for helping out! I will keep my eyes out for the newer version.
Question: Can the Org Unit ID be an OU or does it have to be the root ID?
My thinking here is that if we could limit discovery to a smaller OU, I might not hit the rate limiting.
Good stuff. As soon as the release is out, I will ping you in this issue. I'm also going to change the title of this issue to reflect the bug you've found, in case others come searching.
No, it doesn't have to be the root OU, it can be any OU. The only issue is that you will see errors in the logs complaining about the IAM role not being deployed in the accounts in the parent OUs. This is because ListAccounts gives all the accounts in the Org, not just the ones from the non-root OU you've selected. The errors won't interfere with how the discovery process works, they'll just make the logs noisy.
OK. I will give that a try and simply direct it at the OU I am most interested in for now. Thanks again for your help!
No worries!
Updated to 2.1.2 but still seeing rate limit exceeded in workload discovery task
{"msg":"Rate exceeded","stack":"ThrottlingException: Rate exceeded\n at throwDefaultError (/code/node_modules/@aws-sdk/smithy-client/dist-cjs/default-error-handler.js:8:22)\n at /code/node_modules/@aws-sdk/smithy-client/dist-cjs/default-error-handler.js:18:39\n at de_SelectAggregateResourceConfigCommandError (/code/node_modules/@aws-sdk/client-config-service/dist-cjs/protocols/Aws_json1_1.js:3873:20)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async /code/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24\n at async /code/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20\n at async /code/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46\n at async /code/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26\n at async makePagedClientRequest (/code/node_modules/@aws-sdk/client-config-service/dist-cjs/pagination/SelectAggregateResourceConfigPaginator.js:7:12)\n at async paginateSelectAggregateResourceConfig (/code/node_modules/@aws-sdk/client-config-service/dist-cjs/pagination/SelectAggregateResourceConfigPaginator.js:17:20)","level":"error","message":"Error in Discovery process.","timestamp":"2023-11-14T22:18:50.276Z"}
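Log lines like the one above are single JSON objects, so the failing API can be picked out programmatically rather than by reading the stack trace. The following is a minimal sketch, assuming the log format shown in this thread; summarize_error is a hypothetical helper, and the sample is a shortened copy of the line above.

```python
import json
import re


def summarize_error(log_line):
    """Extract the error type and SDK command name from one JSON log line."""
    record = json.loads(log_line)
    stack = record.get("stack", "")
    # The first line of a Node.js stack trace is "<ErrorName>: <message>".
    first_line = stack.split("\n", 1)[0]
    error_type = first_line.split(":", 1)[0] if first_line else None
    # AWS SDK for JavaScript v3 error frames are named de_<Command>CommandError.
    match = re.search(r"de_(\w+?)CommandError", stack)
    return {
        "level": record.get("level"),
        "error_type": error_type,
        "command": match.group(1) if match else None,
        "timestamp": record.get("timestamp"),
    }


# Shortened copy of the log line from this thread (raw strings keep the \n escape).
sample = (
    r'{"msg":"Rate exceeded","stack":"ThrottlingException: Rate exceeded\n'
    r' at de_SelectAggregateResourceConfigCommandError '
    r'(/code/node_modules/@aws-sdk/client-config-service/dist-cjs/protocols/Aws_json1_1.js:3873:20)",'
    r'"level":"error","message":"Error in Discovery process.",'
    r'"timestamp":"2023-11-14T22:18:50.276Z"}'
)

summary = summarize_error(sample)
print(summary)
```

For this sample the helper reports a ThrottlingException raised by the SelectAggregateResourceConfig command, which is a different API from the ListAccounts call discussed earlier in the thread.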
Hmmm, that's very weird. I set the throttling limit for that SelectAggregateResourceConfig API to 8/sec, which is below the TPS limit afaik. Do you know if there are any other services or processes in that account and region using that API?
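To illustrate what a client-side limit of 8 calls per second means in practice, here is a minimal sketch of an evenly spaced rate limiter. It is an assumption-laden illustration and not the code used in Workload Discovery; RateLimiter and FakeClock are hypothetical names, and the fake clock only exists so the example runs instantly.

```python
import time


class RateLimiter:
    """Spaces calls evenly so that at most `rate` calls start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


class FakeClock:
    """Simulated clock: sleeping just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = RateLimiter(8, clock=fake.time, sleep=fake.sleep)
for _ in range(16):
    limiter.wait()  # A real caller would issue one API request after each wait().
elapsed = fake.now
print(elapsed)
```

A limiter like this only controls one process. If anything else in the same account and region calls the same API, the combined rate can still exceed the service limit, which is why the question about other callers matters.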
@svozza None that I am aware of. It's literally a brand new account.
So in looking at the full log from the container task I notice also that it appears to have not deployed the Role to the management account (WD is deployed in a delegated account). This error appears right before the throttling error.
November 15, 2023 at 07:15 (UTC-5:00) | {"message":"Access denied assuming role: arn:aws:iam::AAAAAAAAAAAAA:role/WorkloadDiscoveryRole-BBBBBBBBBBB. This is the management account, ensure the global resources template has been deployed to the account.","level":"error","timestamp":"2023-11-15T12:15:55.988Z"}
November 15, 2023 at 07:15 (UTC-5:00) | {"level":"info","durationMs":6545,"message":"Time to get accounts","timestamp":"2023-11-15T12:15:55.989Z"}
November 15, 2023 at 07:15 (UTC-5:00) | {"message":"All active accounts from organization unit r-abcd retrieved, 102 retrieved.","level":"info","timestamp":"2023-11-15T12:15:55.754Z"}
That error won't be the cause. Due to the way StackSets in Organizations works, it doesn't deploy stacks to the management account. If you go into the WD UI, you will see a dialog box (in the Accounts page) with a link to the template and instructions to deploy it to the management account. For roughly how many minutes has the discovery process run when it fails?
From the "Initializing" message to the "Rate Exceeded" is only about 3 mins according to the logs
Could you do me a favour and email me (my address is in my profile) the account number of the account in question? I'm going to reach out to the Config team to see what's going on here. I wonder whether throttling rates are lower for new accounts or something.
This issue has been fixed in v2.1.3 that was released yesterday.
Describe the bug I've successfully deployed the CFN stacks using AWS Organizations settings.
To Reproduce No Resources. No accounts. Nothing.