aws-solutions / workload-discovery-on-aws

Workload Discovery on AWS is a solution to visualize AWS Cloud workloads. With it you can build, customize, and share architecture diagrams of your workloads based on live data from AWS. The solution maintains an inventory of the AWS resources across your accounts and regions, mapping their relationships and displaying them in the user interface.
https://aws.amazon.com/solutions/implementations/workload-discovery-on-aws/
Apache License 2.0

qKe Request failed with status code 403 #499

Closed sjribe closed 1 month ago

sjribe commented 10 months ago

Describe the bug: When clicking on Resources I get the error "qKe Request failed with status code 403".

To Reproduce:

  1. When logged in as admin, click on Resources under Explore.
  2. An error message appears at the top.
  3. No resources are discovered.

Expected behavior: Resources are listed.

Screenshots: image

Browser: Reproducible on the latest versions of Edge and Chrome.

svozza commented 10 months ago

Open up your browser dev tools and paste any errors you see there into this issue.

sjribe commented 10 months ago

{ "errors" : [ { "errorType" : "WAFForbiddenException", "message" : "403 Forbidden" } ] }

Oh, I think I know now. Where it says "Comma separated list of CIDR ranges to manage access the API. To allow all the entire internet, use 0.0.0.0/1,128.0.0.0/1", does that mean I should allow the whole internet because the solution itself needs to use the internet? If that's true, what's the best way to fix this without having to redo the whole thing?

svozza commented 10 months ago

Yeah, because the Fargate task speaks to AppSync, it needs to access the internet. If you just update the CFN stack and change that parameter back to 0.0.0.0/1,128.0.0.0/1, it will update it and everything will work.
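As a quick sanity check on that parameter value, the sketch below uses Python's standard `ipaddress` module to confirm that the two /1 ranges together cover the whole IPv4 address space. The helper name `covers_all_ipv4` is made up for illustration and is not part of Workload Discovery.

```python
import ipaddress


def covers_all_ipv4(cidr_list: str) -> bool:
    """Return True if the comma-separated CIDR ranges cover every IPv4 address.

    Hypothetical helper for illustration only; not part of Workload Discovery.
    """
    networks = [
        ipaddress.ip_network(cidr.strip())
        for cidr in cidr_list.split(",")
        if cidr.strip()
    ]
    # collapse_addresses merges adjacent or overlapping networks, so full
    # coverage collapses down to the single network 0.0.0.0/0.
    collapsed = list(ipaddress.collapse_addresses(networks))
    return collapsed == [ipaddress.ip_network("0.0.0.0/0")]


print(covers_all_ipv4("0.0.0.0/1,128.0.0.0/1"))  # True
print(covers_all_ipv4("10.0.0.0/8"))             # False
```

Any narrower list has to include the address the discovery task's traffic arrives from, otherwise WAF will keep returning 403.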

sjribe commented 10 months ago

Yea, easy enough. Thanks.

Error's gone but now no resources discovered... different problem I guess...

svozza commented 10 months ago

The discovery task runs every 15 minutes, so won't run for another 5 minutes (assuming you've deployed the CloudFormation to the various accounts you want to import).

sjribe commented 10 months ago

Running it as CrossAccountDiscovery set to AWS_ORGANIZATIONS. So maybe I have the wrong OrganizationUnitId. I used the r- value for the root OU but should it be the o- value of the organization?

svozza commented 10 months ago

No, the r- value will work. Check the ECS logs (don't worry about Lambda) for any errors; instructions are at this link: https://aws-solutions.github.io/workload-discovery-on-aws/workload-discovery-on-aws/2.0/debugging-the-discovery-component.html.

sjribe commented 10 months ago

Thanks. It was the r- value and it did discover some resources. However, going through the debugging steps, I'm getting quite a lot (22 per discovery) of:

{ "error": { "name": "TooManyRequestsException", "$fault": "client", "$metadata": { "httpStatusCode": 429, "requestId": "c36e55ed-eb1c-43e9-8415-b1826ee017e0", "attempts": 4, "totalRetryDelay": 1856 }, "retryAfterSeconds": null }, "level": "error", "message": "Error discovering API Gateway integration for resource: arn:aws:apigateway:us-west-2::/restapis/fqsoha0aq2/resources/89lfy2", "timestamp": "2024-01-17T23:31:05.569Z" }

I'm also getting one of these:

{ "message": "Access denied assuming role: arn:aws:iam::922409771208:role/WorkloadDiscoveryRole-922409771208. This is the management account, ensure the global resources template has been deployed to the account.", "level": "error", "timestamp": "2024-01-17T23:30:37.747Z" }

It is true that I haven't deployed the global resources template, though.
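When there are many entries like these, tallying them by error name makes it easier to see which problem dominates a discovery run. The sketch below assumes the exported log contains one JSON object per line, as in the entries quoted here. The function name `summarize_errors` and the "(no error name)" label are invented for illustration.

```python
import json
from collections import Counter


def summarize_errors(log_lines):
    """Count error-level log entries by error name.

    Assumes one JSON object per line; lines that are not JSON are skipped.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(event, dict) or event.get("level") != "error":
            continue
        error = event.get("error")
        name = error.get("name") if isinstance(error, dict) else None
        counts[name or "(no error name)"] += 1
    return counts


# Abbreviated copies of the two entries from this thread.
sample = [
    '{"error": {"name": "TooManyRequestsException", "$fault": "client"}, '
    '"level": "error", "message": "Error discovering API Gateway integration"}',
    '{"message": "Access denied assuming role", "level": "error"}',
]
print(dict(summarize_errors(sample)))
# {'TooManyRequestsException': 1, '(no error name)': 1}
```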

sjribe commented 10 months ago

So some additional information:

  1. In our Audit account (682880543195), Resource Explorer shows 514 resources.

  2. In our Org account, Resource Explorer filtered to the Audit account shows 513 resources.

  3. In our Org account, Config Aggregators shows OK for 682880543195, and all the Regions show as OK.

  4. But in Config Aggregators, Resources filtered to 682880543195 shows no resources.

So it seems like it's connecting fine but it's not discovering anything there. And maybe there's a security option in the account 682880543195 limiting API calls? But I'm not sure where I would look for that.

svozza commented 10 months ago

In AWS_ORGANIZATIONS mode, Workload Discovery does not manage enablement of Config. We leave that down to customers, as managing deployment of Config is different for every organization based on what they want to monitor and the potential costs incurred by enabling it across a large number of accounts and regions. If one of your accounts doesn't have resources in it, then it means either Config is not enabled in any regions in that account or, as you mentioned, there is some permission error or SCP that is preventing Config from recording resources.

svozza commented 10 months ago

> Thanks. It was the r value and it did discover some resources however going through the debugging I'm getting quite a lot (22 per discovery) of: { "error": { "name": "TooManyRequestsException", "$fault": "client", "$metadata": { "httpStatusCode": 429, "requestId": "c36e55ed-eb1c-43e9-8415-b1826ee017e0", "attempts": 4, "totalRetryDelay": 1856 }, "retryAfterSeconds": null }, "level": "error", "message": "Error discovering API Gateway integration for resource: arn:aws:apigateway:us-west-2::/restapis/fqsoha0aq2/resources/89lfy2", "timestamp": "2024-01-17T23:31:05.569Z" }
>
> I'm also getting 1: { "message": "Access denied assuming role: arn:aws:iam::922409771208:role/WorkloadDiscoveryRole-922409771208. This is the management account, ensure the global resources template has been deployed to the account.", "level": "error", "timestamp": "2024-01-17T23:30:37.747Z" } But it is true I haven't deployed the global resources template

The API errors are because the discovery process is being rate limited when it makes SDK calls to API Gateway. API Gateway limits are account wide (rather than regional), so if there are a large number of API Gateway resources in an account, these sorts of throttling errors are unavoidable.
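For readers unfamiliar with this kind of throttling, the `attempts` and `totalRetryDelay` fields in the logged error suggest the SDK already retried before giving up. The sketch below is a generic illustration of retrying with exponential backoff and jitter. It is not the AWS SDK's or Workload Discovery's actual retry logic, and `ThrottledError`, `call_with_backoff`, and the default delays are assumptions made for the example.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 TooManyRequestsException."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.1,
                      max_delay=20.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Usage: an operation that is throttled twice and then succeeds.
state = {"calls": 0}


def flaky_operation():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ThrottledError()
    return "ok"


print(call_with_backoff(flaky_operation, sleep=lambda seconds: None))  # ok
```

Retries only spread requests out over time. If the account-wide limit is lower than the number of calls a discovery run needs, some calls will still fail after the final attempt, which matches the behaviour described above.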

The IAM error you are seeing is because of the way organization-wide StackSets work: they do not allow you to deploy a stack instance to the management account. In AWS_ORGANIZATIONS mode, the deployment process uses StackSets to deploy the global resources stack on your behalf in all the accounts in your organization. There should be an error dialog box on the Accounts page of the Workload Discovery UI that has a link to the template, which you can manually deploy in the management account using CloudFormation.

sjribe commented 10 months ago

> The API errors are because the discovery process is being rate limited when it makes SDK calls to the API gateway SDK. API Gateway limits are account wide (rather than regional) so if there are a large number of API gateway resources in an account, these sorts of throttling errors are unavoidable.

Is this something that AWS support can temporarily increase or lift? It looks like it's stopping at the same point each time so it's not discovering new resources. Alternatively, if I add each account in manually can I stagger the discovery for each account so as to not trigger the throttle?

> The IAM error you are seeing is because of the way organization wide StackSets work: they do not allow you to deploy a stack instance to the management account. In AWS_ORGANIZATIONS mode, the deployment process uses StackSets to deploy the global resources stack on your behalf in all the accounts in your organization. There should be an error dialog box on the Accounts page of the Workload Discovery UI that has a link to the template that you can manually deploy in the management account using CloudFormation.

I installed the template and so that's sorted now.

svozza commented 10 months ago

> Is this something that AWS support can temporarily increase or lift? It looks like it's stopping at the same point each time so it's not discovering new resources.

Do you mean the discovery process is crashing? Those throttling errors should only affect API Gateway, they should be skipped over and the process should move on to the next set of resources. Can you attach the ECS logs here so I can have a look?

sjribe commented 10 months ago

I don't know if the process is crashing but I do know not all of my resources are being discovered. In the account mentioned before each region shows "Not Discovered" but I know that account has 514 resources across 18 regions according to resource explorer. Or are there default resources in each region and the discovery process is filtering them out? I've attached the ECS logs for the most recent discovery job. log-events-viewer-result.csv

svozza commented 10 months ago

The discovery process is not crashing, but it looks like there are only 1734 resources in the entire aggregator, which seems very low for an organization-wide aggregator. When you say 'resource explorer', do you mean the service or do you mean the resource section in the AWS Config console page? Can you go to the aggregator that WD deployed (it will be called aws-perspective-<wd-region>-<wd-account-id>-aggregator) and run the following query in the advanced queries section:

SELECT * WHERE accountId = '<account-id-with-514-resources>'

Make sure the query scope is the aggregator as per the screenshot:

Screenshot 2024-01-18 at 23 29 50

What results do you see when you run the query?

sjribe commented 10 months ago

Yes, the service, AWS Resource Explorer. The attached screenshot is viewing account 682880543195.

The query looks like it has no output.

svozza commented 10 months ago

The results of the SQL query suggest that the issue is that AWS Config is not enabled in any regions in that account. Try enabling it in us-east-1 of 682880543195 and you should see IAM roles and a few other global resource types when you run that query again (note that it can take several minutes for Config to find the resources after enablement).

If Config doesn't know about a resource there's no way for WD to discover it as we get 90% of our resources from their APIs (under the hood we also use the SQL syntax you are using there for your ad hoc query).
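To see which regions of an account Config is actually recording, the same kind of ad hoc query can be run programmatically and tallied by region. The sketch below assumes a boto3 Config client is passed in and calls its `select_aggregate_resource_config` operation. The function name `count_resources_by_region` is invented for illustration, and this is not how WD itself is implemented.

```python
import json
from collections import Counter


def count_resources_by_region(config_client, aggregator_name, account_id):
    """Tally the resources an aggregator holds for one account, per region.

    config_client is assumed to be a boto3 Config client, for example
    boto3.client("config"), created in the aggregator's account and region.
    """
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    expression = (
        "SELECT resourceId, resourceType, awsRegion "
        f"WHERE accountId = '{account_id}'"
    )
    kwargs = {
        "Expression": expression,
        "ConfigurationAggregatorName": aggregator_name,
    }
    counts = Counter()
    while True:
        response = config_client.select_aggregate_resource_config(**kwargs)
        # Each result is a JSON string holding the selected fields.
        for raw in response.get("Results", []):
            counts[json.loads(raw).get("awsRegion", "unknown")] += 1
        token = response.get("NextToken")
        if not token:
            return counts
        kwargs["NextToken"] = token  # fetch the next page of results


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# counts = count_resources_by_region(
#     boto3.client("config"),
#     "aws-perspective-<wd-region>-<wd-account-id>-aggregator",
#     "682880543195",
# )
# print(dict(counts))
```

An empty result for an account, or a region missing from the tally, points at Config not recording there, which is the situation diagnosed in this thread.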

sjribe commented 9 months ago

Thanks. That's showing up now. Does AWS Config need to be enabled in every region in use or only one per account? For 682880543195 us-east-1 and ap-southeast-2 are in use.

svozza commented 9 months ago

Yeah, it needs to be enabled in each region you're interested in.

sjribe commented 9 months ago

Great. That's solved most of my problems! I do have one account (949247560096) where I've enabled Config in all 17 regions enabled on that account. However, the discovery is only finding resources in 3 regions, and the other regions say "Not Discovered", like when Config was not enabled in that region. Do you know why that would be?

svozza commented 9 months ago

That's strange. Are there any errors in the discovery process logs?

sjribe commented 8 months ago

I think I've sorted it. I did find out that Config was not enabled on the other regions but that the admin account for some reason can't add it to those regions. I've also realized there's only the default stuff in those regions without config so at the moment not necessary.

Is there a way to filter out the default resources?

svozza commented 1 month ago

Sorry, I never saw this follow-up question. When you say 'default stuff', I presume you mean resources in regions where Config is not enabled that we discover using the SDK rather than getting from Config. Unfortunately, there isn't a way to hide those in the current version of WD. I will investigate if there's a way we can do it in an upcoming version, but I'm going to close this issue for now as it's not related to the original problem.