aws-solutions / workload-discovery-on-aws

Workload Discovery on AWS is a solution to visualize AWS Cloud workloads. With it you can build, customize, and share architecture diagrams of your workloads based on live data from AWS. The solution maintains an inventory of the AWS resources across your accounts and regions, mapping their relationships and displaying them in the user interface.
https://aws.amazon.com/solutions/implementations/workload-discovery-on-aws/
Apache License 2.0
727 stars 88 forks

More helpful error messages. #519

Open pardueaws opened 6 months ago

pardueaws commented 6 months ago

Feature name: Meaningful error messages.

Is your feature request related to a problem? Please describe.
A customer has deployed Workload Discovery but is not seeing all of their resources. We have found errors in the GremlinAppSync file (when searching for #500), but the error message is not helpful.

Describe the feature you'd like to see implemented.
Can the errors provide more information about what exactly is failing in the discovery service?

Describe the value this feature will add to AWS Perspective.
This would be helpful when users have problems with discovery.

svozza commented 6 months ago

Errors about missing resources will be in the ECS logs. The instructions for retrieving them are at the bottom of this page, in the section titled "To retrieve the logs for the discovery component": https://aws-solutions.github.io/workload-discovery-on-aws/workload-discovery-on-aws/2.0/debugging-the-discovery-component.html

There is also an extensive flowchart for diagnosing common issues in the troubleshooting section of the README:

(https://aws-solutions.github.io/workload-discovery-on-aws/workload-discovery-on-aws/2.0/debugging-the-discovery-component.html).

Out of interest, has this been deployed in AWS_ORGANIZATION mode? There is a known issue with writes to OpenSearch being dropped on the very first ingestion cycle the discovery process does, which would appear in the UI as missing resources.

pardueaws commented 6 months ago

Yes, AWS_ORGANIZATION mode.

In reply to Stefano Vozza's comment of May 2, 2024, 5:26 PM.

svozza commented 6 months ago

Then it's very likely the last issue I mentioned. To verify:

  1. Identify a resource that is missing, for example an EC2 instance, and get its ARN.
  2. Log into the AppSync console and select the Workload Discovery GraphQL API.
  3. Choose Queries from the side panel.
  4. Choose the Login with User Pools button and authenticate with your WD password.
  5. Execute the following GraphQL query with the ARN from step 1:
    query MyQuery {
      getResourceGraph(ids: ["<your-arn>"]) {
        edges {
          id
        }
        nodes {
          id
        }
      }
    }
  6. Any successful response that is not empty means that the resource is in Neptune but not in OpenSearch. For comparison, an empty response looks like this:
    {
      "data": {
        "getResourceGraph": {
          "edges": [],
          "nodes": []
        }
      }
    }
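
As a rough illustration of steps 5 and 6, the check can be sketched in Python. The helper names `build_query` and `resource_in_neptune` are hypothetical and not part of Workload Discovery. The sketch only builds the query text and interprets a response that has already been parsed from JSON; sending the request to AppSync is left out.

```python
import json

# Query text from step 5, with a placeholder where the ARN goes.
QUERY_TEMPLATE = """query MyQuery {
  getResourceGraph(ids: [__ARN__]) {
    edges {
      id
    }
    nodes {
      id
    }
  }
}"""


def build_query(arn: str) -> str:
    """Return the step 5 query with the ARN inserted as a quoted string."""
    # json.dumps adds the surrounding double quotes and escapes the value.
    return QUERY_TEMPLATE.replace("__ARN__", json.dumps(arn))


def resource_in_neptune(response: dict) -> bool:
    """Return True if getResourceGraph returned any nodes or edges.

    A response with empty "nodes" and "edges" lists (the empty case in
    step 6), or with no data at all, returns False.
    """
    graph = (response.get("data") or {}).get("getResourceGraph") or {}
    return bool(graph.get("nodes") or graph.get("edges"))
```

If `resource_in_neptune` returns True for a resource that does not appear in the UI, that matches the case described in step 6: the resource is in Neptune but not in OpenSearch.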

The simplest way to rectify this is to clear the Neptune database. When the discovery process runs again, it will repopulate both databases:

  1. Log into the Lambda console.
  2. Find the Lambda function that writes to Neptune. It will have a name such as <stack-name>-GremlinAppSyncFunction-<ID-string>.
  3. Select the Test tab and create a test event with the following JSON:
    {
      "arguments": {},
      "source": null,
      "prev": null,
      "info": {
        "parentTypeName": "Mutation",
        "fieldName": "deleteAllResources",
        "variables": {}
      },
      "stash": {}
    }
  4. Execute the test event. Depending on how many resources are in Neptune, the Lambda function may time out, but it should still clear the database.
  5. Wait for the discovery process to run again in 15 minutes and re-ingest the resources.
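
For anyone who prefers a script to the console, steps 3 and 4 might be sketched as follows. This is an untested outline, not an official procedure. `build_delete_all_event` and `invoke_delete_all` are hypothetical helper names, and the sketch assumes boto3 is installed and that AWS credentials with permission to invoke the function are configured. The function name has to be looked up as in step 2.

```python
import json


def build_delete_all_event() -> dict:
    """Return the same test event as the JSON in step 3."""
    return {
        "arguments": {},
        "source": None,
        "prev": None,
        "info": {
            "parentTypeName": "Mutation",
            "fieldName": "deleteAllResources",
            "variables": {},
        },
        "stash": {},
    }


def invoke_delete_all(function_name: str) -> int:
    """Invoke the Gremlin AppSync Lambda function with the delete event.

    Returns the HTTP status code of the invocation.
    """
    # boto3 is third-party, so it is imported here rather than at module
    # level; the event builder above works without it.
    import boto3
    from botocore.config import Config

    # Assumption: a long read timeout and no client-side retries, so that a
    # slow deletion is not cut off and re-sent by the SDK.
    config = Config(read_timeout=900, retries={"max_attempts": 0})
    client = boto3.client("lambda", config=config)
    result = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(build_delete_all_event()).encode("utf-8"),
    )
    return result["StatusCode"]
```

As with the console test event, the function may time out when Neptune holds many resources, so an error here does not necessarily mean the database was not cleared.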