department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

TUD: Setup Monitoring and Alerting for TUD #37660

Closed pjhill closed 2 years ago

pjhill commented 2 years ago

Description

Now that TUD is released to users, we need to set up monitoring and alerting for the TUD instance. Consider where we will route alerts. Should we send alerts to #vsp-testing-tools-team?

Tasks

  1. Install Datadog agent on TUD server
  2. Create a monitor in Datadog that pulls data from the agent
  3. Connect #vsp-testing-tools-team Slack channel to Datadog
  4. Route alerts from TUD Datadog monitor to #vsp-testing-tools-team

Acceptance Criteria

rbeckwith-oddball commented 2 years ago

I followed the steps outlined here in an attempt to set up a canary to monitor TUD.

Creating the canary was simple enough, but debugging is not as straightforward as it could be.

Issues:

  1. Following the instructions to set up the canary, the initial configuration was not able to connect to or resolve tud.vfs.va.gov (likely related to the SOCKS proxy we have to use locally).
  2. I modified the VPC settings of the canary to be in the same subnet as TUD.
     a. This initially resulted in a permission error, since the OverlyPermissiveGenericCanary role lacked the permissions needed to create a network interface.
     b. I modified the role's permissions as shown in the attached inline_policy.png (to be removed after testing if needed).
  3. Modifying the permissions resolved the "ERROR NAME NOT RESOLVED" issue.
  4. The issue that occurs now is that the request times out without ever returning any data.
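
For reference, a minimal sketch of the kind of policy change described in item 2. The actual inline policy is only in the attached screenshot and is not reproduced here; the three actions below are the standard EC2 permissions a Lambda-backed function needs to attach to a VPC, and the name `build_eni_policy` is hypothetical.

```python
import json

# Standard EC2 actions needed to create and clean up a network interface
# inside a VPC. Assumption: the policy actually applied in this thread
# (inline_policy.png) may differ from this list.
VPC_ENI_ACTIONS = [
    "ec2:CreateNetworkInterface",
    "ec2:DescribeNetworkInterfaces",
    "ec2:DeleteNetworkInterface",
]


def build_eni_policy(actions=VPC_ENI_ACTIONS):
    """Return an IAM policy document that allows managing network interfaces."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # "*" is used for brevity in this sketch; tighten where possible.
                "Resource": "*",
            }
        ],
    }


# Render the document as JSON, ready to paste into an inline policy editor.
policy_json = json.dumps(build_eni_policy(), indent=2)
```

Keeping the statement this narrow makes it easy to remove again after testing, as noted in item 2b.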

Additional notes:

I tried using both Node.js and Python to set up the canary, to no avail. Amazon has more comprehensive examples for Node.js than for Python, and the documentation for handling errors in Python / Selenium is rather lacking. This should be a very basic check, but getting a simple monitor for the webpage through the VPC is proving to be more intricate than initially expected.

rbeckwith-oddball commented 2 years ago

Finally got the canary to return successful results (see attached canary.png). We can see that the canary is now mostly successful. A few gateway timeout errors were encountered, but that's actually a good result, since we can see real issues that have occurred.

The problems ended up being:

  1. The VPC
  2. The permissions for the IAM canary user didn't include the network interface permissions
  3. A security group that was specifically allowed to communicate with vets-api containers
  4. Porting the puppeteer code over to AWS Synthetics made the SSL issue a serious pain point.
     a. I finally found obscure AWS documentation that allowed me to connect to the server despite its self-signed certificate.
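
The fix in item 4 was made in puppeteer on AWS Synthetics and is not shown in this thread. As a rough analogue only, here is how the same idea (accepting a self-signed certificate for a monitoring check) might look in plain Python with the standard library. The function names are hypothetical, and skipping verification is only reasonable for an internal endpoint you already trust.

```python
import ssl
import urllib.request


def build_insecure_context():
    """Return an SSL context that accepts self-signed certificates."""
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, skipping certificate verification."""
    with urllib.request.urlopen(
        url, timeout=timeout, context=build_insecure_context()
    ) as resp:
        return resp.status
```

From inside the VPC, a call such as `fetch_status("https://tud.vfs.va.gov")` would be the equivalent of the basic page check; it is not run here because the host is not publicly resolvable.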

Our documentation says to set up a PagerDuty alert, but I'm not sure that this would be considered important enough to receive PagerDuty alerts.

rbeckwith-oddball commented 2 years ago

This is the PR for the canary after following the directions outlined here.

rbeckwith-oddball commented 2 years ago

I cleaned up all the resources associated with setting up the canary for testing:

The instructions then say to make a PR, and the canary will be created for us through the terraform code.

Problem:

Using the aws-canary-scripts will indeed create the canary, but the configuration / settings that will get created with it will not allow the new canary to function because:

Possible solutions:

  1. Use the terraform script to create the canary, then manually modify it to use the appropriate VPC, subnet, and security groups
  2. Modify the terraform script(s) to allow us to specify VPC, subnet, and security groups
  3. Recreate the canary script manually and use as a temporary monitoring solution until we migrate our monitors to datadog
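
As an illustration of option 1, a minimal sketch of the VPC settings that would need to be applied to the canary after terraform creates it. The helper name and the placeholder IDs are hypothetical; the dictionary keys follow the shape of the CloudWatch Synthetics `UpdateCanary` request (`Name`, `VpcConfig.SubnetIds`, `VpcConfig.SecurityGroupIds`).

```python
def build_vpc_update(canary_name, subnet_ids, security_group_ids):
    """Build UpdateCanary arguments that move a canary into a VPC."""
    if not subnet_ids or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    return {
        "Name": canary_name,
        "VpcConfig": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }


# Placeholder values for illustration only; real IDs are not given in this thread.
update_args = build_vpc_update(
    "tud-canary", ["subnet-EXAMPLE"], ["sg-EXAMPLE"]
)
```

Assuming boto3 is installed and credentials are configured, the result could be passed along as `boto3.client("synthetics").update_canary(**update_args)`. The same three values are what option 2 would need to expose as terraform variables.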

pjhill commented 2 years ago

Thanks for that write-up. I think it's worth getting something in place now before we move to Datadog. Are there examples of utilities in the same VPC that are already set up with monitors? I'm thinking this is already a solved problem for the Platform.

rbeckwith-oddball commented 2 years ago

Attached are the canaries that currently exist: current_canaries.png

  1. vagov-search - goes to www.va.gov, clicks search, then fails attempting to click user input. (doesn't appear to have ever worked)
  2. democanary is recent and I have never seen it functional
  3. dslogon_status simply checks the status of http://api.va.gov/v0/backend_statuses
  4. data-fetch doesn't have S3 bucket permissions correctly set, but the code pulls up staging.va.gov and logs in with the provided username and password

None of the existing canaries appear to connect to any internally accessible endpoints. Internally accessible endpoints could be monitored with a different tool, but I don't see anything set up to use canaries.
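
For a basic status check like dslogon_status, it helps to tell apart the failure modes seen earlier in this thread (name not resolved, request timeout, gateway timeout). A minimal sketch using only the Python standard library follows; the function names and result labels are hypothetical, and this is not the code of any existing canary.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised by urlopen to a short failure label."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"
    # Connection-phase errors arrive wrapped in URLError.reason;
    # read-phase timeouts are raised directly.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, socket.gaierror):
        return "name_not_resolved"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    return "other"


def check(url, timeout=10.0):
    """Return 'http_<status>' on a response, or a failure label otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"http_{resp.status}"
    except OSError as exc:  # URLError, HTTPError, and timeouts are all OSErrors
        return classify_failure(exc)
```

With labels like these, a gateway timeout (`http_504`) from the server is distinguishable from a canary that cannot reach the host at all, which was the harder problem to debug above.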

pjhill commented 2 years ago

Ok, maybe let's park the effort to make Canaries work and pivot to Datadog in hopes that there are no road blocks there.

JoeTice commented 2 years ago

@rbeckwith-oddball - Please post updates for any progress you made yesterday. Also, unless that progress revealed a new option, please include a list of tasks outlining the approach you'd take to begin looking at Option #3 above: "Recreate the canary script manually and use as a temporary monitoring solution until we migrate our monitors to datadog"

When that list of tasks is done, we can discuss re-estimating the points on this task.

rbeckwith-oddball commented 2 years ago

Option #3 details:

pjhill commented 2 years ago

@rbeckwith-oddball -- What's the detailed status of this? All of the acceptance criteria are ticked complete. Are there any remaining tasks that were discovered during the course of working this?

pjhill commented 2 years ago

This is complete.