pjhill closed 2 years ago
I followed the steps outlined here in an attempt to set up a canary to monitor tud.
Creating the canary was simple enough, but debugging is not as straightforward as it could be.
tud.vfs.va.gov
(likely related to the SOCKS issue we have to utilize locally)
tud
a. Initially resulted in a permission error since the OverlyPermissiveGenericCanary lacked the necessary permissions to create a network interface.
b. I modified the role's permissions as follows (to be removed after testing if needed)
I tried using both nodejs and python to set up the canary, to no avail. Amazon has more comprehensive examples for Node than Python, and the documentation for handling errors in Python / Selenium is rather lacking. This should be a very basic check, but getting a simple monitor for the webpage through the VPC is proving to be more intricate than initially expected.
Finally got the canary to return successful results. The canary is now mostly successful. It did encounter a few gateway timeout errors, but that's actually a good result since it shows us real issues that have occurred.
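To make the "mostly successful, with a few gateway timeouts" result concrete, here is a minimal sketch of how a series of canary runs could be summarized. It is not the canary code itself. The function names and the sample run history are hypothetical, chosen only for illustration.

```python
from collections import Counter


def classify_status(status_code):
    """Map an HTTP status code to a coarse canary outcome."""
    if 200 <= status_code < 400:
        return "success"
    if status_code == 504:
        # Gateway timeouts are tracked separately because they point at
        # a real upstream problem rather than a broken canary.
        return "gateway_timeout"
    return "failure"


def summarize(status_codes):
    """Count outcomes across a series of canary runs."""
    return Counter(classify_status(code) for code in status_codes)


# Hypothetical run history: mostly 200s with a couple of 504s.
runs = [200, 200, 504, 200, 200, 504, 200]
summary = summarize(runs)
print(dict(summary))
```

Keeping 504s in their own bucket makes it easier to tell "the site had a real outage" apart from "the canary is misconfigured".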
The problems ended up being:
puppeteer code over to AWS Synthetics made the SSL issue a serious pain point.
a. Finally found obscure documentation on AWS that allowed me to connect to the server despite its self-signed certificate.
Our documentation says to set up a PagerDuty alert, but I'm not sure that this would be considered important enough to receive PagerDuty alerts.
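For anyone hitting the same self-signed certificate problem outside of Synthetics, the general idea can be sketched with the Python standard library alone. This is not the AWS Synthetics or puppeteer setting referred to above (that detail isn't recorded in this thread). The helper names are hypothetical, and disabling verification is only reasonable for a known internal host.

```python
import ssl
import urllib.error
import urllib.request


def make_unverified_context():
    """Build an SSL context that accepts self-signed certificates."""
    ctx = ssl.create_default_context()
    # check_hostname has to be turned off before verify_mode can be relaxed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def check_endpoint(url, timeout=10.0):
    """Return the HTTP status code for url, skipping certificate checks."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(
            request, timeout=timeout, context=make_unverified_context()
        ) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (e.g. a 504) arrive as exceptions; report the code.
        return err.code
```

A call such as `check_endpoint("https://tud.vfs.va.gov")` would only work from inside the VPC, and the `https` scheme there is an assumption.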
I cleaned up all the resources associated with setting up the canary for testing:
The instructions then say to make a PR, and the canary will be created for us through the terraform code.
Using the aws-canary-scripts will indeed create the canary, but the configuration / settings that will get created with it will not allow the new canary to function because:
vets-api
datadog
Thanks for that write-up. I think it's worth getting something in place now before we move to datadog. Are there examples of utilities in the same VPC that are already set up with monitors? I'm thinking this is already a solved problem for the Platform.
Attached are the canaries that currently exist:
http://api.va.gov/v0/backend_statuses
staging.va.gov
and logs in with provided username and password
None of the existing canaries appear to connect to any internally accessible endpoints. Internally accessible endpoints could be monitored with a different tool, but I don't see anything set up to use canaries.
Ok, maybe let's park the effort to make Canaries work and pivot to Datadog in hopes that there are no roadblocks there.
@rbeckwith-oddball - Please post updates for any progress you made yesterday. Also, unless the progress revealed a new option, please include a list of tasks that would outline the approach you'd take to begin looking at Option #3 above: "Recreate the canary script manually and use as a temporary monitoring solution until we migrate our monitors to datadog"
When that list of tasks is done, we can discuss re-estimating the points on this task.
Option #3 details:
DataDog driven
Slack alert for the canary to alert us if tud is indeed having issues.
@rbeckwith-oddball -- What's the detailed status of this? All of the acceptance criteria are ticked complete. Are there any remaining tasks that were discovered during the course of working this?
This is complete.
Description
Now that TUD is released to users we need to set up monitoring and alerting for the TUD instance. Consider where we will route alerts. Let's send alerts to #vsp-testing-tools-team???
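As a rough illustration of routing an alert to a Slack channel, the sketch below builds and posts a message through a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message wording, and the webhook URL are assumptions. The target channel is determined by how the webhook is configured, not by the code.

```python
import json
import urllib.request


def build_slack_payload(monitor_name, status, detail):
    """Build the JSON body for a Slack incoming-webhook message."""
    text = f":rotating_light: {monitor_name} is {status}: {detail}"
    return {"text": text}


def send_slack_alert(webhook_url, payload, timeout=10.0):
    """POST the payload to the webhook and return the HTTP status code."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical alert content; nothing is sent until send_slack_alert is called.
payload = build_slack_payload("tud canary", "failing", "504 Gateway Timeout")
print(json.dumps(payload))
```

Separating payload construction from sending keeps the message format testable without a network call or a real webhook URL.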
Tasks
Acceptance Criteria