Open rmtolmach opened 3 months ago
PR is up. I added a task to the description for making a Log Monitor in Datadog. My initial assumption that we could ignore bad service ids won't work because the PagerDuty API receives the ids in bulk. I talked over a few possible solutions with Eric and landed on this one: catch the error and alert on it. I need to clean up the monitor a bit on Monday and get a review of the PR + monitor.
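For readers following along, the catch-and-alert approach described above could look roughly like the sketch below. It is not the code from the PR: `ServiceNotFound`, `fetch_maintenance_windows`, `poll_maintenance_windows`, and the choice to return an empty list on failure are all hypothetical stand-ins, and the "alert" is assumed to be a log line that a Datadog Log Monitor matches on.

```ruby
require 'logger'

# Hypothetical error class standing in for whatever the HTTP client raises
# when PagerDuty rejects the bulk request.
class ServiceNotFound < StandardError; end

# Hypothetical stand-in for the bulk PagerDuty call: one bad id fails the
# whole request, mirroring the behaviour described in this issue.
def fetch_maintenance_windows(service_ids, known_ids)
  missing = service_ids - known_ids
  raise ServiceNotFound, "Service Not Found: #{missing.join(', ')}" unless missing.empty?

  service_ids.map { |id| { service_id: id } }
end

# Catch the error and log it so a log monitor can alert on the message,
# instead of letting the job raise.
def poll_maintenance_windows(service_ids, known_ids, logger)
  fetch_maintenance_windows(service_ids, known_ids)
rescue ServiceNotFound => e
  logger.error("PagerDuty maintenance window poll failed: #{e.message}")
  [] # assumption: no windows are returned for this run
end

logger = Logger.new($stdout)
poll_maintenance_windows(%w[PY7573H PXXXXXX], %w[PY7573H], logger)
```

The job no longer raises, but the failure stays visible through the logged message, which is what the monitor would key on.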
User Story
As a developer on VA.gov, I want to delete a PagerDuty service without breaking maintenance windows for everyone.
Issue Description
The Platform was notified via a support request that maintenance windows in Staging weren't working anymore. After some research, we determined that if there was a service ID listed that was bad (i.e. didn't exist, went to a 404 page), `PollMaintenanceWindows` would fail and no maintenance windows would be set for any service, effectively breaking maintenance windows in the environment it was removed from.

PR https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3027 resolved the problem by removing the two bad IDs, but there is nothing stopping this from happening again. If someone deletes a PagerDuty Service that is also listed in values.yml (search for `maintenance:` and the list of `services` is nested under there), ALL maintenance windows in that env will stop working (this would be really bad for prod).

It took a while to figure out the issue because errors were obfuscated. `VA900` is a generic error that doesn't mean anything. In order to debug we changed `conn.response :raise_custom_error, error_prefix: service_name` to `conn.response :raise_error, error_prefix: service_name, include_request: true` in `lib/pagerduty/configuration.rb`, which uncovered the real error: `:body=>{"error"=>{"message"=>"Service Not Found", "code"=>5002}`
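Once the real body is visible, it is easy to turn it into a readable message for logs or alerts. A minimal sketch, assuming the body arrives as a parsed Hash with string keys as in the excerpt above; `pagerduty_error_summary` is a hypothetical helper, not something that exists in the codebase.

```ruby
# Builds a short, human-readable summary from a parsed PagerDuty error body.
# Falls back to a generic message when the body has an unexpected shape.
def pagerduty_error_summary(body)
  error = body.is_a?(Hash) ? body['error'] : nil
  return 'unknown PagerDuty error' unless error.is_a?(Hash)

  "#{error['message']} (code #{error['code']})"
end

# Sample body taken from the error quoted in this issue.
sample = { 'error' => { 'message' => 'Service Not Found', 'code' => 5002 } }
puts pagerduty_error_summary(sample) # prints "Service Not Found (code 5002)"
```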
Note: a service id corresponds to the id in the PagerDuty URL (`PY7573H` and https://dsva.pagerduty.com/service-directory/PY7573H, for example).

Tasks
Acceptance Criteria
[ ] Pass validation on to BE eng.
Validation
Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.
1. Check out `master`.
2. In the `services` section nested under `maintenance` in `config/settings.yml`, change an id to make it invalid (`PXXXXXX` for example).
3. Run `PagerDuty::PollMaintenanceWindows.new.perform`.
Expected outcome: No error in the console.