department-of-veterans-affairs / va.gov-team


Deleted PagerDuty services cause maintenance windows to stop working #89547

Open rmtolmach opened 3 months ago

rmtolmach commented 3 months ago

User Story

As a developer on VA.gov, I want to delete a PagerDuty service without breaking maintenance windows for everyone.

Issue Description

The Platform was notified via a support request that maintenance windows in Staging had stopped working. After some research, we determined that if any listed service ID was bad (i.e. the service didn't exist and its URL went to a 404 page), PollMaintenanceWindows would fail and no maintenance windows would be set for any service. A single bad ID therefore breaks maintenance windows for the whole environment the service was removed from.

PR https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3027 resolved the immediate problem by removing the two bad IDs, but nothing stops this from happening again. If someone deletes a PagerDuty service that is still listed in values.yml (search for maintenance: and the list of services is nested under it), ALL maintenance windows in that env will stop working (this would be really bad for prod).

It took a while to figure out the issue because errors were obfuscated.

irb(main):002> PagerDuty::PollMaintenanceWindows.new.perform
{"host":"vets-api-web-5bb6ddc48b-md425","application":"vets-api-server","environment":"production","timestamp":"2024-07-26T15:44:45.108318Z","level":"error","level_index":4,"pid":21323,"thread":"12320","file":"/app/lib/common/client/base.rb","line":110,"named_tags":{"dd":{"env":"eks-staging","service":"vets-api","version":"0e7023936b369513a1ac69b3808b1d8cf06ce531","trace_id":"0","span_id":"0"},"ddsource":"ruby"},"name":"Rails","message":"BackendServiceException: {:status=>400, :detail=>nil, :code=>\"VA900\", :source=>nil}","payload":{"title":"Operation failed","detail":"Operation failed","code":"VA900","status":"400"}}

VA900 is a generic error code that says nothing about the underlying cause. To debug, we changed conn.response :raise_custom_error, error_prefix: service_name to conn.response :raise_error, error_prefix: service_name, include_request: true in lib/pagerduty/configuration.rb, which uncovered the real error: :body=>{"error"=>{"message"=>"Service Not Found", "code"=>5002}

Note: a service ID corresponds to the ID in the PagerDuty URL (for example, PY7573H in https://dsva.pagerduty.com/service-directory/PY7573H).
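Because the failure doesn't say which ID is bad, one way to narrow it down is to check each listed ID on its own. The sketch below is only an illustration: missing_service_ids and service_exists are hypothetical names, not part of vets-api, and the lookup is stubbed so the example runs without network access. In practice the callable would wrap a per-service lookup against PagerDuty.

```ruby
# Hypothetical helper, not part of vets-api. `service_exists` is any callable
# that takes a service ID and returns true when PagerDuty still has that
# service (for example, by wrapping a single-service lookup).
def missing_service_ids(service_ids, service_exists)
  service_ids.reject { |id| service_exists.call(id) }
end

# Stubbed lookup so the sketch runs offline. Treating PY7573H as an existing
# service is an assumption made only for this example.
existing = %w[PY7573H]
service_exists = ->(id) { existing.include?(id) }

bad_ids = missing_service_ids(%w[PY7573H PXXXXXX], service_exists)
puts "Missing PagerDuty services: #{bad_ids.join(', ')}" unless bad_ids.empty?
```

Checking IDs one at a time costs one lookup per service, so this is better suited to debugging or a periodic check than to the polling job itself.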

Tasks

Acceptance Criteria

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

  1. After the PR is merged, pull in the latest code from master.
  2. In the services section nested under maintenance in config/settings.yml, change an ID to make it invalid (PXXXXXX, for example).
  3. In the Rails console, run PagerDuty::PollMaintenanceWindows.new.perform

Expected outcome: No error in the console.

rmtolmach commented 1 day ago

PR is up. I added a task to the description for making a Log Monitor in Datadog. My initial assumption that we could ignore bad service IDs won't work because the PagerDuty API receives the IDs in bulk. I talked over a few possible solutions with Eric and landed on this one: catch the error and alert on it. I need to clean up the monitor a bit on Monday and get a review of the PR + monitor.
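For anyone following along, here is a rough sketch of the catch-and-alert idea. It is not the code in the PR. PagerDutyRequestError, ALERT_PHRASE, and poll_maintenance_windows are hypothetical names, and the fetcher is stubbed to fail the way a deleted service would. The point is that the job logs a fixed phrase a log monitor could match on, instead of raising.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the error the real PagerDuty client raises.
class PagerDutyRequestError < StandardError; end

# Hypothetical fixed phrase that a log monitor could match on.
ALERT_PHRASE = 'PagerDuty maintenance window poll failed'

# `fetch_windows` is any callable that takes the full list of service IDs
# and returns the maintenance windows from one bulk request.
def poll_maintenance_windows(service_ids, fetch_windows, logger)
  fetch_windows.call(service_ids)
rescue PagerDutyRequestError => e
  # The bulk request fails as a whole, so log the alert phrase and return nil
  # to tell the caller there is no result to act on.
  logger.error("#{ALERT_PHRASE}: #{e.message} (service ids: #{service_ids.join(', ')})")
  nil
end

# Usage with a stubbed fetcher that fails like a deleted service would.
log_output = StringIO.new
logger = Logger.new(log_output)
failing_fetch = ->(_ids) { raise PagerDutyRequestError, 'Service Not Found' }

result = poll_maintenance_windows(%w[PY7573H PXXXXXX], failing_fetch, logger)
```

Returning nil rather than an empty list lets the caller tell "the poll failed" apart from "there are no maintenance windows", so it can skip updating anything on failure.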