department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
281 stars 198 forks source link

Refactor Error Handling In Appointments Response #90026

Open cferris32 opened 1 month ago

cferris32 commented 1 month ago

We have recently started seeing Datadog error logs pop up where the logs aren't being parsed correctly.

A note from Joe Niquette on the issue: "I am seeing some errors from what appears to be the appointments MAP integration where the logs arent being parsed correctly. Looks like it started about 5 days ago: https://vagov.ddog-gov.com/logs?query=%28%28SecurityToken%20service%20success%20env%3A[…]=stream&from_ts=1722862610492&to_ts=1722877010492&live=true"

More details from an Identity Team Engineer: "It’s hard to say without being familiar with their service, but I’d say they rescuing exceptions in the get_appointments method would be a good start. Right now, they don’t rescue any errors in that method; instead, they merge any errors they receive into the value for meta which is included in the return value (a hash) of the method:

{
  data: deserialized_appointments(appointments),
  meta: pagination(pagination_params).merge(partial_errors(response))
}

I would refactor that method to rescue specific exceptions (similar to how we do here), log a specific message for that exception, and then reraise the error or raise an error class specific to their service). They’d need to also refactor their controller to rescue that error and return the error response."

Here is their service.rb as an example for reference.

We would like to refactor our error handling to align with these practices to clean up the logs and offer better visibility into errors we receive.

AC: [ ] - appointments response error handling checks for and rescues from specific categories [ ] - tests updated

JunTaoLuo commented 5 days ago

Although I have some work in progress for this ticket, testing the changes has turned out to be a bit trickier than expected and I think it will conflict with the changes that are being made for unrelated backend tickets 91164 and 90790. I want to be realistic with the timelines and say we are likely to move this ticket into the next sprint (Sprint 16) since it's unlikely to be completed before the end of the sprint.