department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Document procedure for handling "expired" submission to Lighthouse Benefits Intake API #92279

Open humancompanion-usds opened 1 week ago

humancompanion-usds commented 1 week ago

On December 18, 2023, we had an incident that caused 2,627 submissions to the Lighthouse Benefits Intake API to "expire". We performed a post-mortem, but we may have failed to tell teams using the Benefits Intake API what next steps they should take to handle their failed submissions. As a result, in the case of the Veterans Facing Forms team, some of these submissions (33) became silent errors that were not properly processed. Here is the Slack thread for this incident. There was an expectation that these expirations would be retried, but that can't happen (Lighthouse doesn't have what it needs to retry). Expiration happens (from my layman's understanding) when we request a UUID from the Benefits Intake API but then fail to deliver our intended payload (a PDF). This happens when we hit an error in generating the PDF.

It would be great to get more eyes on this post-mortem and to develop a standard procedure teams can follow, with clear instructions on how to handle the "expired" cases. Also, I'm wondering whether there is a bit of a race condition here: we submit a request to get a UUID from Lighthouse, then we fail to deliver the PDF, and then the UUID just hangs out until it expires. Ideally, wouldn't we let Lighthouse know that the UUID is never going to complete and can be ignored (given that we've already returned an error to the user in this case)?

humancompanion-usds commented 4 days ago

@Thrillberg - Can you confirm that you think we just need to retry expired cases?

After a POST request, there is a 15-minute window during which documents must be uploaded via a PUT request.
- An Expired status means the documents were not successfully uploaded within this 15-minute window.
- We recommend coding to retry unsuccessful uploads within 15 minutes using the same submission in case of connection issues.

Is there any other error scenario where a retry would not work?
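
For discussion, here is a rough Python sketch of the retry behavior described in the quoted guidance above: keep retrying the PUT for the same submission until it succeeds or the 15-minute window closes. Everything named here is hypothetical (`upload_with_retry`, `put_documents`, `UploadWindowExpired`, the backoff values); it does not reflect the actual Benefits Intake client code, and the real HTTP call is left to the caller. It only helps with transient connection problems. If the PDF was never generated, there is nothing to retry with.

```python
import time
from typing import Callable

UPLOAD_WINDOW_SECONDS = 15 * 60  # the 15-minute window from the quoted guidance


class UploadWindowExpired(Exception):
    """Hypothetical error: the upload did not succeed inside the window."""


def upload_with_retry(
    put_documents: Callable[[], bool],
    requested_at: float,
    *,
    window_seconds: float = UPLOAD_WINDOW_SECONDS,
    base_delay: float = 1.0,   # assumed backoff values, not from the API docs
    max_delay: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call put_documents until it returns True or the window closes.

    put_documents is a caller-supplied stand-in for the PUT request against
    the same submission; requested_at is the clock() value taken when the
    initial POST was made. Returns the number of attempts made.
    """
    deadline = requested_at + window_seconds
    delay = base_delay
    attempts = 0
    while clock() < deadline:
        attempts += 1
        try:
            if put_documents():
                return attempts
        except ConnectionError:
            pass  # treated as transient: fall through and retry
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Exponential backoff, never sleeping past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
    # Surface the failure loudly so it cannot become a silent error.
    raise UploadWindowExpired(f"no successful upload after {attempts} attempt(s)")
```

Raising an explicit error once the window closes is a deliberate choice in this sketch: the calling team gets a clear signal to notify the user or start over, rather than leaving the UUID to expire unnoticed.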