department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

VR&E - check for cause of db save validation issues #83680

Open va-albers opened 3 months ago

va-albers commented 3 months ago

Determine and, if appropriate, eliminate the sources of errors for v0::veteran_readiness_employment_claims_controller.

The "VRE API Errors" monitor shows a number of 4xx errors/responses from the API in the past month: image

The code in question is app/controllers/v0/veteran_readiness_employment_claims_controller.rb

If you click on the errors in the errors widget of this service view, you see that they are all Common::Exceptions::ValidationErrors: Validation error, which supports the assertion that the errors come from lines 18-20 in the code.

image

If you look one of these entries up in Sentry you see a number of issues such as:

Goal:

va-albers commented 3 months ago

This is for Datadog monitor DD-186965

va-albers commented 3 months ago

List of possibly related Sentry items: link

va-albers commented 3 months ago

Datadog APM traces related to this issue: all of them show as Common::Exceptions::ValidationErrors.

va-albers commented 3 months ago

Related Slack thread here

va-albers commented 3 months ago

Note the current monitor already ignores 401 status responses.

tblackwe commented 3 months ago

@va-albers what is the priority of this issue?

va-albers commented 3 months ago

@tblackwe this isn't a silent failure, so that reduces the priority. If it were up to me, I would prioritize understanding the validation errors first, then separately assess how much time it would take to fix them.

Also I will be pulling this out of the Watch Officer rotation since the errors follow a (flawed?) backup process.

micahaspyr commented 3 months ago

Looking at Sentry, I'm seeing that this is coming from multiple sources that are also hitting the vre controller.

Image

It should be noted, however, that these validation errors are not unique to the vre controller. This happens in other controllers as well, such as the caregivers_assistance_claims controller, sentry log here

Some of these include other validation failures as well, so I'm wondering if we're seeing a larger issue on the grander scale

micahaspyr commented 3 months ago

> List of possibly related Sentry items: link

@va-albers according to this sentry log, none of these claim submissions have a user that is signed in, which is required for VR&E. See below

Image

I think this could be related to a session timing out when the claim is submitted. We should check to see if the Veteran is logged in before trying to even save the claim.
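A minimal sketch of that guard, in plain Ruby rather than the real controller. `Claim`, `NotAuthenticated`, `ValidationFailed`, and `create_claim` are hypothetical names used only for illustration; the point is that a missing user is rejected before the claim is ever built, so an expired session would surface as an authentication error instead of a validation error.

```ruby
# Hypothetical stand-in for the saved claim model (not the real class).
Claim = Struct.new(:form, :user_uuid) do
  # Pretend "save" succeeds only when some form data is present.
  def save
    !form.nil? && !form.empty?
  end
end

# Hypothetical error classes so the two failure modes stay distinguishable.
class NotAuthenticated < StandardError; end
class ValidationFailed < StandardError; end

# Check for a signed-in user first; only then build and save the claim.
def create_claim(current_user, form_json)
  raise NotAuthenticated, 'session expired or user not signed in' if current_user.nil?

  claim = Claim.new(form_json, current_user[:uuid])
  raise ValidationFailed, 'claim could not be saved' unless claim.save

  claim
end
```

With this ordering, a request that arrives after the session has timed out never reaches the save step, which would make it easier to tell the two cases apart in Sentry.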

micahaspyr commented 2 months ago

Sentry shows no recurrence of these errors since June 3rd, as highlighted in blue on the right side of the screen capture below.

Image

va-albers commented 2 months ago

> Looking at Sentry, I'm seeing that this is coming from multiple sources that are also hitting the vre controller.
>
> Image
>
> It should be noted, however, that these validation errors are not unique to the vre controller. This happens in other controllers as well, such as the caregivers_assistance_claims controller, sentry log here
>
> Some of these include other validation failures as well, so I'm wondering if we're seeing a larger issue on the grander scale

For the purposes of this ticket you only need to worry about VRE errors.

tblackwe commented 2 months ago

We do not have an answer for this problem. It is NOT a silent error, as the Veteran gets an error displayed, but we can't answer why this is happening. We can add the InProgressForm ID to the logs to try to improve our understanding.
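As a rough sketch of what that extra logging could look like, using only the Ruby standard library. The field names (`in_progress_form_id`, `user_signed_in`, `errors`) and the helper `log_claim_failure` are assumptions for illustration, not the fields the application actually emits.

```ruby
require 'json'
require 'logger'
require 'stringio'

# Writes one structured (JSON) log line for a failed claim save.
# in_progress_form_id may be nil when no saved form exists for the user.
def log_claim_failure(logger, in_progress_form_id:, user_signed_in:, errors:)
  payload = {
    message: 'VR&E claim was not saved',
    in_progress_form_id: in_progress_form_id,
    user_signed_in: user_signed_in,
    errors: errors
  }
  logger.error(payload.to_json)
end

# Usage: log to an in-memory buffer so the output can be inspected.
io = StringIO.new
logger = Logger.new(io)
# Emit only the message so each line is plain JSON.
logger.formatter = proc { |_severity, _time, _progname, msg| "#{msg}\n" }
log_claim_failure(logger, in_progress_form_id: 123, user_signed_in: false,
                          errors: ['form is invalid'])
```

Keeping the ID as its own field, rather than interpolating it into the message string, should make it possible to filter and group on it in the log tooling.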

tblackwe commented 2 months ago

image

tblackwe commented 1 month ago

@va-albers Looking at these DD logs, it appears to me that the issue is concentrated on a specific pod.

https://vagov.ddog-gov.com/logs?query=%40name%3A%22V0%3A%3AVeteranReadinessEmploymentClaimsController%22%20%40payload.status%3A422%20&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice%2C%40payload.status_message&fromUser=true&messageDisplay=inline&refresh_mode=sliding&storage=online_archives&stream_sort=%40payload.status_message%2Casc&viz=stream&from_ts=1713299083766&to_ts=1721075083766&live=true

We can't find any reason this would be a code problem within VR&E.

tblackwe commented 1 month ago

It isn't all requests to that pod, though:
All Errors on a day
All requests on the pod on that day
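One way to put numbers on the "one pod, but not every request" observation is to compare failed and total requests per host from an exported set of log rows. This is only a sketch: the row shape (`:host`, `:status`), the pod names, and the helper `error_rate_by_host` are made up for illustration, and 422 is used because that is the status the log query above filters on.

```ruby
# Groups log rows by host and reports total requests, failed requests,
# and the failure rate for each host.
def error_rate_by_host(entries, error_status: 422)
  entries.group_by { |entry| entry[:host] }.transform_values do |rows|
    failed = rows.count { |row| row[:status] == error_status }
    { total: rows.size, failed: failed, rate: failed.fdiv(rows.size).round(3) }
  end
end

# Made-up sample rows; real rows would come from a log export.
sample = [
  { host: 'pod-a', status: 200 },
  { host: 'pod-a', status: 422 },
  { host: 'pod-b', status: 200 },
  { host: 'pod-b', status: 200 }
]
rates = error_rate_by_host(sample)
```

A host whose rate is well above the others, but below 1.0, would match what the two log views show.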

scottsdevelopment commented 4 weeks ago

Is the data validating?

I've been looking at the logs and the conversation in this ticket. In addition to what everyone else has observed, we ran a query to validate the last 90 days of in-progress forms on production against the veteran readiness claim class, matching on form ID. Nothing out of the ordinary in the data currently held in the in-progress forms seems to cause this error when it is saved.


# Map each in-progress form ID to the claim class used to validate it.
FORM_ID_MAPS = {
  # '21-526EZ' => Pensions::SavedClaim,
  '28-1900' => SavedClaim::VeteranReadinessEmploymentClaim
}.freeze

statistics = {
  passed: 0,
  failed: 0,
  errors: {}
}

ninety_days_ago = 90.days.ago

FORM_ID_MAPS.each do |form_id, form_class|
  # find_each loads records in batches instead of all at once.
  InProgressForm.where(form_id: form_id).where('created_at >= ?', ninety_days_ago).find_each do |ipf|
    # Build an unsaved claim from the in-progress form data and validate it.
    fc = form_class.new
    fc.form = ipf.form_data

    fc.validate
    if fc.errors.empty?
      statistics[:passed] += 1
    else
      statistics[:failed] += 1
      statistics[:errors][form_id] ||= []
      statistics[:errors][form_id] << fc.errors.full_messages
    end
  rescue StandardError => e
    # Count anything raised while building or validating as a failure.
    statistics[:failed] += 1
    statistics[:errors][form_id] ||= []
    statistics[:errors][form_id] << e.message
  end
end

# Print or log the statistics
puts "Passed: #{statistics[:passed]}"
puts "Failed: #{statistics[:failed]}"
# puts "Errors: #{statistics[:errors]}"

How is this error occurring?

Did these come from front-end form submission pages? Based on the controller route and the code, what else could they be?

When examining the logging applications we can see various user agents being used. However, it is not clear where in the front-end application this back-end call is being made.

When we look at the body of the redacted data in the logs, we see that it varies a lot and does not always seem to match VRE submissions.

The other data is also inconsistent, and data points such as referrers are not clear.

What if this is pointing to something out of the ordinary with these requests such as a session timing out or a 404?

What if there is some kind of rails failure fallback pathway that leads us here?

What else can we do?

How do we capture the front end state that triggers these requests to fail and pair with back end data?

How do we rebuild this error message that is non-standard from json_schema and what does the hash mean?

What is next?

mjknight50 commented 4 weeks ago

cc @va-albers @sanjabaj2

mjknight50 commented 3 weeks ago

This issue seems very similar: https://github.com/voxpupuli/json-schema/issues/514

sanjabaj2 commented 1 week ago

@mjknight50 We have had very recent occurrences of those JSON validation errors on VR&E. @va-albers added additional logging to the VRE dashboard.
Under the logs that match "VR&E claim was not saved", we saw quite a few yesterday, which could be why the monitor has gone off a lot. All were on the same pod. So, the fix didn't work.