department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
284 stars 206 forks source link

[BE] 10-10CG - BUGs: Investigate JSON Schema 500 errors on #create controller #95897

Open hdjustice opened 1 month ago

hdjustice commented 1 month ago

Background

Reviewing for #download errors on the 10-10CG form, it was found that there are some different 500 errors coming up on the V0::CaregiversAssistanceClaimsController#create resource. They are referring to some sort of JSON Schema issue. We need to investigate these errors, determine root cause and if there is any solution we can put in place to either resolve on our side, or guide the user to avoid the errors. We may also need to work with the CG team to confirm the schema requirements match up.

Resources

Tasks

Acceptance Criteria

coope93 commented 1 month ago

Spent some time looking into this today. These all look to be caused by this error:

Read of file at /usr/local/bundle/cache/ruby/3.3.0/gems/ruby-saml-1.17.0/lib/schemas/2909da20-6ab0-595f-93cf-fdc3e923cdd2 failed.

I want to note that the uuid in the directory path is consistent between all failures 2909da20-6ab0-595f-93cf-fdc3e923cdd2.

This error is occuring in the saved_claim model and form_matches_schema method (which is shared with other claims/forms), and it looks like that utilzes a json-schema gem.

Additionally, it looks like similar errors are happening for another team is running into the same issue. Their ticket is here.

Planning to look at implementing some more logging in our form validation logic to see if we can get more context on the error.

coope93 commented 1 month ago

This datadog query shows intermittent occurrences for seven different forms with distinct service tags having this same issue in the form_matches_schema method in the SavedClaim class when using the json-schema gem. This chart shows the forms and their occurrences. It looks like we are intermittently unable to load the schema file to validate against.

coope93 commented 1 month ago

Reached out to platform support in slack to see if they have any insight on this. slack link and the github issue.

coope93 commented 4 weeks ago

I posted this in the slack thread, but this is the current update on the investigation into these errors:

At the moment, the best I can figure is something in the gem update on 10/10 caused something to break with how we are reading in the schema file. Still not sure what exactly. We had previously updated the SavedClaim class to use the fully_validate_schema method here, and I suspect that may be where the problem lies since the SavedClaim 's are the only place we are using that method vs using fully_validate, which we use in other places which do not seem to have the same issue. That's my best guess at the moment for this issue. Not sure on the best next steps, however.

coope93 commented 3 weeks ago

The slack link referenced above with Platform support has the most recent information about this issue. The current plan is to merge in the additional logging PR to see if it gives us more context. It does seem like the issue lies in the gem update and something being incompatible, but we have yet to discover the root cause. This issue is consistent for any claims classes that inherit from SavedClaim.

coope93 commented 3 weeks ago

New error logs from the first PR that logs any create exceptions that are not a ValidationErrors or InvalidVeteranStatus exception. Looks like they mirror initial response query as far as logging every time we see this schema error.

coope93 commented 2 weeks ago

Looks like some changes were made when trying to fix this issue from earlier this year. The scheme read error seems to have started after the gem was updated with the changes that were supposed to assist with the caching issue from that bug ticket.

coope93 commented 1 week ago

@allanto-ah I'd like to turn the flipper toggle on to turn off the fully_validate_schema method in staging. We can then submit some 10-10CG's and validate that they didn't fail or anything before we turn it on in prod.

allanto-ah commented 1 week ago

@coope93 Yeah that sounds good to me

allanto-ah commented 1 week ago

Environment: Staging

Verified with the flipper toggle on that there are no regression issues. Submissions work as expected. E2E slack test

coope93 commented 1 week ago

Turned on the flipper toggle to turn off the fully_enable_schema call for saved_claims this morning. As of this morning we are still seeing a few of the errors, so we plan to monitor it today to see if the total is lower or not. Watching these logs. Next steps may be looking into replace the json-schema gem with another gem that is more actively maintained to do the json schema validation. Another team mentioned they may have capacity to take that work on.

Another potential band-aid for the 10-10 team would be to implement some sort of small retry logic for validating the schema, which would allow for better limiting the errors by retrying the failures that are raising exceptions. It would still have the errors, but the impact to the Veterans should be lessened.

coope93 commented 1 week ago

EZR ran into the same error when trying to validate the json schema here. It really looks like it's a bug in the gem somewhere.

hdjustice commented 1 week ago

We got another one of these for the EZ form this morning as well.

coope93 commented 4 days ago

@allanto-ah Trying one more thing on our end before we close this one out. https://github.com/department-of-veterans-affairs/vets-api/pull/19571 adds the ability for the system to automatically retry validating the schema if it fails the first time. It will attempt it a total of 3 times. I think as far as QA we'll want to turn it on in staging and validate that we can still successfully validate? Since the underlying issue is flaky, I'm not sure how we validate it when it throws an exception the first time. But as long as it still works I think we are good.

allanto-ah commented 4 days ago

@coope93 Sounds good, yeah we can turn it on in staging and try to submit to validate. As you said as long nothing is broken, we should be good!

coope93 commented 4 days ago

@allanto-ah And to clarify, it calls this validate before it enqueues the background jobs. So if it does error, a generic error response should get returned the browser and displayed.