Open hdjustice opened 1 month ago
Spent some time looking into this today. These all look to be caused by this error:
Read of file at /usr/local/bundle/cache/ruby/3.3.0/gems/ruby-saml-1.17.0/lib/schemas/2909da20-6ab0-595f-93cf-fdc3e923cdd2 failed.
I want to note that the uuid in the directory path is consistent between all failures 2909da20-6ab0-595f-93cf-fdc3e923cdd2
.
This error is occuring in the saved_claim model and form_matches_schema method (which is shared with other claims/forms), and it looks like that utilzes a json-schema gem.
Additionally, it looks like similar errors are happening for another team is running into the same issue. Their ticket is here.
Planning to look at implementing some more logging in our form validation logic to see if we can get more context on the error.
This datadog query shows intermittent occurrences for seven different forms with distinct service tags having this same issue in the form_matches_schema
method in the SavedClaim
class when using the json-schema gem. This chart shows the forms and their occurrences. It looks like we are intermittently unable to load the schema file to validate against.
Reached out to platform support in slack to see if they have any insight on this. slack link and the github issue.
I posted this in the slack thread, but this is the current update on the investigation into these errors:
At the moment, the best I can figure is something in the gem update on 10/10 caused something to break with how we are reading in the schema file. Still not sure what exactly. We had previously updated the SavedClaim
class to use the fully_validate_schema
method here, and I suspect that may be where the problem lies since the SavedClaim 's are the only place we are using that method vs using fully_validate
, which we use in other places which do not seem to have the same issue. That's my best guess at the moment for this issue. Not sure on the best next steps, however.
The slack link referenced above with Platform support has the most recent information about this issue. The current plan is to merge in the additional logging PR to see if it gives us more context. It does seem like the issue lies in the gem update and something being incompatible, but we have yet to discover the root cause. This issue is consistent for any claims classes that inherit from SavedClaim
.
New error logs from the first PR that logs any create
exceptions that are not a ValidationErrors
or InvalidVeteranStatus
exception. Looks like they mirror initial response query as far as logging every time we see this schema error.
Looks like some changes were made when trying to fix this issue from earlier this year. The scheme read error seems to have started after the gem was updated with the changes that were supposed to assist with the caching issue from that bug ticket.
@allanto-ah I'd like to turn the flipper toggle on to turn off the fully_validate_schema
method in staging. We can then submit some 10-10CG's and validate that they didn't fail or anything before we turn it on in prod.
@coope93 Yeah that sounds good to me
Environment: Staging
Verified with the flipper toggle on that there are no regression issues. Submissions work as expected. E2E slack test
Turned on the flipper toggle to turn off the fully_enable_schema
call for saved_claims this morning. As of this morning we are still seeing a few of the errors, so we plan to monitor it today to see if the total is lower or not. Watching these logs. Next steps may be looking into replace the json-schema
gem with another gem that is more actively maintained to do the json schema validation. Another team mentioned they may have capacity to take that work on.
Another potential band-aid for the 10-10 team would be to implement some sort of small retry logic for validating the schema, which would allow for better limiting the errors by retrying the failures that are raising exceptions. It would still have the errors, but the impact to the Veterans should be lessened.
EZR ran into the same error when trying to validate the json schema here. It really looks like it's a bug in the gem somewhere.
@allanto-ah Trying one more thing on our end before we close this one out. https://github.com/department-of-veterans-affairs/vets-api/pull/19571 adds the ability for the system to automatically retry validating the schema if it fails the first time. It will attempt it a total of 3 times. I think as far as QA we'll want to turn it on in staging and validate that we can still successfully validate? Since the underlying issue is flaky, I'm not sure how we validate it when it throws an exception the first time. But as long as it still works I think we are good.
@coope93 Sounds good, yeah we can turn it on in staging and try to submit to validate. As you said as long nothing is broken, we should be good!
@allanto-ah And to clarify, it calls this validate before it enqueues the background jobs. So if it does error, a generic error response should get returned the browser and displayed.
Background
Reviewing for
#download
errors on the 10-10CG form, it was found that there are some different 500 errors coming up on the V0::CaregiversAssistanceClaimsController#create resource. They are referring to some sort of JSON Schema issue. We need to investigate these errors, determine root cause and if there is any solution we can put in place to either resolve on our side, or guide the user to avoid the errors. We may also need to work with the CG team to confirm the schema requirements match up.Resources
Tasks
Acceptance Criteria