Closed mdbenjam closed 7 years ago
Proposed solution from our discussion in #2621: create a table that provides a complete audit log of all writes to VACOLS. Columns could include:
We should create a method that all code uses to write to VACOLS. This method will both do the write and create a row in the audit table. No code should write to VACOLS outside of this method.
We may need to be careful about transactions, rollbacks, and failed writes. We don't want to write to the audit table, have the VACOLS write fail, and then have erroneous data in the audit table.
The goal is that with this table, we can have complete data recovery from any corrupt data that Caseflow writes. It should also aid in debugging Caseflow.
I'm sorry if I'm back tracking here, but the more I think about this, the more I'm thinking this might be an over engineered solution to the problem.
It should be difficult, but doable to restore data. I'd like to put more effort into getting the VACOLS writes correct the first time. Did these VACOLS bugs happen even after 2 code reviews and Artem's QA gauntlet?
And when we do need to restore data, this is what snapshots are for. They are useful for restoring VACOLS data for any reason, not just a Caseflow bug. Do we have them? Do we know how to restore data from them?
Additionally, how did we fix the earlier VACOLS issues? Were they fixable?
This is how we recovered from the last VACOLS disaster. https://github.com/department-of-veterans-affairs/caseflow/issues/2268
Regardless of how you try to make it right the first time, mistakes will be made. Therefore, we need to be able to recover when that day comes.
The VACOLS situation is bad. We rely on VA IT to recover snapshots, but the timing is not reliable. (Imagine NSD level support).
Alternatively, we can put all our energy on getting DMS to work to make sure VACOLS is fully replicated and backed up in the cloud.
It should be difficult, but doable to restore data. I'd like to put more effort into getting the VACOLS writes correct the first time. Did these VACOLS bugs happen even after 2 code reviews and Artem's QA gauntlet?
Yes, my smoke tests make sure that we do what the code does; they do not check business logic. So if we're happy to write wrong data to VACOLS, then it will pass.
X
then I will follow that. If we find out later it should've been V
then...well...shit.Couple of ways to minimize the damage in the future:
So @askldjd we have no capability to do snapshots on our own?
If we have snapshots, we still only have as much safety as the freshness of the snapshot. Imagine that snapshots are taken every hour, and the following timeline occurs:
We cannot fully recover that data, since we don't have a snapshot that's fresh enough. In practice, it may be that this timeline is unlikely to come up, but I'm nervous since a single mistake can ruin someone's life. @askldjd what do you think about the likelihood here? Am I being too paranoid?
@shanear, I agree that over-engineering is bad. How do you feel in response to @askldjd and @kierachell's comments? I agree with you that prevention is better than cure, but I also agree with @askldjd that we can't be 100% confident in any prevention measures.
The following implementation makes this obsolete: https://github.com/department-of-veterans-affairs/VACOLS/blob/1d31c01e5410d9b1c5a52fd5d454914ab6f56da4/dms/enable-supplemental-logs-in-vacols.md
supplemental logs are essentially an audit table
Based on our conversation from this ticket https://github.com/department-of-veterans-affairs/caseflow/issues/2621 we want to create an audit table for any writes to VACOLS. This way it is easy to restore corrupted data, or to catch such corruptions early.
Edits from @nickheiner
The output of this ticket should be a tech spec. The ticket can be closed when the tech spec is socialized across the team and we get consensus on the way to move forward. Follow-up tickets should be created for implementation.
Open questions (from discussion)