[Research Spike] Audit Table for VACOLS writes

mdbenjam commented 7 years ago

Based on our conversation from this ticket https://github.com/department-of-veterans-affairs/caseflow/issues/2621 we want to create an audit table for any writes to VACOLS. This way it is easy to restore corrupted data, or to catch such corruptions early.

Edits from @nickheiner

The output of this ticket should be a tech spec. The ticket can be closed when the tech spec is socialized across the team and we get consensus on the way to move forward. Follow-up tickets should be created for implementation.

Open questions (from discussion)

How can we get all the pathways where we write to VACOLS? (ActiveRecord, direct SQL, etc)
Can we do this without a major performance hit, if we need to look up the previous value of each field before we write it?
Do we want to have automated tools to restore VACOLS data from this audit table, or to otherwise present the narrative the table provides in a more human-readable way?

NickHeiner commented 7 years ago

Proposed solution from our discussion in #2621: create a table that provides a complete audit log of all writes to VACOLS. Columns could include:

Timestamp
App doing the write (eg Reader, Certification, Dispatch, Hearings Prep)
Optionally another string label for the app to document why the write is happening, if it's otherwise ambiguous.
Table name
Column name
Previous value
Next value

We should create a method that all code uses to write to VACOLS. This method will both do the write and create a row in the audit table. No code should write to VACOLS outside of this method.

We may need to be careful about transactions, rollbacks, and failed writes. We don't want to write to the audit table, have the VACOLS write fail, and then have erroneous data in the audit table.

The goal is that with this table, we can have complete data recovery from any corrupt data that Caseflow writes. It should also aid in debugging Caseflow.

shanear commented 7 years ago

I'm sorry if I'm back tracking here, but the more I think about this, the more I'm thinking this might be an over engineered solution to the problem.

It should be difficult, but doable to restore data. I'd like to put more effort into getting the VACOLS writes correct the first time. Did these VACOLS bugs happen even after 2 code reviews and Artem's QA gauntlet?

And when we do need to restore data, this is what snapshots are for. They are useful for restoring VACOLS data for any reason, not just a Caseflow bug. Do we have them? Do we know how to restore data from them?

Additionally, how did we fix the earlier VACOLS issues? Were they fixable?

askldjd commented 7 years ago

This is how we recovered from the last VACOLS disaster. https://github.com/department-of-veterans-affairs/caseflow/issues/2268

Regardless of how you try to make it right the first time, mistakes will be made. Therefore, we need to be able to recover when that day comes.

The VACOLS situation is bad. We rely on VA IT to recover snapshots, but the timing is not reliable. (Imagine NSD level support).

Alternatively, we can put all our energy on getting DMS to work to make sure VACOLS is fully replicated and backed up in the cloud.

ghost commented 7 years ago

It should be difficult, but doable to restore data. I'd like to put more effort into getting the VACOLS writes correct the first time. Did these VACOLS bugs happen even after 2 code reviews and Artem's QA gauntlet?

Yes, my smoke tests make sure that we do what the code does; they do not check business logic. So if we're happy to write wrong data to VACOLS, then it will pass.

I can add business logic validation, but if I am told that column A should be marked X then I will follow that. If we find out later it should've been V then...well...shit.
Another issue is refactoring and "make sure everything just works" AC. Which basically tells me: retest everything ever and make sure there isn't regressions. Well, I test that nothing is "broken" but digging up age-old closed issues is a PIA. Sure, tell me to do and I will - but I was told "don't need regression tests." Another issue is that automated tests check that "things happen" and not that "things don't happen" - e.g. writing VACOLS notes when not needed

Couple of ways to minimize the damage in the future:

Regression matricies
Test that make sure that we don't do things
Avoid "make sure everything just works" AC
QA more mindful of the VACOLS writing logs. (maybe a "vacols write" or some kind of label would help alert us?)

NickHeiner commented 7 years ago

So @askldjd we have no capability to do snapshots on our own?

If we have snapshots, we still only have as much safety as the freshness of the snapshot. Imagine that snapshots are taken every hour, and the following timeline occurs:

4:00 – Snapshot taken
4:23 – VACOLS app writes data to a row
4:43 – Caseflow writes bad data to that same row

We cannot fully recover that data, since we don't have a snapshot that's fresh enough. In practice, it may be that this timeline is unlikely to come up, but I'm nervous since a single mistake can ruin someone's life. @askldjd what do you think about the likelihood here? Am I being too paranoid?

@shanear, I agree that over-engineering is bad. How do you feel in response to @askldjd and @kierachell's comments? I agree with you that prevention is better than cure, but I also agree with @askldjd that we can't be 100% confident in any prevention measures.

ghost commented 7 years ago

The following implementation makes this obsolete: https://github.com/department-of-veterans-affairs/VACOLS/blob/1d31c01e5410d9b1c5a52fd5d454914ab6f56da4/dms/enable-supplemental-logs-in-vacols.md

supplemental logs are essentially an audit table

department-of-veterans-affairs / caseflow

[Research Spike] Audit Table for VACOLS writes #2681

Edits from @nickheiner

Open questions (from discussion)