GSA-TTS / FAC

GSA's Federal Audit Clearinghouse

Fix Certifying Auditee field names in historical data #3402

Open sambodeme opened 8 months ago

sambodeme commented 8 months ago

While preparing the data migration documentation, we noticed that AUDITEENAME was used where AUDITEECERTIFYNAME should have been, and AUDITEETITLE was used where AUDITEECERTIFYTITLE should have been. The impact on the disseminated reports is low: the error does not affect the financial content of the audit reports, only the auditee certifying information. However, because it still introduces inaccuracies into reports in production, it must be addressed.

Possible solutions:

  1. Correct the issue in the source records and re-run dissemination for all historic reports.
  2. Fix the issue on the dissemination side (general table) following a change record approach.
danswick commented 6 months ago

My initial reaction is that option 1 seems preferable. Option 2 would be overwritten if we ever had to re-run dissemination for some other reason.

Another possible approach would be to handle both cases during intake-to-dissemination. We could check for both cases and pass the appropriate one on to the disseminated record.
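
A minimal sketch of that fallback, using the Census column names purely as illustration (not the actual intake schema):

```python
# Illustrative sketch only: during intake-to-dissemination, prefer the
# correct certifying fields and fall back to the historically mis-mapped
# ones. Field names here mirror the Census columns, not the real intake
# record shape.
def resolve_certifying_auditee(record: dict) -> tuple[str, str]:
    name = record.get("AUDITEECERTIFYNAME") or record.get("AUDITEENAME", "")
    title = record.get("AUDITEECERTIFYTITLE") or record.get("AUDITEETITLE", "")
    return name, title
```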

jadudm commented 6 months ago

Option 1 is very expensive in time. Re-running a year of data on a single-core cloud.gov instance takes approximately two weeks.

A third option:

  1. Bring the singleauditreport records into the production tables.
  2. Add administrative API coverage for that table.
  3. Add an administrative API endpoint that allows for updating records (with appropriate change-management tracking, per @sambodeme's comment).
  4. Add the ability to re-run dissemination on a record via API.

We could, in this way, begin doing data curation via API, eliminating some of the challenges of trying to do all of this as GH Actions.
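
To make steps 3 and 4 a little more concrete, here is a rough, hypothetical sketch of what such an endpoint could look like. None of the names here (the write keys, the change logging, the re-dissemination call) are the real FAC code; they only illustrate the shape of the idea:

```python
# Rough sketch only -- not the real FAC admin API. It illustrates a
# "blessed" write key, a tracked field update, and a re-dissemination
# trigger. All names below are hypothetical placeholders.
import json
from django.http import JsonResponse
from django.views.decorators.http import require_POST

WRITE_KEYS = {"example-blessed-key"}  # keys "blessed" for curation writes

def log_change(report_id, changes, actor):
    # placeholder for real change-management tracking (audit table, etc.)
    print(f"curation change on {report_id} by {actor}: {changes}")

def redisseminate(report_id):
    # placeholder for a real re-dissemination entry point
    print(f"re-disseminating {report_id}")

@require_POST
def update_and_redisseminate(request, report_id):
    key = request.headers.get("X-Api-Key", "")
    if key not in WRITE_KEYS:
        return JsonResponse({"error": "forbidden"}, status=403)
    changes = json.loads(request.body)
    log_change(report_id, changes, actor=key)
    # ...apply `changes` to the source-of-truth record and save it here...
    redisseminate(report_id)
    return JsonResponse({"status": "ok"})
```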

I realize I've just suggested an entirely new idea that needs discussion. I'm wrestling with it and thinking this way because:

  1. We need a "source of truth" somewhere, and that is our singleauditchecklist. We want the change made there, to avoid the problem @danswick pointed out.
  2. We have a difficult time working with production data, because of our controls. (This is a good thing, but it is something that forces us to do work in lower environments and then manually move data in.)
  3. We have controls built into the admin API that could let us "bless" one or more keys as "write" keys, and in doing so, limit the possibility of "accidents" against the source data.
  4. The API can provide the logging and change tracking we need.
  5. We have no easy way to re-disseminate a record, even though it should be a cheap, easy operation.
  6. We have multiple other (similar) issues we need to address, and we need to make data curation less expensive. The API moves data curation into the space of stand-alone Python (or any language) tools that we can all test and validate against lower environments, and run with confidence against production.

Probably a few other things. I have no idea how this would play with the existing migration tooling... it might be a non-starter. But, for ongoing curation work, this might be worth discussing?

danswick commented 3 months ago

We'll tackle this as part of the next batch of curation work.

sambodeme commented 5 days ago

As I began reviewing the code and scaffolding the necessary logic to fix the auditee name and title (see ticket #3402), I realized that data curation might be needed to address issues with historical records migrated from Census data. This could happen for various reasons, including bugs in the migration algorithm that were not identified at the time and are now surfacing (or may surface in the future). Additionally, there might be a need to update records in the FAC databases, regardless of their origin. This typically occurs when the FAC team modifies intake validation rules, resulting in existing records that no longer validate against the new rules without updates.

When data curation involves historical records, fixing these issues will often require accessing raw data from the historical Census records table and reusing logic from the census_historical_migration app. Reusing this logic maintains consistency, for example in the way we handled missing values during migration by replacing them with the GSA_MIGRATION placeholder.
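
For example, this is the kind of shared helper I would want to reuse so curation handles missing values the same way the migration did (illustrative only; the real logic lives in census_historical_migration):

```python
# Illustrative sketch of a shared missing-value helper. The constant and
# function are placeholders for whatever census_historical_migration
# actually provides.
GSA_MIGRATION = "GSA_MIGRATION"  # sentinel used for values missing in Census data

def normalize_historical_value(value: str | None) -> str:
    """Replace empty/missing Census values the same way the migration did."""
    return value.strip() if value and value.strip() else GSA_MIGRATION
```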

This situation raises questions about how and where to organize the data curation code. Should we create a new app (data-curation) within the Django project and consolidate all data curation work there? This approach has the advantage of centralizing all data curation efforts in one place but may lead to the new app becoming too dependent on others, such as the census_historical_migration or audit apps.

Alternatively, should we include a curation section within each app (one for the census_historical_migration app and one for the audit app)? This approach would make the apps more self-contained and loosely coupled, reducing dependencies between them. However, it also means the curation logic would be spread across multiple apps.

jadudm commented 5 days ago

Thinking out loud...

The historical_migration code assumes:

  1. The data coming in is formatted as a submission, and
  2. It will need either transformation or permission to go through "incorrectly."

It would be heavyweight, but could all curation be implemented as

  1. Generate a new submission with the updated/fixed data,
  2. Pass it through the migration code

Such that the migration code is, for all intents and purposes, the only place we do this work?

(This is a third option. I haven't thought about how odd or heavyweight it might turn out to be.)
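
As a very rough sketch of that third option (the entry point and record shape are hypothetical placeholders, not the real migration code):

```python
# Very rough sketch: express a curation fix as a new submission-shaped
# record and hand it to the existing migration path.
def run_migration(submission: dict) -> None:
    # stand-in for the census_historical_migration transformation/validation path
    ...

def curate_via_migration(historical_record: dict, fixes: dict) -> None:
    submission = dict(historical_record)  # start from the original migrated data
    submission.update(fixes)              # apply the curated corrections
    run_migration(submission)             # reuse the migration code end to end
```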

My intuition/assumptions so far have been that having a curation app would be the way to go.

For this particular issue, I've been assuming a management command would have access to both the sac and the census_historical tables. Therefore, the operation is basically:

  1. Loop
    1. Visit each SAC
    2. Get the correct names from the historical tables
    3. with curation_tracking:
      1. Make the change in the sac
      2. Save the sac
    4. Re-disseminate the record

That is, I've been assuming we have 1) the current record and 2) the historical record in hand for all curation work, and therefore each action looks more like a management command that is probably only run once?
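
Concretely, and only as a sketch, that loop could take the shape of a one-off management command like this. The model import path, field names, `curation_tracking`, and the re-dissemination hook are all assumptions about where the real code lives, not verbatim FAC code:

```python
# Sketch of the one-off management command described above: visit each SAC,
# pull the correct certifying fields from the historical Census records,
# make the change under curation tracking, and re-disseminate.
from contextlib import contextmanager
from django.core.management.base import BaseCommand


@contextmanager
def curation_tracking(sac, reason):
    # placeholder: record who/what/why before and after the change
    yield


def lookup_certifying_fields(report_id):
    # placeholder: read AUDITEECERTIFYNAME / AUDITEECERTIFYTITLE from the
    # historical Census records table for this report
    return {"AUDITEECERTIFYNAME": "", "AUDITEECERTIFYTITLE": ""}


def redisseminate(sac):
    # placeholder: re-run dissemination for this single record
    pass


class Command(BaseCommand):
    help = "One-off: fix auditee certifying name/title on migrated records"

    def handle(self, *args, **options):
        # model import path and field names are assumptions
        from audit.models import SingleAuditChecklist

        for sac in SingleAuditChecklist.objects.all():        # 1. visit each SAC
            hist = lookup_certifying_fields(sac.report_id)    # 2. correct names
            with curation_tracking(sac, "fix certifying auditee fields"):
                # 3. make and save the change on the source of truth
                sac.auditee_certify_name = hist["AUDITEECERTIFYNAME"]
                sac.auditee_certify_title = hist["AUDITEECERTIFYTITLE"]
                sac.save()
            redisseminate(sac)                                 # 4. re-disseminate
```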