sambodeme opened this issue 8 months ago
My initial reaction is that option 1 seems preferable. Option 2 would be overwritten if we ever had to re-run dissemination for some other reason.
Another possible approach would be to handle this during intake-to-dissemination: check for both fields and pass the appropriate one on to the disseminated record.
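As a rough sketch of that check (the field names here are illustrative, not the actual intake schema):

```python
# Illustrative sketch of the "handle both cases" idea during
# intake-to-dissemination. Field names are hypothetical, not the actual
# SAC schema: prefer the certifying fields, and fall back to the plain
# name/title fields when the certifying ones are absent.
def resolve_certifying_info(general_info: dict) -> tuple[str | None, str | None]:
    name = general_info.get("auditee_certify_name") or general_info.get("auditee_name")
    title = general_info.get("auditee_certify_title") or general_info.get("auditee_title")
    return name, title
```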
A third option: an API that disseminates `singleauditreport` records into the production tables. We could, in this way, begin doing data curation via API, eliminating some of the challenges of trying to do all of this as GH Actions.
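Hand-waving, but something shaped like this; every name below is a stand-in, not existing FAC code:

```python
# Stand-in sketch of dissemination-via-API. `disseminate_report` is a
# placeholder for whatever logic actually writes a singleauditreport
# record into the production tables; it does not exist in FAC today.
from django.http import JsonResponse
from django.views.decorators.http import require_POST


def disseminate_report(report_id: str) -> None:
    """Placeholder: copy the record for `report_id` into the production tables."""
    raise NotImplementedError


@require_POST
def redisseminate_view(request, report_id: str):
    """Trigger dissemination for one report on demand, rather than via GH Actions."""
    disseminate_report(report_id)
    return JsonResponse({"report_id": report_id, "status": "disseminated"})
```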
I've just suggested an entirely new idea that needs discussion, I think. I'm wrestling with/thinking this way because:

- The `singleauditchecklist` is the source record. We want the change made there, to avoid the problem @danswick pointed out.
- Probably a few other things.

I have no idea how this would play with the existing migration tooling... it might be a non-starter. But, for ongoing curation work, this might be worth discussing?
We'll tackle this as part of the next batch of curation work.
As I began reviewing the code and scaffolding the necessary logic to fix the auditee name and title (see ticket #3402), I realized that data curation might be needed to address issues with historical records migrated from Census data. This could happen for various reasons, including bugs in the migration algorithm that were not identified at the time and are only now surfacing (or may surface in the future). Additionally, there may be a need to update records in the FAC databases regardless of their origin; this typically occurs when the FAC team modifies intake validation rules, leaving existing records that cannot validate against the new rules until they are updated.
When data curation involves historical records, fixing these issues will often require accessing raw data from the historical Census records table and reusing logic from the `census_historical_migration` app. Reusing that logic maintains consistency; for example, missing values were replaced with the `GSA_MIGRATION` placeholder during migration, and any curation fix should honor the same convention.
This situation raises questions about how and where to organize the data curation code. Should we create a new app (`data-curation`) within the Django project and consolidate all data curation work there? This approach has the advantage of centralizing all data curation efforts in one place, but it may lead to the new app becoming too dependent on others, such as the `census_historical_migration` or `audit` apps.

Alternatively, should we include a curation section within each app (one for the `census_historical_migration` app and one for the `audit` app)? This approach would make the apps more self-contained and loosely coupled, reducing dependencies between them. However, it also means the curation logic would be spread across multiple apps.
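To make the trade-off concrete, a command in a standalone curation app might look roughly like this; the path and names are guesses, and the cross-app dependency is exactly the coupling concern described above:

```python
# Hypothetical file: data_curation/management/commands/fix_auditee_fields.py
# A standalone curation app would need to reach into
# census_historical_migration (or settings) for shared conventions such
# as the GSA_MIGRATION placeholder, which is the coupling concern.
from django.conf import settings
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "One-off curation: fix auditee certifying name/title on migrated records."

    def handle(self, *args, **options):
        # Assumption: the placeholder is exposed via settings; adjust to
        # wherever GSA_MIGRATION actually lives in the codebase.
        placeholder = settings.GSA_MIGRATION
        self.stdout.write(f"Curating records, using placeholder {placeholder!r}...")
        # ... reuse census_historical_migration transforms here ...
```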
Thinking out loud...
The `historical_migration` code assumes:
It would be heavyweight, but could all curation be implemented as migrations, such that the migration code is, for all intents and purposes, the only place we do this work?
(This is a third option. I haven't thought about how odd or heavyweight it might turn out to be.)
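If we went that way, each curation action might be expressed as an ordinary Django data migration, something like this sketch; the app, model, and dependency names are placeholders:

```python
# Sketch of "curation as a migration": a one-off data fix written as a
# standard Django data migration. App/model/dependency names are placeholders.
from django.db import migrations


def fix_auditee_fields(apps, schema_editor):
    # Use the historical model state, per Django data-migration convention.
    General = apps.get_model("dissemination", "General")  # placeholder model
    for record in General.objects.all():
        ...  # apply the curation logic to each record


class Migration(migrations.Migration):
    dependencies = [("dissemination", "0001_initial")]  # placeholder dependency

    operations = [
        migrations.RunPython(fix_auditee_fields, migrations.RunPython.noop),
    ]
```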
My intuition/assumptions so far have been that having a `curation` app would be the way to go.
For this particular issue, I've been assuming a management command would have access to both the `sac` and the `census_historical` tables. Therefore, the operation is basically:

- pull the raw record from the `historical` tables,
- record the action in `curation_tracking`,
- update the `sac`,
- re-disseminate the `sac`.

That is, I've been assuming we have 1) the current record and 2) the historical record in hand for all curation work, and therefore each action looks more like a management command that is probably only run once?
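A sketch of that one-shot shape; every helper here is an assumption, not an existing FAC function:

```python
# Sketch of the assumed one-shot curation command: with both the current
# sac record and the historical record in hand, fix, track, re-disseminate.
# All helpers are stand-ins, not existing FAC functions.
from django.core.management.base import BaseCommand


def load_historical(report_id: str):
    """Stand-in: pull the raw record from the historical tables."""
    raise NotImplementedError


def load_sac(report_id: str):
    """Stand-in: fetch the current sac record."""
    raise NotImplementedError


class Command(BaseCommand):
    help = "One-off curation of a single report (sketch)."

    def add_arguments(self, parser):
        parser.add_argument("report_id")

    def handle(self, *args, **options):
        report_id = options["report_id"]
        historical = load_historical(report_id)
        sac = load_sac(report_id)
        # 1. apply the fix to the sac record, using the historical data
        # 2. record the action in curation_tracking
        # 3. re-disseminate the sac so the production tables reflect the fix
```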
While preparing the data migration documentation, it was noted that `AUDITEENAME` was incorrectly used instead of `AUDITEECERTIFYNAME`, and `AUDITEETITLE` was used instead of `AUDITEECERTIFYTITLE`. It was determined that this will have a low impact on the disseminated reports, as it does not affect the financial aspect of the audit reports and only affects the auditee certifying information. However, because this still introduces some data inaccuracies in the reports in production, it must be addressed.

Possible solutions: