sambodeme opened 9 months ago
Examples where a target value is determined by more than one source value:
An example where the source is actually in different tables is in:
(passthrough_names, passthrough_ids) = _fix_passthroughs(cfdas)
Unrelated note: this function is a good candidate for refactoring so it operates on one cfda at a time rather than a list of cfdas.
Perhaps @sambodeme and others will have a different suggestion, but given the nature of this table, it may make sense to declare the columns for source names as type ARRAY. That would give us a consistent type and address the question of logging multiple sources. (Will there ever be multiple destinations in a single mapping? I hope not? At most, we have 1:1 or Many:1, not Many:Many mappings?)
This does mean that the source (when "simple") will be an array of length one, but... that feels like a reasonable tradeoff to allow for multiple source fields contributing to a single destination.
If y'all go a different direction, that's fine, but having a consistent type makes sense to me. "The `census_column_name` will instead be `census_column_names`, and it is always of type `text[]`."
It looks like Django would call this `ArrayField(TextField(...))`:
https://docs.djangoproject.com/en/4.2/ref/contrib/postgres/fields/#arrayfield
Whether we want a situation where we have `{"table1", "table2"}` in the table field and `{"column1", "column2"}` in the name field is a design question... In a different context, I really wouldn't like this design. However, the goal is to log what a mapping function did, and it is something we control (meaning, this is algorithmically generated, not manually entered), so it should be OK/reliable. In other words: I hope the team does not get bogged down in seeking the perfect as opposed to the good.
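As a sketch of the parallel-array shape being discussed, here is plain Python standing in for the eventual `ArrayField(TextField(...))` columns. The record keys and the helper function are hypothetical, not the project's actual schema:

```python
def make_change_record(sources, target_name, target_value):
    """Build one change record; `sources` is a list of (table, column, value) triples."""
    return {
        # Parallel arrays: index i of each list describes the same source field.
        "census_table_names": [table for table, _, _ in sources],
        "census_column_names": [column for _, column, _ in sources],
        "census_data": [value for _, _, value in sources],
        "gsa_field_name": target_name,
        "gsa_field_value": target_value,
    }

# Simple 1:1 mapping -> arrays of length one
simple = make_change_record(
    [("cfda", "passthrough_name", "Agency X")],
    "passthrough_name", "Agency X",
)

# Many:1 mapping -> arrays of length two, still the same text[] type
merged = make_change_record(
    [("cfda", "passthrough_name", "Agency X"),
     ("passthrough", "passthrough_id", "PT-01")],
    "passthrough", "Agency X|PT-01",
)
```

The tradeoff is visible here: the simple case carries length-one arrays, but the column types stay uniform across all mappings.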
I'd consider `MigrationChangeRecord` as opposed to `Log`. Because we may also be logging (using the `logging` module) things as part of this, we should probably distinguish between logs that end up being shipped to New Relic and records we generate as part of the migration action.
`is_public`
Sorry to complicate the conversation (slightly)...
Should a new-style report ID be logged in this table?
We will be migrating tribal data. This will ultimately produce an entry in the `general` table that has the `is_public` field correctly set.
To query against the tables that result from migrations, we'll need a `report_id` as a FK into this table. Is that a post-processing step on these logs? Or should we include a `report_id` as part of this logging, so that we can easily determine in queries and views that a given row in the migration record table is public (based on the `report_id` as a PK/FK and the `is_public` value in `general`)?
Note I am very much not in favor of any post-processing steps if we can at all avoid them. Once a migration happens, it would be nice if the migration record was complete and ready for use/consumption elsewhere.
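To illustrate the query pattern being proposed, here is a toy sketch: in-memory dicts stand in for the `general` table and the migration record table, and the record shapes and report IDs are hypothetical. In production this would be a database join, not Python filtering:

```python
# Hypothetical rows from the `general` table, keyed by report_id.
general = {
    "2022-12-GSAFAC-0000000001": {"is_public": True},
    "2022-12-GSAFAC-0000000002": {"is_public": False},
}

# Hypothetical migration change records carrying a report_id as the FK.
migration_records = [
    {"report_id": "2022-12-GSAFAC-0000000001", "field": "passthrough_name"},
    {"report_id": "2022-12-GSAFAC-0000000002", "field": "auditee_name"},
]

def public_records(records, general_table):
    """Keep only records whose report resolves to a public entry in general."""
    return [
        r for r in records
        if general_table.get(r["report_id"], {}).get("is_public", False)
    ]

publics = public_records(migration_records, general)
```

Because `report_id` lives on each record, no post-processing pass is needed: the public/private determination is a straightforward lookup at query time.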
MigrationChangeRecord table needs to persist in GSAFAC and hence needs to be part of the audit app.
@jadudm we definitely want to include `report_id` in `MigrationChangeRecord`. This was a mistake on my end when drafting the model. I forgot to add `report_id`, and it was called out during a group meeting we had earlier. I updated the ADR.
Using the ArrayField option seems like a simpler approach; however, it does require an ArrayField for each of `table_names`, `column_names`, and `data`. Hence, I am not sure how easy it will be for a user to determine which data comes from which table/column when looking at census_**. In the updated ADR, I added an alternative option based on JSONField for us to explore.
No apologies needed. We do this so we can discuss and solve things together.
I don't know if ArrayFields guarantee order. A JSON field would achieve the same thing. Mostly, I'm suggesting "do the simple thing" here, rather than trying to imagine (say) using some kind of 1:Many relationship to solve the problem. Given that the table is computationally generated, and will likely only ever be computationally processed, embedding some JSON or an array here makes sense to me. You and the team working on it should make the recommendation, though, that makes sense for the work you're doing and that meets the needs.
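A sketch of what the nested-JSON alternative might look like: a plain dict stands in for a Django `JSONField`, and all key names here are hypothetical, not the agreed schema:

```python
import json

# One change record: each source table maps to the columns and values that
# contributed to the target field, so table/column/value stay associated.
record = {
    "census_data": {
        "cfda": {"passthrough_name": "Agency X"},
        "passthrough": {"passthrough_id": "PT-01"},
    },
    "gsa_field_name": "passthrough",  # hypothetical target field
}

# JSON round-trips cleanly and, unlike three parallel arrays, keeps each
# column/value pair attached to its source table by construction.
restored = json.loads(json.dumps(record))
```

This is the property the parallel-array layout lacks: the reader never has to line up index *i* across `table_names`, `column_names`, and `data` to reconstruct the provenance of a value.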
@sambodeme I like the idea of using a JSON field as it is consistent with what we do in other parts of our application.
I think the ForeignKey field should be named `sac` or something similar rather than `report_id`. The reference is to a whole instance of the object rather than to a field within the instance.
There seems to be a consensus around the JSON-based model for the fields in Migration Change Records, so I updated the document accordingly.
I don't think this has foreign keys, per se. It's a record of the changes, and the `sac` object 1) is internal, and 2) possibly won't be retained for historical data that is migrated. (Unless that's understood/decided.)
So, recording the `report_id` for what will be disseminated data is, I would think, the correct thing to include.
Capturing discussion points:
Which app should `MigrationChangeRecord` live in?
Let's merge with https://github.com/GSA-TTS/FAC/issues/2913 and get a PR opened this week.
Areas of impact
Related documents/links
Context
The FAC data migration from Census to GSA follows an iterative approach summarized as follows: The migration algorithm is executed, logging details of both successful and unsuccessful migrations. The team then examines these logs, updates the algorithm to address failures, and re-runs the migration for the failed reports. If data transformation is needed for these reports to pass, the algorithm is updated accordingly, and the process is repeated until the migration is complete. This ADR focuses primarily on logging (successful/failed migrations and transformations). For a broader view of the migration process, refer to ADR linked in the previous section.
Decision
There are three key logging aspects in the historical data migration process:
General Logging: This includes standard logs such as info and debug logs, which are common in most algorithms to provide useful insights during execution. Given the critical nature of this migration task, it's essential to ensure these logs are well-placed and written to aid debugging when needed.
Logging Migration Attempts (Success/Failure): Each migration run will result in some reports successfully migrating and others failing. It's important to record adequate details of both to filter out successfully migrated reports in subsequent iterations. To track this, we'll use a Django model similar to the following:
`audit_year` and `dbkey` will store the AUDITYEAR and DBKEY, respectively. `run_datetime` will record the migration attempt's timestamp, and `migration_status` will indicate whether the report was successfully migrated.
MigrationErrorDetail Model: This model will link to `ReportMigrationStatus` via a foreign key relationship, enabling a direct association between a migration attempt and its corresponding error details. `tag` will provide a tag for the error, `exception_class` will provide the class of the error, and `detail` will contain the full description of the error.
Logging Data Transformations (Change Logs): This logging is crucial for documenting any modifications to the data throughout the migration process. It is vital to record the state of the data both before and after transformation, along with any other pertinent details for auditing purposes. The following Django model is utilized to track these changes:
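Based on the fields described above, the three record shapes might be sketched roughly as follows. Plain dataclasses stand in for the Django models; the field names follow the text, while the types, status values, and sample data are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ReportMigrationStatus:
    # In Django these would be model fields (CharField/DateTimeField/...).
    audit_year: str          # AUDITYEAR from the Census data
    dbkey: str               # DBKEY from the Census data
    run_datetime: datetime   # timestamp of this migration attempt
    migration_status: str    # whether the report migrated, e.g. "SUCCESS"/"FAILURE"

@dataclass
class MigrationErrorDetail:
    # In Django, report_migration_status would be a ForeignKey to the model above.
    report_migration_status: ReportMigrationStatus
    tag: str                 # short tag identifying the error
    exception_class: str     # class of the raised exception
    detail: str              # full description of the error

@dataclass
class MigrationInspectionRecord:
    # Change log: nested JSON under census_data keeps table/column/value together.
    report_id: str
    census_data: dict = field(default_factory=dict)

status = ReportMigrationStatus("2022", "123456", datetime(2023, 11, 1), "FAILURE")
error = MigrationErrorDetail(status, "invalid_ein", "ValidationError",
                             "EIN failed format check")
```

The foreign-key linkage is what allows each failed `ReportMigrationStatus` row to be traced directly to its error details during the next iteration of the migration.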
The team has opted for this model as it appears more flexible and allows the use of nested JSON objects within `census_data` to represent the relationships between tables, columns, and data values.
Database Tables Strategy
The GSA/FAC application relies on a default database for all its data persistence requirements. During the migration process, there was a consideration to integrate an additional database into the application's infrastructure. This new database was intended to host tables that might not stay active post-migration. However, anticipating scenarios where re-running the data migration could become necessary, the decision was made to maintain all data in a live state for the time being. Consequently, the plan to introduce a second database was set aside.
Consequences
The `ReportMigrationStatus` model tracks migration status and makes it possible to compute the success-to-failure ratio after each iteration. Ideally, the failure rate should be 0% after all iterations for a given audit year.
The `MigrationInspectionRecord` model provides a comprehensive record of data changes for audit purposes.
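As a sketch of the ratio computation the first point enables (hypothetical status values; in practice this would be an ORM aggregate over `ReportMigrationStatus` rows filtered by audit year):

```python
from collections import Counter

# Hypothetical migration_status values for one audit year's attempts.
statuses = ["SUCCESS", "SUCCESS", "FAILURE", "SUCCESS"]

counts = Counter(statuses)
failure_rate = counts["FAILURE"] / len(statuses)  # target after all iterations: 0.0
```

Re-running the migration for just the failed reports should drive `failure_rate` toward zero across iterations.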