GSA-TTS / FAC

GSA's Federal Audit Clearinghouse

[ADR] Historical Data Migration: Implementing Strategic Logging and Transformation Tracking #2848

Open sambodeme opened 9 months ago

sambodeme commented 9 months ago

Areas of impact

Related documents/links

Context

The FAC data migration from Census to GSA follows an iterative approach summarized as follows: The migration algorithm is executed, logging details of both successful and unsuccessful migrations. The team then examines these logs, updates the algorithm to address failures, and re-runs the migration for the failed reports. If data transformation is needed for these reports to pass, the algorithm is updated accordingly, and the process is repeated until the migration is complete. This ADR focuses primarily on logging (successful/failed migrations and transformations). For a broader view of the migration process, refer to ADR linked in the previous section.

Decision

There are three key logging aspects in the historical data migration process:

  1. General Logging: This includes standard logs such as info and debug logs, which are common in most algorithms to provide useful insights during execution. Given the critical nature of this migration task, it's essential to ensure these logs are well-placed and written to aid debugging when needed.

  2. Logging Migration Attempts (Success/Failure): Each migration run will result in some reports successfully migrating and others failing. It's important to record adequate details of both to filter out successfully migrated reports in subsequent iterations. To track this, we'll use a Django model similar to the following:

    class ReportMigrationStatus(models.Model):
        audit_year = models.TextField(blank=True, null=True)
        dbkey = models.TextField(blank=True, null=True)
        run_datetime = models.DateTimeField(default=timezone.now)
        migration_status = models.TextField(blank=True, null=True)

    audit_year and dbkey will store the AUDITYEAR and DBKEY, respectively. run_datetime will record the migration attempt's timestamp, and migration_status will indicate whether the report was successfully migrated.
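    As a minimal sketch of the iterative loop described in the Context section (plain Python dicts standing in for ReportMigrationStatus rows rather than actual Django ORM queries; the "SUCCESS"/"FAILURE" status strings are assumptions for illustration):

    ```python
    # Plain-dict stand-ins for ReportMigrationStatus rows; the status
    # strings "SUCCESS"/"FAILURE" are assumptions for illustration.
    records = [
        {"audit_year": "2016", "dbkey": "100010", "migration_status": "SUCCESS"},
        {"audit_year": "2016", "dbkey": "100011", "migration_status": "FAILURE"},
        {"audit_year": "2017", "dbkey": "100012", "migration_status": "FAILURE"},
    ]

    def reports_to_retry(records):
        """Return (audit_year, dbkey) pairs that have not migrated successfully."""
        return [
            (r["audit_year"], r["dbkey"])
            for r in records
            if r["migration_status"] != "SUCCESS"
        ]
    ```

    A subsequent migration run would then be limited to the pairs returned here, which is what lets successfully migrated reports be filtered out of later iterations.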

    MigrationErrorDetail Model: This model will link to ReportMigrationStatus via a foreign key relationship, enabling a direct association between a migration attempt and its corresponding error details.

    class MigrationErrorDetail(models.Model):
        report_migration_status_id = models.ForeignKey(ReportMigrationStatus, on_delete=models.CASCADE)
        tag = models.TextField(blank=True, null=True)
        exception_class = models.TextField(blank=True, null=True)
        detail = models.TextField(blank=True, null=True)

    tag will categorize the error, exception_class will record the exception's class, and detail will contain the full description of the error.
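    A hedged sketch of how a failure might be captured into MigrationErrorDetail-shaped data (plain Python rather than the ORM; the tag value and the failing step are invented for illustration):

    ```python
    def error_detail_from_exception(exc, tag):
        """Build a MigrationErrorDetail-style record from a caught exception."""
        return {
            "tag": tag,
            "exception_class": type(exc).__name__,
            "detail": str(exc),
        }

    try:
        int("not-a-number")  # stand-in for a failing migration step
    except ValueError as exc:
        record = error_detail_from_exception(exc, tag="invalid_dbkey")
    ```

    In the real models, a record like this would be saved with a foreign key back to the ReportMigrationStatus row for that run.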

  3. Logging Data Transformations (Change Logs): This logging is crucial for documenting any modifications to the data throughout the migration process. It is vital to record the state of the data both before and after transformation, along with any other pertinent details for auditing purposes. The following Django model is utilized to track these changes:

    class MigrationInspectionRecord(models.Model):
        audit_year = models.TextField(blank=True, null=True)
        dbkey = models.TextField(blank=True, null=True)
        report_id = models.TextField(blank=True, null=True)
        run_datetime = models.DateTimeField(default=timezone.now)
        finding_text = JSONField(blank=True, null=True)
        additional_uei = JSONField(blank=True, null=True)
        additional_ein = JSONField(blank=True, null=True)
        finding = JSONField(blank=True, null=True)
        federal_award = JSONField(blank=True, null=True)
        cap_text = JSONField(blank=True, null=True)
        note = JSONField(blank=True, null=True)
        passthrough = JSONField(blank=True, null=True)
        general = JSONField(blank=True, null=True)
        secondary_auditor = JSONField(blank=True, null=True)

    The team has opted for this model as it appears more flexible and allows the use of nested JSON objects within census_data to represent the relationships between tables, columns, and data values. Below is a sample:

    {
        "census_data": [
            {
                "column": "sample_col1",
                "value": "sample_data1"
            },
            {
                "column": "sample_col2",
                "value": "sample_data2"
            }
        ],
        "gsa_fac_data": {
            "field": "sample_gsa_fac_field",
            "value":  "sample_value"
        },
        "transformation_functions": "function_name"
    }
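    A small helper (a sketch, not code from the migration repo) that assembles the sample structure above for one transformed field:

    ```python
    def change_record(census_pairs, gsa_field, gsa_value, function_name):
        """Assemble the before/after JSON stored in a MigrationInspectionRecord field."""
        return {
            "census_data": [{"column": c, "value": v} for c, v in census_pairs],
            "gsa_fac_data": {"field": gsa_field, "value": gsa_value},
            "transformation_functions": function_name,
        }

    entry = change_record(
        [("sample_col1", "sample_data1"), ("sample_col2", "sample_data2")],
        "sample_gsa_fac_field",
        "sample_value",
        "function_name",
    )
    ```

    Because census_data is a list, multiple source columns contributing to one GSA/FAC field fit naturally into the same shape.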

Database Tables Strategy

The GSA/FAC application relies on a default database for all its data persistence requirements. During the migration process, there was a consideration to integrate an additional database into the application's infrastructure. This new database was intended to host tables that might not stay active post-migration. However, anticipating scenarios where re-running the data migration could become necessary, the decision was made to maintain all data in a live state for the time being. Consequently, the plan to introduce a second database was set aside.

Consequences

JeanMarie-PM commented 9 months ago

Examples where a target value is determined by more than one source value:

  1. cfda.STATECLUSTERNAME = ( cfda.STATECLUSTERNAME.strip().upper() if "STATE CLUSTER" == cfda.CLUSTERNAME else "" )
  2. cfda.TYPEREPORT_MP = "" if cfda.MAJORPROGRAM == "N" else cfda.TYPEREPORT_MP
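The two rules above can be sketched as standalone functions (a paraphrase of the snippets, not the actual migration code):

```python
def state_cluster_name(stateclustername, clustername):
    """Rule 1: keep STATECLUSTERNAME only when CLUSTERNAME is 'STATE CLUSTER'."""
    return stateclustername.strip().upper() if clustername == "STATE CLUSTER" else ""

def typereport_mp(typereport_mp, majorprogram):
    """Rule 2: blank out TYPEREPORT_MP for non-major programs."""
    return "" if majorprogram == "N" else typereport_mp
```

Both targets depend on a second source column, which is why a single census_column_name field is not enough to log them.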
JeanMarie-PM commented 9 months ago

An example where the source values actually come from different tables is: (passthrough_names, passthrough_ids) = _fix_passthroughs(cfdas). Unrelated note: this function is a good candidate for refactoring so it operates on one cfda at a time rather than a list of cfdas.

jadudm commented 9 months ago

Perhaps @sambodeme and others will have a different suggestion, but given the nature of this table, it may make sense to declare the columns for source names as type ARRAY. That would give us a consistent type and address the question of logging multiple sources. (Will there ever be multiple destinations in a single mapping? I hope not? At most, we have 1:1 or Many:1, not Many:Many mappings?)

This does mean that the source (when "simple") will be an array of length one, but... that feels like a reasonable tradeoff to allow for multiple source fields contributing to a single destination.

If y'all go a different direction, that's fine, but having a consistent type makes sense to me. "The census_column_name will instead be census_column_names, and it is always of type text[]."

It looks like Django would call this ArrayField(TextField(...))

https://docs.djangoproject.com/en/4.2/ref/contrib/postgres/fields/#arrayfield

Whether we want a situation where we have {"table1", "table2"} in the table field, and {"column1", "column2"} in the name field is a design question... in a different context, I really wouldn't like this design. However, the goal is to log what a mapping function did, and it is something we control (meaning, this is algorithmically generated, not manually entered), so it should be OK/reliable. In other words: I hope the team does not get bogged down in seeking the perfect as opposed to the good.
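A hypothetical sketch of the "always an array" convention suggested above, in plain Python (the helper name is invented): source column names are normalized to a list even when there is only one, so the logged value is always the same type.

```python
def as_column_list(sources):
    """Normalize a single column name or an iterable of names to a list."""
    return [sources] if isinstance(sources, str) else list(sources)
```

The simple case becomes an array of length one, which is the tradeoff described above in exchange for supporting multi-source mappings.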

I'd consider MigrationChangeRecord as opposed to Log. Because we may also be logging (using the logging module) things as part of this, we should probably distinguish the difference between logs that end up being shipped to New Relic, and records we generate as part of the migration action.

report_id and is_public

Sorry to complicate the conversation (slightly)...

Should a new-style report ID be logged in this table?

We will be migrating tribal data. This will ultimately produce an entry in the general table that has the is_public field correctly set.

To query against the tables that result from migrations, we'll need a report_id as a FK into this table. Is that a post-processing step on these logs? Or, should we include a report_id as part of this logging, so that we can easily determine in queries and views that a given row in the migration record table is public (based on the report_id as a PK/FK and the is_public value in general) ?

Note I am very much not in favor of any post-processing steps if we can at all avoid them. Once a migration happens, it would be nice if the migration record was complete and ready for use/consumption elsewhere.
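A hypothetical sketch (plain dicts, not real tables; the report_id value is invented) of the lookup described above: using report_id as the key into general to decide whether a migration record's data is public.

```python
# Invented report_id; stands in for a row in the `general` table
# produced by migrating a tribal audit.
general = {"2022-12-GSAFAC-0000000001": {"is_public": False}}

def record_is_public(migration_record, general_rows):
    """Resolve is_public for a migration record via its report_id."""
    row = general_rows.get(migration_record["report_id"])
    return row["is_public"] if row is not None else None
```

With report_id captured at migration time, this lookup needs no post-processing step on the logs.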

gsa-suk commented 9 months ago

MigrationChangeRecord table needs to persist in GSAFAC and hence needs to be part of the audit app.

sambodeme commented 9 months ago

@jadudm we definitely want to include report_id in MigrationChangeRecord. This was a mistake on my end when drafting the model. I forgot to add report_id, and it was called out during a group meeting we had earlier. I updated the ADR. Using the ArrayField option seems like a simpler approach; however, it does require ArrayField for table_names, column_names, and data. Hence, I am not sure how easy it will be for a user to determine which data comes from which table/column when looking at census_**. In the updated ADR, I added an alternative option based on JSONField for us to explore.

jadudm commented 9 months ago

No apologies needed. We do this so we can discuss and solve things together.

I don't know if ArrayFields guarantee order. A JSON field would achieve the same thing. Mostly, I'm suggesting "do the simple thing" here, rather than trying to imagine (say) using some kind of 1:Many relationship to solve the problem. Given that the table is computationally generated, and will likely only ever be computationally processed, embedding some JSON or an array here makes sense to me. You and the team working on it should make the recommendation, though, that makes sense for the work you're doing and that meets the needs.

JeanMarie-PM commented 9 months ago

@sambodeme I like the idea of using a JSON field as it is consistent with what we do in other parts of our application. I think the ForeignKey field should be named sac or something similar rather than report_id. The reference is to a whole instance of the object rather than to a field within the instance.

sambodeme commented 9 months ago

There seems to be consensus around the JSON-based model for the Migration Change Record fields, so I updated the document accordingly.

jadudm commented 9 months ago

I don't think this has foreign keys, per se. It's a record of the changes, and the sac object (1) is internal, and (2) possibly won't be retained for historical data that is migrated (unless that's understood/decided).

So, recording the report_id for what will be disseminated data is, I would think, the correct thing to include.

danswick commented 9 months ago

Capturing discussion points:

danswick commented 2 months ago

Let's merge with https://github.com/GSA-TTS/FAC/issues/2913 and get a PR opened this week.