HHS / simpler-grants-gov

https://simpler.grants.gov

Refactor insert pattern for GitHub transform and load #2968

Open widal001 opened 22 hours ago

widal001 commented 22 hours ago

Summary

Currently, the pattern adopted for upserting data in etldb/main.py is to:

  1. Filter for the unique set of GitHub IDs
  2. Iterate through that list of IDs
  3. Then filter the dataset again to get data for a given ID

See the following sections for examples:
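Since the linked examples aren't reproduced here, a minimal sketch of the current pattern might look like this (the column names and the keep-the-last-row rule are illustrative assumptions, not the actual schema or logic):

```python
import pandas as pd

# Hypothetical source data; "ghid" and "title" are stand-in column names
df = pd.DataFrame({
    "ghid": ["a", "a", "b", "c"],
    "title": ["A1", "A2", "B1", "C1"],
})

# 1. Filter for the unique set of GitHub IDs
unique_ids = df["ghid"].unique()

# 2. Iterate through that list of IDs
rows = []
for ghid in unique_ids:
    # 3. Filter the dataset again to get data for a given ID
    subset = df[df["ghid"] == ghid]
    rows.append(subset.iloc[-1].to_dict())  # e.g. keep the last row per ID

print(rows)
```

Note that the inner filter scans the full DataFrame once per unique ID, which is the repeated-pass cost described below.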

A more common pattern would be to select the distinct set of values you want to insert using the pandas DataFrame.drop_duplicates() method, then either iterate through that deduplicated data inserting each row or, preferably, perform a bulk upsert.
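A sketch of that single-pass pattern, under the same hypothetical column names as above:

```python
import pandas as pd

# Hypothetical source data; column names are illustrative
df = pd.DataFrame({
    "ghid": ["a", "a", "b", "c"],
    "title": ["A1", "A2", "B1", "C1"],
})

# Single pass: keep one row per GitHub ID, then hand the whole
# frame to the database layer as one bulk-upsert payload
deduped = df.drop_duplicates(subset=["ghid"], keep="last")
records = deduped.to_dict(orient="records")

print(records)
```

`drop_duplicates(subset=["ghid"])` replaces both the unique-ID filter and the per-ID re-filtering with one operation over the frame.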

In addition to requiring multiple passes through the dataset, with $O(n \cdot k)$ complexity (where $n$ is the size of the source DataFrame and $k$ is the number of unique IDs for a given table) versus a single pass to retrieve all the data we want to insert, this pattern also makes it harder to adopt bulk upserts.

Bulk transactions are helpful because they allow us to combine DML statements into transaction blocks more easily and prevent instances in which tables are only partially updated during a batch data load.
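To illustrate the atomicity point, here is a minimal sketch using the standard-library sqlite3 module; the project presumably uses a different database layer, and the table and column names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issue (ghid TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO issue VALUES ('a', 'old title')")
conn.commit()

records = [("a", "A2"), ("b", "B1"), ("c", "C1")]

# One transaction: either every row is upserted or none are, so a
# failure mid-batch cannot leave the table partially updated.
# (sqlite3's "with conn" commits on success and rolls back on error.)
with conn:
    conn.executemany(
        """
        INSERT INTO issue (ghid, title) VALUES (?, ?)
        ON CONFLICT (ghid) DO UPDATE SET title = excluded.title
        """,
        records,
    )

rows = conn.execute("SELECT ghid, title FROM issue ORDER BY ghid").fetchall()
print(rows)  # [('a', 'A2'), ('b', 'B1'), ('c', 'C1')]
```

The `ON CONFLICT ... DO UPDATE` upsert syntax requires SQLite 3.24+; the equivalent construct in the project's actual database layer would differ in spelling but not in shape.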

Acceptance criteria