Summary

Currently, the pattern adopted for upserting data in etldb/main.py is to: see the following sections for examples.

A more common pattern would be to select the distinct set of values you want to insert using the pandas DataFrame.drop_duplicates() method and then iterate through the deduplicated data, inserting each row or, preferably, performing a bulk upsert.
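A rough sketch of that approach, assuming a PostgreSQL target, a psycopg2 connection, and a hypothetical users table keyed on user_id (none of these names come from the actual etldb schema): the deduplication happens in a single pass over the DataFrame and the upsert in a single statement.

```python
import pandas as pd
from psycopg2.extras import execute_values


def bulk_upsert_users(df: pd.DataFrame, conn) -> None:
    """Deduplicate in one pass, then upsert every row in a single statement."""
    # Single pass over the source DataFrame to get the distinct rows to load.
    unique_rows = df[["user_id", "user_name"]].drop_duplicates()

    with conn.cursor() as cur:
        # execute_values expands the VALUES list so all rows go in one round trip.
        execute_values(
            cur,
            """
            INSERT INTO users (user_id, user_name)
            VALUES %s
            ON CONFLICT (user_id) DO UPDATE
                SET user_name = EXCLUDED.user_name
            """,
            list(unique_rows.itertuples(index=False, name=None)),
        )
    conn.commit()
```

psycopg2 is only one option here; SQLAlchemy or a staging table loaded with DataFrame.to_sql would work just as well. The point is a single deduplication pass feeding a single bulk statement.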
In addition to requiring multiple passes through the dataset, with $O(n \cdot k)$ complexity (where $n$ is the size of the source DataFrame and $k$ is the number of unique IDs for a given table), versus a single pass to retrieve all of the data we want to insert, the current pattern also makes it harder to adopt bulk upserts.
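To make the cost difference concrete, here is a hypothetical illustration (not the actual etldb/main.py code, and the user_id/user_name columns are invented): the first helper filters the full DataFrame once per unique ID, roughly n rows times k IDs, while the second extracts the same rows in a single pass.

```python
import pandas as pd


def per_id_passes(df: pd.DataFrame) -> list:
    """Roughly O(n * k): one filtering pass over all n rows per unique ID."""
    rows = []
    for user_id in df["user_id"].unique():       # k unique IDs
        match = df[df["user_id"] == user_id]     # each filter scans all n rows
        rows.append((user_id, match["user_name"].iloc[0]))
    return rows


def single_pass(df: pd.DataFrame) -> list:
    """Roughly O(n): deduplicate once and keep everything we need."""
    unique_rows = df[["user_id", "user_name"]].drop_duplicates(subset="user_id")
    return list(unique_rows.itertuples(index=False, name=None))
```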
Bulk transactions are helpful because they allow us to combine DML into transaction blocks more easily and prevent instances in which tables are only partially updated during a batch data load process.
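A sketch of that idea under the same assumptions as above (psycopg2 against PostgreSQL, placeholder users and orders tables): both bulk statements run inside one transaction, so a failure rolls back everything instead of leaving one table updated and the other not.

```python
from psycopg2.extras import execute_values


def load_batch(conn, user_rows: list, order_rows: list) -> None:
    """Upsert related tables inside one transaction: all-or-nothing."""
    try:
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO users (user_id, user_name) VALUES %s "
                "ON CONFLICT (user_id) DO UPDATE SET user_name = EXCLUDED.user_name",
                user_rows,
            )
            execute_values(
                cur,
                "INSERT INTO orders (order_id, user_id, total) VALUES %s "
                "ON CONFLICT (order_id) DO NOTHING",
                order_rows,
            )
        conn.commit()    # both tables become visible together
    except Exception:
        conn.rollback()  # neither table is left partially updated
        raise
```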
Acceptance criteria
[ ] We've reduced the number of passes through the DataFrame needed to extract the unique set of values to insert into the database.
[ ] The data is loaded using bulk DML statements rather than separate insert/update statements for each row of data.