BritishGeologicalSurvey / etlhelper

ETL Helper is a Python ETL library to simplify data transfer into and out of databases.
https://britishgeologicalsurvey.github.io/etlhelper/
GNU Lesser General Public License v3.0
104 stars 25 forks

Add `skip_conflicts` option to `load` function #143

Open volcan01010 opened 1 year ago

volcan01010 commented 1 year ago

Note: this might be a bit complicated - maybe users should just use executemany and copy_rows and write their own INSERT statements to ensure that they get exactly what they want.

Summary

As an ETL Helper user I want to use skip_conflicts so that I can handle primary key violations at the database level without relying on ETL Helper's error handling.

Description

A common error in a big insert is that records already exist for a given primary key. Some databases have built-in syntax for handling this scenario. For example, SQLite has an ON CONFLICT clause for INSERT statements: https://www.sqlite.org/lang_conflict.html. PostgreSQL has ON CONFLICT DO NOTHING: https://www.postgresql.org/docs/current/sql-insert.html

There should be an implementation for each database. The specific way each is applied will need to be configured in the DbHelper classes. The SQL statement generated for the load function may need changes around the INSERT keyword, or a suffix appended.
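One way the DbHelper configuration could work is sketched below. This is a hypothetical illustration, not etlhelper's actual API: the CONFLICT_CLAUSES mapping, the dialect names and the generate_insert_sql signature are all assumptions.

```python
# Hypothetical sketch: each dialect supplies a "skip conflicts" fragment
# that a load-style function splices into its generated INSERT statement.
# None of these names exist in etlhelper itself.

CONFLICT_CLAUSES = {
    # PostgreSQL and SQLite take a suffix after the VALUES clause
    "postgres": "ON CONFLICT DO NOTHING",
    "sqlite": "ON CONFLICT DO NOTHING",
    # Oracle instead modifies the INSERT keyword itself (e.g. via the
    # IGNORE_ROW_ON_DUPKEY_INDEX hint), so a plain suffix does not fit
    "oracle": None,
}


def generate_insert_sql(table, columns, dialect, skip_conflicts=False):
    """Build a parametrised INSERT, optionally skipping conflicting rows.

    Uses '?' placeholders for simplicity; real drivers differ in paramstyle.
    """
    placeholders = ", ".join(["?"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    if skip_conflicts:
        suffix = CONFLICT_CLAUSES.get(dialect)
        if suffix is None:
            raise ValueError(f"skip_conflicts is not supported for {dialect}")
        sql = f"{sql} {suffix}"
    return sql
```

The suffix approach keeps the change local to SQL generation, but as the Oracle entry shows, not every dialect fits the same pattern, which is part of why this is hard to do generically.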

Acceptance critera

volcan01010 commented 1 year ago

Having looked further into this, it won't be easy to implement. There are two main hurdles:

Here is some code that ChatGPT wrote to demonstrate what the command should be for PostgreSQL. You can see that it explicitly uses the id column.

[Screenshots of a ChatGPT conversation, 2023-05-05]
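Since the screenshots are not reproduced here, a minimal runnable illustration of the behaviour under discussion: SQLite (3.24+) accepts the same ON CONFLICT ... DO NOTHING syntax as PostgreSQL, so the stdlib sqlite3 module can stand in for a PostgreSQL connection (the table and column names are invented for the example).

```python
import sqlite3

# A duplicate primary key normally aborts the insert; with
# ON CONFLICT (id) DO NOTHING the conflicting row is silently skipped
# and the rest of the batch goes through.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO my_table VALUES (1, 'existing')")

rows = [(1, "duplicate"), (2, "new")]
conn.executemany(
    "INSERT INTO my_table (id, value) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
    rows,
)
result = conn.execute("SELECT id, value FROM my_table ORDER BY id").fetchall()
# row 1 keeps its original value; only row 2 is inserted
```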

volcan01010 commented 1 year ago

ChatGPT was also able to generate equivalent commands for Oracle, SQLite and MS SQL Server. There are two things to note:

[Screenshots of a ChatGPT conversation showing equivalent commands for Oracle, SQLite and MS SQL Server, 2023-05-05]

volcan01010 commented 1 year ago

Here is ChatGPT's suggestion for:

Please write a SQL command that lists the primary key columns for a table in a PostgreSQL database.

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
    SELECT constraint_name
    FROM information_schema.table_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);

And the translation to different database systems via:

Can you give equivalent commands for Oracle, SQLite and MS SQL Server databases?

Oracle:

SELECT column_name
FROM all_cons_columns
WHERE constraint_name = (
    SELECT constraint_name
    FROM all_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'P'
);

SQLite:

PRAGMA table_info(your_table_name)
WHERE pk > 0;

MS SQL Server:

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
    SELECT constraint_name
    FROM information_schema.table_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);

Note that I haven't checked them, and that the MS SQL Server value looks suspiciously similar to the PostgreSQL one.
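For what it's worth, the SQLite suggestion above is indeed broken: PRAGMA statements do not accept a WHERE clause. SQLite 3.16+ exposes pragmas as table-valued functions, which gives a working equivalent (a sketch using the stdlib sqlite3 module; the table name is illustrative).

```python
import sqlite3

# pragma_table_info() is the table-valued form of PRAGMA table_info;
# its pk column is 0 for non-key columns and gives the 1-based position
# within the primary key otherwise, so ORDER BY pk preserves key order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table_name ("
    "  column_x TEXT, column_y TEXT, value TEXT,"
    "  PRIMARY KEY (column_x, column_y))"
)
pk_columns = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM pragma_table_info('your_table_name')"
        " WHERE pk > 0 ORDER BY pk"
    )
]
# pk_columns == ['column_x', 'column_y']
```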

volcan01010 commented 1 year ago

I think that we could probably do this at some point.

It would be better to implement this as a new function, merge, rather than updating the load function, as that would better reflect the fact that the underlying mechanism is different.

This would use a function behind the scenes called generate_merge_sql instead of generate_insert_sql that is used by load.
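A hypothetical sketch of what generate_merge_sql might emit for PostgreSQL follows. Neither merge nor generate_merge_sql exists in etlhelper yet; the function signature, the pk_columns parameter and the %s paramstyle are all assumptions.

```python
# Hypothetical sketch: a merge-oriented counterpart to generate_insert_sql
# that targets the table's primary key columns explicitly.

def generate_merge_sql(table, columns, pk_columns):
    """Build an INSERT that ignores rows conflicting on the primary key.

    Assumes the primary key columns have already been determined, e.g.
    via the information_schema queries earlier in this thread.
    """
    placeholders = ", ".join(["%s"] * len(columns))
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(pk_columns)}) DO NOTHING"
    )
```

Naming the conflict target explicitly (rather than a bare ON CONFLICT) makes the intent clearer and leaves room to extend the function to DO UPDATE later.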

kerberpolis commented 1 year ago

I've used COPY to transfer rows in a db migration; this can be tens of thousands or millions of rows. My process is:

  1. I export items from a source database
  2. Augment the data
  3. In batches of 5000 copy the augmented data to the target table.

In one table there is a composite unique key (column_X, column_Y). If there is a conflict in 1 of the 5000, then because it is a single transaction and the cursor is not committed, none of the rows are copied. On a conflict I would like to ignore the row with ON CONFLICT (column_X, column_Y) DO NOTHING, as it has already been migrated (so this is relevant for skip_conflicts=True only).
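That failure mode can be reproduced with the stdlib sqlite3 module, which shares PostgreSQL's ON CONFLICT syntax (the table and column names are illustrative; the batch is shrunk from 5000 rows to 3).

```python
import sqlite3

# One conflicting row in a batch makes the whole executemany fail and
# roll back, while ON CONFLICT ... DO NOTHING on the composite unique
# key lets the rest of the batch through.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY,"
    " column_x TEXT, column_y TEXT, UNIQUE (column_x, column_y))"
)
conn.execute("INSERT INTO target (column_x, column_y) VALUES ('a', '1')")
conn.commit()

batch = [("a", "1"), ("b", "2"), ("c", "3")]  # first row already migrated
try:
    conn.executemany(
        "INSERT INTO target (column_x, column_y) VALUES (?, ?)", batch
    )
except sqlite3.IntegrityError:
    conn.rollback()  # none of the batch is kept

conn.executemany(
    "INSERT INTO target (column_x, column_y) VALUES (?, ?)"
    " ON CONFLICT (column_x, column_y) DO NOTHING",
    batch,
)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM target").fetchall()[0][0]
# count == 3: the pre-existing row plus the two new ones
```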

You can also skip conflicts using ON CONFLICT ON CONSTRAINT [constraint_name] DO NOTHING.

Unfortunately you cannot write ON CONFLICT (id, column_X, column_Y) ... as the PK and UQ constraints are independent. Neither can you chain them: ON CONFLICT (id) ... ON CONFLICT (column_X, column_Y) ...

With that in mind, if it fits your use cases too, it would be great if a copy function could specify a constraint_name instead of only using the PK. For cases with more than one constraint, the developer would either need to pre-filter their data beforehand, or perform some kind of upsert; but at that point it might be easier to supply their own SQL statement...

volcan01010 commented 1 year ago

The first task, whatever we do, is to determine the primary keys of a table.

The specific queries are given above.

The primary key columns can be added as an extra attribute to the table_info response.

https://github.com/BritishGeologicalSurvey/etlhelper/blob/456f0db009524354d0377621067ba56e8ac441da/etlhelper/utils.py#L13

The current table_info response looks like:

https://github.com/BritishGeologicalSurvey/etlhelper/blob/456f0db009524354d0377621067ba56e8ac441da/test/integration/test_utils.py#L10

We would need to add a primary_key_query to each DbHelper that performed a similar job to the table_info_query. https://github.com/BritishGeologicalSurvey/etlhelper/blob/456f0db009524354d0377621067ba56e8ac441da/etlhelper/db_helpers/postgres.py#L13C4-L13C21
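A hypothetical sketch of such an attribute on the PostgreSQL helper is below. The class body, query text and pyformat parameter style are assumptions mirroring the information_schema query earlier in the thread; etlhelper does not currently define primary_key_query.

```python
# Hypothetical sketch: a primary_key_query class attribute alongside the
# existing table_info_query, parametrised on table name.

class PostgresDbHelper:
    primary_key_query = """
        SELECT kcu.column_name
          FROM information_schema.table_constraints tc
          JOIN information_schema.key_column_usage kcu
            ON kcu.constraint_name = tc.constraint_name
           AND kcu.table_schema = tc.table_schema
         WHERE tc.table_name = %(table_name)s
           AND tc.constraint_type = 'PRIMARY KEY'
         ORDER BY kcu.ordinal_position
        """
```

Joining table_constraints to key_column_usage (rather than using a scalar subquery on constraint_name, as in the ChatGPT version above) avoids errors when a name matches more than one constraint, and ORDER BY ordinal_position returns composite key columns in key order.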