Open volcan01010 opened 1 year ago
Having looked further into this, it won't be easy to implement. There are two main hurdles:
Here is some code that ChatGPT wrote to demonstrate what the command should be for PostgreSQL. You can see that it explicitly uses the id
command.
ChatGPT was also able to generate equivalent commands for Oracle, SQLite and MS SQL Server. There are two things to note:
MERGE
, where SQLite and PostgreSQL use INSERT ... ON CONFLICT
ON
clause
Here is ChatGPT's suggestion for:
Please write a SQL command that lists the primary key columns for a table in a PostgreSQL database.
SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
SELECT constraint_name
FROM information_schema.table_constraints
WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);
And the translation to different database systems via:
Can you give equivalent commands for Oracle, SQLite and MS SQL Server databases?
Oracle:
SELECT column_name
FROM all_cons_columns
WHERE constraint_name = (
SELECT constraint_name
FROM all_constraints
WHERE table_name = 'your_table_name' AND constraint_type = 'P'
);
SQLite:
PRAGMA table_info(your_table_name)
WHERE pk > 0;
MS SQL Server:
SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
SELECT constraint_name
FROM information_schema.table_constraints
WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);
Note that I haven't checked them, and that the MS SQL Server value looks suspiciously similar to the PostgreSQL one.
I think that we could probably do this at some point.
It would be better to implement this as a new function, merge
, rather than updating the load
function as that would better reflect the fact that the underlying mechanism is different.
This would use a function behind the scenes called generate_merge_sql
instead of generate_insert_sql
that is used by load
.
I've used COPY to transfer rows in a db migration; this can be 10,000s or millions. My process is:
In one table there is a composite unqiue key (column_X, column_Y) where if there is a conflict in 1 of the 5000, because it is a single tranaction and the cursor is not commited, none of rows are copied. On a conflict I would like to ignore ON CONFLICT (column_X, column_Y) DO NOTHING
as they have already been migrated (so this is relevant for skip_conflict=True
only.)
You can also skip conflicts using ON CONFLICT ON CONSTRAINT [constraint_name] DO NOTHING
.
Unfortunately you cannot ON CONFLICT (id, column_X, column_Y) ...
as the PK and UQ constraints are independant. Neither can you chain them ON CONFLICT (id) ... ON CONFLICT (column_X, column_Y) ...
With that in mind, if it fits your use cases too it would be great if a copy function could specify a constraint_name
instead of only being a PK. For case witth more than one constraint, it would require the developer to either pre-filter their data beforehand. Or perform some kind of upsert but at that point it might be easier to supply their own SQL statement...
The first task, whatever we do is to determine the primary keys of a table.
The specific queries are given above.
The primary key columns can be added as an extra attribute to the table_info
response.
The current table_info
response looks like:
We would need to add a primary_key_query
to each DbHelper that performed a similar job to the table_info_query
.
https://github.com/BritishGeologicalSurvey/etlhelper/blob/456f0db009524354d0377621067ba56e8ac441da/etlhelper/db_helpers/postgres.py#L13C4-L13C21
Summary
As an ETL Helper user I want to use
skip_conficts
so that I can handle primary key violations at the database level without requiring ETL Helper's error handling.Description
A common error in a big insert is where records already exist for a given primary key. Some databases have internal methods for handling these scenarios. For example SQLite has the
ON CONFLICT
orINSERT
syntax: https://www.sqlite.org/lang_conflict.html. PostgreSQL has ON CONFLICT DO NOTHING: https://www.postgresql.org/docs/current/sql-insert.html#:~:text=ON%20CONFLICT%20DO%20NOTHING%20simply,can%20perform%20unique%20index%20inference.There should be implementations for each database. The specific way in which they are applied will need to be configured in the DbHelper classes. The SQL statement generated for the
load
function may need changes around the INSERT keyword, or as a suffix.Acceptance critera
load
withskip_conflicts=True
means that primary key violations pass silently and rows are unchanged