ankane / dbx

A fast, easy-to-use database library for R

Upsert with WHERE #31

Closed: viebrix closed this issue 1 year ago

viebrix commented 1 year ago

Hi, thanks for this great library. Would it be possible to enhance the library for Postgres with "WHERE ..." in the upsert method? Example:

INSERT INTO test_upsert as tu (name, status, test_field, identifier, count)
VALUES ('peter', 1, 17, 'ident', 1),
       ('shaun', 1, null, 'ident', 3),
       ('maria', 1, null, 'ident', 7)
ON CONFLICT (name, status) WHERE test_field IS NULL -- the conflicting condition
DO UPDATE SET count = tu.count + EXCLUDED.count
WHERE tu.name = 'shaun' AND tu.status = 1; -- when to update, if you don't want all of the updates to happen

Source of example: https://dba.stackexchange.com/questions/151431/postgresql-upsert-issue-with-null-values

ankane commented 1 year ago

Hey @viebrix, I don't really follow the example or use case. Can you share more context and provide a practical example in R about how it would work?

viebrix commented 1 year ago

Hi @ankane,

thanks for the quick reply. In Postgres < 15, a unique constraint/index always treats NULL values as distinct, so rows that differ only in a NULL column never count as duplicates; the NULLS NOT DISTINCT option only exists from Postgres 15 on. One possible workaround is to create two partial unique indexes: one over the columns that do not allow NULLs, applied only where the nullable column IS NULL, and a second over all of the columns, filtering the NULL values out. See more here (Postgres 14 or older: create two partial indexes).
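
For concreteness, here is a minimal sketch of that workaround, not from the dbx docs: the index names are illustrative, the columns match the test_upsert example further down in this thread, and dbxExecute is only used to run the DDL.

library(dbx)
# con <- dbxConnect(adapter="postgres", dbname="...")

# unique among rows where test_field has a value: all three columns
dbxExecute(con, "CREATE UNIQUE INDEX test_upsert_not_null_idx
  ON test_upsert (name, status, test_field)
  WHERE test_field IS NOT NULL")

# unique among rows where test_field is NULL: only the NOT NULL columns
dbxExecute(con, "CREATE UNIQUE INDEX test_upsert_null_idx
  ON test_upsert (name, status)
  WHERE test_field IS NULL")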

But filtering the NULL values out of the unique index also requires filtering the NULL values in the ON CONFLICT statement. This filtering works with ON CONFLICT ... WHERE ... For example (example from the user in the linked question):

INSERT INTO test_upsert as tu (name, status, test_field, identifier, count)
VALUES ('peter', 1, 17, 'ident', 1),
       ('shaun', 1, null, 'ident', 3),
       ('maria', 1, null, 'ident', 7)
ON CONFLICT (name, status) WHERE test_field IS NULL
DO NOTHING

In R and your code (if I had to enhance it), I would change the function signature from:

dbxUpsert <- function(conn, table, records, where_cols, batch_size=NULL, returning=NULL, skip_existing=FALSE)

to:

dbxUpsert <- function(conn, table, records, where_cols, batch_size=NULL, returning=NULL, skip_existing=FALSE, conflict_condition=NULL)

And later in the function, after line 51, I would add:

if (!is.null(conflict_condition)) {
  sql <- paste0(sql, " WHERE ", conflict_condition, " ")
}

Then it would be possible to call the function with:

dbxUpsert(con, "public.test_upsert", records, c("name", "status"),
          skip_existing=TRUE, conflict_condition="test_field IS NULL")
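
For illustration only, a rough sketch of the statement such a call would need to end up producing; this is not the actual SQL-building code inside dbx, and conflict_condition is the hypothetical new argument from the draft above (skip_existing=TRUE turns the action into DO NOTHING):

# rough sketch of the resulting statement, not the real dbx internals
sql <- paste0(
  "INSERT INTO public.test_upsert (name, status, ...) VALUES ...",
  " ON CONFLICT (name, status)",
  " WHERE test_field IS NULL",  # injected by the proposed conflict_condition
  " DO NOTHING"                 # because skip_existing=TRUE
)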

I don't know your code nearly well enough to judge the impact of such a change. I also don't know the consequences for the other databases like MySQL, SQLite, etc.

Here is also an example of ON CONFLICT with a WHERE condition.

Edit: The original purpose is to treat NULL values in a column like any other value. I have to import Excel sheets into my DB, and the rows in those sheets can be duplicates of rows that were already imported. Therefore I'm using upsert to import them only once and ignore them if they are already there. The special problem is that the Excel data also contains empty cells. But empty means "not used", which differs from the SQL interpretation where NULL means "any possible value". So I need two NULLs to count as equal when checking for duplicates, which is only possible natively in Postgres >= 15 (NULLS NOT DISTINCT). To handle this on older versions, I need the WHERE clause in the ON CONFLICT clause. But I think the possibility to use WHERE in ON CONFLICT could also be interesting for other tasks.

(Sorry if not everything is clear and understandable; I'm not a native English speaker.)

thanks and best regards

viebrix commented 1 year ago

Here is an R example of my problem:

# -- create in Postgres SQL:
# CREATE TABLE public.test_upsert (
# upsert_id serial,
# name character varying(32) NOT NULL,
# status integer NOT NULL,
# test_field text,
# identifier character varying(255),
# count integer,
# CONSTRAINT upsert_id_pkey PRIMARY KEY (upsert_id),
# CONSTRAINT test_upsert_name_status_test_field_key UNIQUE (name, status, test_field)
# );

#in R:

library(dbx)
#con<- dbxConnect(adapter="postgres", dbname="public" ) 
mydata1<-data.frame(name='shaun',status=1,test_field='test value',identifier='ident', count=1)
dbxUpsert(con, "test_upsert", mydata1, where_cols=c("name","status","test_field"), skip_existing=TRUE)

# DB: (inserts row because table is empty) <- Correct
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1

dbxUpsert(con, "test_upsert", mydata1, where_cols=c("name","status","test_field"), skip_existing=TRUE)

# DB: (skips insert - because row is already in DB) <- Correct
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1

mydata2<-data.frame(name='shaun',status=1,test_field=NA,identifier='ident', count=2)
dbxUpsert(con, "test_upsert", mydata2, where_cols=c("name","status","test_field"), skip_existing=TRUE)

# DB: (inserts new row which has a null value in test_field) <- Correct
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1
#   3   shaun   1   NULL    ident   2

mydata3<-data.frame(name='shaun',status=1,test_field=NA,identifier='ident', count=99)
dbxUpsert(con, "test_upsert", mydata3, where_cols=c("name","status","test_field"), skip_existing=TRUE)

# DB: (inserts row; SQL-conformant, but wrong for my application because the where_cols values look identical) <- for my purpose wrong
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1
#   3   shaun   1   NULL    ident   2
#   4   shaun   1   NULL    ident   99

dbxUpsert(con, "test_upsert", mydata3, where_cols=c("name","status","test_field"), skip_existing=TRUE)

# DB: (inserts row, because for SQL a NULL value could be any value, so it never matches another NULL) <- for my purpose wrong
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1
#   3   shaun   1   NULL    ident   2
#   4   shaun   1   NULL    ident   99
#   5   shaun   1   NULL    ident   99

dbxUpsert(con, "test_upsert", mydata3, where_cols=c("name","status","test_field"), skip_existing=TRUE)

# DB: (inserts the row again and again, because for SQL a NULL value could be any value, so it never matches another NULL) <- for my purpose wrong
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1
#   3   shaun   1   NULL    ident   2
#   4   shaun   1   NULL    ident   99
#   5   shaun   1   NULL    ident   99
#   6   shaun   1   NULL    ident   99
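
For comparison, on Postgres 15 or newer the duplicates above could presumably be avoided without any WHERE clause by declaring the unique constraint with NULLS NOT DISTINCT. A rough, untested sketch (the constraint name is illustrative, and it assumes this constraint replaces the plain UNIQUE (name, status, test_field) above):

# Postgres 15+ only: NULLs compare as equal for uniqueness,
# so two rows with NULL in test_field count as duplicates
dbxExecute(con, "ALTER TABLE test_upsert
  ADD CONSTRAINT test_upsert_nnd_key
  UNIQUE NULLS NOT DISTINCT (name, status, test_field)")

# a plain upsert should then also skip the NULL duplicates
dbxUpsert(con, "test_upsert", mydata3, where_cols=c("name","status","test_field"), skip_existing=TRUE)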

viebrix commented 1 year ago

Here is an example which uses the dbxUpsert changes from my draft:

con<-c2_db_apm_connect(use_dbx=T)
# -- create in Postgres SQL:
# CREATE TABLE public.test_upsert1 (
# upsert_id serial,
# name character varying(32) NOT NULL,
# status integer NOT NULL,
# test_field text,
# identifier character varying(255),
# count integer,
# CONSTRAINT upsert1_id_pkey PRIMARY KEY (upsert_id),
# CONSTRAINT test_upsert1_name_status_test_field_key UNIQUE (name, status, test_field)
# );
# CREATE UNIQUE INDEX test_upsert1_upsert_id_idx
# ON public.test_upsert1 (name, status)
# WHERE test_field IS NULL ;

#in R:

library(dbx)
#con<- dbxConnect(adapter="postgres", dbname="public" ) 

# Now test row without NULL/NA

mydata1<-data.frame(name='shaun',status=1,test_field='test value',identifier='ident', count=1)
dbxUpsert(con, "test_upsert1", mydata1, where_cols=c("name","status","test_field"), skip_existing=TRUE,conflict_condition="test_field IS NULL")

# DB: (inserts row because table is empty) <- Correct
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1
# Now test duplicate without NULL

dbxUpsert(con, "test_upsert1", mydata1, where_cols=c("name","status","test_field"), skip_existing=TRUE,conflict_condition="test_field IS NULL")

# DB: (skips insert - because row is already in DB) <- Correct
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1

# Now test NULL/NA value:

mydata2<-data.frame(name='shaun',status=1,test_field=NA,identifier='ident', count=2)
dbxUpsert(con, "test_upsert1", mydata2, where_cols=c("name","status","test_field"), skip_existing=TRUE,conflict_condition="test_field IS NULL")

# DB: (inserts new row which has a null value in test_field) <- Correct
#   upsert_id   name    status  test_field  identifier  count
#   1   shaun   1   test value  ident   1
#   3   shaun   1   NULL    ident   2

# Now test with duplicate with NULL/NA value
mydata3<-data.frame(name='shaun',status=1,test_field=NA,identifier='ident', count=99)
dbxUpsert(con, "test_upsert1", mydata3, where_cols=c("name","status","test_field"), skip_existing=TRUE,conflict_condition="test_field IS NULL")

# Error: Failed to fetch row: ERROR:  duplicate key value violates unique constraint "test_upsert1_upsert_id_idx"
# DETAIL:  Key (name, status)=(shaun, 1) already exists.

It seems that it does not work because of where_cols=c("name","status","test_field"). Using where_cols=c("name","status") seems to work, but then it does not behave correctly for rows where test_field is not NULL/NA. I thought Postgres would ignore the column, but that does not seem to work. Maybe I have to solve the problem by splitting the columns into different tables. So my draft code seems to run without error, but it does not solve my problem. Maybe having a WHERE condition on ON CONFLICT could still be interesting for other problems, but I currently have no example of such a demand.
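
A possible workaround with the same draft, sketched below and untested: the error above presumably occurs because the conflict target (name, status, test_field) WHERE test_field IS NULL cannot match the partial index on (name, status), so the duplicate on that index is not handled. Splitting the records by whether test_field is NA and running one upsert per unique index might avoid this; conflict_condition is the hypothetical argument from the draft, not part of released dbx.

library(dbx)
# con <- dbxConnect(adapter="postgres", dbname="...")

records <- data.frame(name='shaun', status=1, test_field=NA, identifier='ident', count=99)

# rows with a value in test_field: conflict on the full unique constraint
with_value <- records[!is.na(records$test_field), , drop=FALSE]
if (nrow(with_value) > 0) {
  dbxUpsert(con, "test_upsert1", with_value,
            where_cols=c("name","status","test_field"), skip_existing=TRUE)
}

# rows with NULL/NA in test_field: conflict on the partial index (name, status) WHERE test_field IS NULL
with_null <- records[is.na(records$test_field), , drop=FALSE]
if (nrow(with_null) > 0) {
  dbxUpsert(con, "test_upsert1", with_null,
            where_cols=c("name","status"), skip_existing=TRUE,
            conflict_condition="test_field IS NULL")  # hypothetical draft argument
}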