OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/

Investigate adding option for a duckdb backend or RSQLite+parquet #12

Open ablack3 opened 3 years ago

ablack3 commented 3 years ago

DuckDB is a file-based database similar to SQLite and has an R interface. It looks like it might provide performance benefits over SQLite.

https://duckdb.org/docs/api/r

Investigate performance differences between DuckDB and SQLite. Look into implementing an Andromeda DuckDB backend.
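For reference, a minimal sketch of what a duckdb-backed connection could look like (the table and query are illustrative, not Andromeda code): duckdb speaks the same DBI interface as RSQLite, so a backend swap would mostly change the connection step.

```r
library(DBI)

# DuckDB is driven through DBI, just like RSQLite; only the driver differs.
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")  # or a file path on disk

dbWriteTable(con, "iris", iris)
res <- dbGetQuery(con, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")

dbDisconnect(con, shutdown = TRUE)
```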

ablack3 commented 3 years ago

Some links:

- Add parquet support to RSQLite: https://github.com/r-dbi/RSQLite/issues/325
- SQLite parquet virtual table: https://github.com/cldellow/sqlite-parquet-vtable
- Why would this be great? https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html
- Maybe duckdb would be an easier and better route? https://duckdb.org/docs/api/r

Does SQLite cover all of OHDSI SQL? If not, where does it fall short, and would duckdb fill those gaps?

vojtechhuser commented 3 years ago

+1

ablack3 commented 2 years ago

Some simple join tests comparing duckdb and SQLite. According to the duckdb FAQ, duckdb has not been optimized for performance yet.

# Test large joins -----------------

n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = TRUE))

### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))

duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    4.92    0.05    5.01

system.time({
  DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
#>    user  system elapsed 
#>    0.40    0.01    0.45

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    2.06    0.05    2.19

duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))

### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))

RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    0.48    1.73    2.50

system.time({
  DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
#>    user  system elapsed 
#>    1.06    2.54    3.75

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
#>    user  system elapsed 
#>    0.45    2.37    2.89

DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))

Created on 2021-10-13 by the reprex package (v2.0.1)

schuemie commented 2 years ago

Thanks, this is very informative! There appears to be a large performance hit when using DuckDB, and performance is an important aspect of Andromeda.

chrisknoll commented 2 years ago

So, after wrestling with SQLite and its lack of full-fledged date types, single-schema limitations, lack of 'real' temp tables, and more, @schuemie pointed me to duckdb and I also found this issue.

I'm trying to create tests for CohortIncidence, which is heavy on date functions, and SQLite is failing hard. DatabaseConnector's attempts to work around it are valiant, but probably not sustainable.

I'll be looking to use duckdb in my tests for CohortIncidence, and I'd like to know if it would be OK for me to adopt it and possibly introduce SqlRender rules for translation (although it doesn't seem like we'd need much). Performance is not as critical for me, so even if you wanted to keep Andromeda on SQLite, I'd still appreciate it if we worked to support duckdb as a DatabaseConnector option.
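A small illustration of the date-type gap (a sketch, not CohortIncidence code): SQLite has no native DATE type, so date arithmetic needs workarounds like `julianday()`, while duckdb has real DATE values and date functions such as `datediff`.

```r
library(DBI)

# SQLite: dates are stored as TEXT/REAL/INTEGER; arithmetic needs julianday().
sqlite <- dbConnect(RSQLite::SQLite(), ":memory:")
sqliteDays <- dbGetQuery(sqlite,
  "SELECT CAST(julianday('2021-02-01') - julianday('2021-01-01') AS INTEGER) AS days")
dbDisconnect(sqlite)

# DuckDB: native DATE type and date functions.
duck <- dbConnect(duckdb::duckdb())
duckDays <- dbGetQuery(duck,
  "SELECT datediff('day', DATE '2021-01-01', DATE '2021-02-01') AS days")
dbDisconnect(duck, shutdown = TRUE)
```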

ablack3 commented 2 years ago

The performance hit is unfortunate and surprising, since duckdb is designed for OLAP and should outperform SQLite in the long run. My guess is that duckdb is still new while SQLite has had a long time to mature. Indexes or use of multi-threading might help. I'm very much in favor of starting to use it in some capacity (maybe for Eunomia) just so we can have date support. I agree the date workarounds are a losing battle long term.

Here is a related issue: https://github.com/OHDSI/DatabaseConnector/issues/153 @chrisknoll - do you need any changes in Andromeda? I think I have a working Andromeda branch that uses duckdb.
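On the multi-threading point: duckdb's parallelism is controlled by its `threads` setting, which none of the benchmarks in this thread adjusted. A sketch of how one would tune it (the value 4 is arbitrary, and we haven't measured its effect on the join timings):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
# Raise the thread count; the default varies by version and machine.
dbExecute(con, "PRAGMA threads=4")
threads <- dbGetQuery(con, "SELECT current_setting('threads') AS threads")
dbDisconnect(con, shutdown = TRUE)
```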

chrisknoll commented 2 years ago

No, I was just hoping to use DatabaseConnector, but it depends on some attribute being present in order to use the query/execute commands. Also, I think there will be some light SqlRender dialect work... but I don't use Andromeda directly for anything at the moment.

ablack3 commented 1 year ago

We're going with arrow!

schuemie commented 11 months ago

FYI: I reran the tests to see if DuckDB has improved. It has, but SQLite is still the winner (which is odd, because I get amazing performance out of DuckDB in Python):

# Test large joins -----------------

n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e6), df1)
df2 <- data.frame(id = sample(1:1e6, 10, replace = TRUE))

### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))

duckdb::dbWriteTable(con, "df1", df1)
duckdb::dbWriteTable(con, "df2", df2)

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 1.73    0.03    0.58 

system.time({
  DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 0.19    0.01    0.17 

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 1.62    0.05    0.56 

duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))

### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))

RSQLite::dbWriteTable(conn, "df1", df1)
RSQLite::dbWriteTable(conn, "df2", df2)

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 0.11    1.11    1.25 

system.time({
  DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 0.43    1.05    1.50 

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 0.21    1.00    1.20 

DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))

ablack3 commented 11 months ago

Good to know. Thanks Martijn. Kind of strange, since I'd assume both R and Python are calling the same C++ code.

schuemie commented 4 months ago

I reran the 'benchmark', but made the data a bit larger. If you look at the 'user' category of time, SQLite is still superior, but the 'elapsed' category tells a different story. I also timed the data insertion, and there duckdb is the undisputed winner. I'll look into switching the backend to duckdb.

# Test large joins -----------------
options(andromedaTempFolder = "d:/andromedaTemp")

n_covariates <- 1000
df1 <- as.data.frame(lapply(1:n_covariates, function(n) rbinom(1e6, 1, .2)))
names(df1) <- paste0("covariate", 1:n_covariates)
df1 <- cbind(data.frame(id = 1:1e7), df1)  # cbind recycles the 1e6 covariate rows up to 1e7 rows
df2 <- data.frame(id = sample(1:1e7, 10, replace = TRUE))

### duckdb --------------------------
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))
con <- duckdb::dbConnect(duckdb::duckdb(), file.path(getOption("andromedaTempFolder"), "test.duckdb"))

system.time({
  duckdb::dbWriteTable(con, "df1", df1)
  duckdb::dbWriteTable(con, "df2", df2)
})
# user  system elapsed 
# 246.33   17.24  261.17 

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 14.11    0.22    1.70 

system.time({
  DBI::dbExecute(con, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(con, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 5.05   13.64    2.49 

system.time({
  DBI::dbExecute(con, "drop table if exists df3")
  DBI::dbExecute(con, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 14.48    0.06    1.64 

duckdb::dbDisconnect(con, shutdown = TRUE)
unlink(file.path(getOption("andromedaTempFolder"), "test.duckdb"))

### sqlite ---------------
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
conn <- RSQLite::dbConnect(RSQLite::SQLite(), file.path(getOption("andromedaTempFolder"), "test.sqlite"))

system.time({
  RSQLite::dbWriteTable(conn, "df1", df1)
  RSQLite::dbWriteTable(conn, "df2", df2)
})
# user  system elapsed 
# 467.03   17.60  485.41 

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 1.63    6.39    8.02 

system.time({
  DBI::dbExecute(conn, "create index idx_df1_id on df1(id);")
  DBI::dbExecute(conn, "create index idx_df2_id on df2(id);")
})
# user  system elapsed 
# 3.61    7.00   10.61 

system.time({
  RSQLite::dbExecute(conn, "drop table if exists df3")
  RSQLite::dbExecute(conn, "create table df3 as select * from df1 inner join df2 on df1.id = df2.id")
})
# user  system elapsed 
# 2.04    6.25    8.30 

DBI::dbDisconnect(conn)
unlink(file.path(getOption("andromedaTempFolder"), "test.sqlite"))
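
A footnote on the insertion timings: duckdb can also skip the copy entirely by registering an R data frame as a virtual table via `duckdb::duckdb_register`. The benchmark above did not use this, but it could make ingestion effectively free when the data already lives in R (sketch below; table name is illustrative):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
df <- data.frame(id = 1:5, x = runif(5))

# Registers df as a view over the in-memory R object: no data is copied.
duckdb::duckdb_register(con, "df_view", df)
res <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM df_view")

duckdb::duckdb_unregister(con, "df_view")
dbDisconnect(con, shutdown = TRUE)
```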