duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/
Other
121 stars 24 forks source link

parameters passed using dbConnect( config = list() ) are ignored #83

Open SimonCoulombe opened 7 months ago

SimonCoulombe commented 7 months ago

Hi, I saw some code (here) that used config=list() to pass parameters to a duckdb connection.
I tried it, but for me, they are ignored as illustrated below where the memory limit is 432GB instead of the requested 1GB.

Passing parameters using dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';") appears to work.

Maybe config= list() is not a thing, just wanted to make sure:

library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"

con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path), config=list("memory_limit"="1GB", "threads" = "1") )
dbGetQuery(con, "select current_setting('memory_limit')")  #                         432.6GB
dbGetQuery(con, "select current_setting('threads')")  # 64
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")
dbGetQuery(con, "select current_setting('memory_limit')")    # 1GB
dbGetQuery(con, "select current_setting('threads')")   # 1

 Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.2 (2020-06-22)
 os       OpenShift Enterprise
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_CA.UTF-8
 ctype    en_CA.UTF-8
 tz       America/Toronto
 date     2024-02-23
 rstudio  2023.06.0+421.pro1 Mountain Hydrangea (server)
 pandoc   NA

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 ! package     * version date (UTC) lib source
 P cli           3.6.2   2023-12-11 [?] RSPM (R 4.0.5)
 P DBI         * 1.2.1   2024-01-12 [?] RSPM (R 4.0.5)
 P dbplyr      * 2.4.0   2023-10-26 [?] RSPM (R 4.0.5)
 P dplyr       * 1.1.4   2023-11-17 [?] RSPM (R 4.0.5)
 P duckdb      * 0.9.2-1 2023-11-28 [?] RSPM (R 4.0.5)
 P fansi         1.0.4   2023-01-22 [?] CRAN (R 4.0.2)
 P generics      0.1.3   2022-07-05 [?] RSPM (R 4.0.5)
 P glue          1.6.2   2022-02-24 [?] RSPM (R 4.0.5)
 P lifecycle     1.0.3   2022-10-07 [?] RSPM (R 4.0.5)
   magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.0.5)
 P pillar        1.9.0   2023-03-22 [?] RSPM (R 4.0.5)
 P pkgconfig     2.0.3   2019-09-22 [?] RSPM (R 4.0.3)
   R6            2.5.1   2021-08-19 [1] RSPM (R 4.0.5)
   renv          1.0.3   2023-09-19 [1] RSPM (R 4.0.2)
 P rlang         1.1.3   2024-01-10 [?] RSPM (R 4.0.5)
 P rstudioapi    0.15.0  2023-07-07 [?] RSPM (R 4.0.5)
 P sessioninfo   1.2.2   2021-12-06 [?] RSPM (R 4.0.5)
 P tibble        3.2.1   2023-03-20 [?] RSPM (R 4.0.5)
 P tidyselect    1.2.0   2022-10-10 [?] RSPM (R 4.0.5)
 P utf8          1.2.3   2023-01-31 [?] RSPM (R 4.0.5)
 P vctrs         0.6.5   2023-12-01 [?] RSPM (R 4.0.5)

[1] xxxxxxx /renv/library/R-4.0/x86_64-pc-linux-gnu
 [2] /opt/R/4.0.2/lib/R/library

 P ── Loaded and on-disk path mismatch.
nbc commented 7 months ago

Parameters must be passed to duckdb OR to dbConnect, not the two :


> library(duckdb)
> 
> con <- dbConnect(duckdb(dbdir = "tpch.db"),
+                  config = list("memory_limit" = "3G")
+                 )
> dbGetQuery(con, "select current_setting('memory_limit')")
  current_setting('memory_limit')
1                          26.6GB
> 
> con <- dbConnect(duckdb(
+   dbdir = "tpch.db",
+   config = list("memory_limit" = "3G")
+   ),
+ )
> dbGetQuery(con, "select current_setting('memory_limit')")
  current_setting('memory_limit')
1                           3.0GB
> 
> con <- dbConnect(duckdb(),
+                  dbdir = "tpch.db",
+                  config = list("memory_limit" = "3G")
+ )
> dbGetQuery(con, "select current_setting('memory_limit')")
  current_setting('memory_limit')
1                           3.0GB
krlmlr commented 7 months ago

Thanks. It's a bit messy right now, the config passed to dbConnect() will only be honored when this function creates a new duckdb server; otherwise it will be silently ignored.

I need to understand the scope at which this configuration applies to be able to propose something better.

krlmlr commented 7 months ago

For now, it's safest to set config in duckdb::duckdb() .

krlmlr commented 7 months ago

@Tmonster: This is a follow-up to https://github.com/duckdb/duckdb-r/pull/73#issuecomment-1963785824 . At the level of the C++ wrapping code, I don't understand the purpose of the DBWrapper struct. Why do both DBWrapper and ConnWrapper exist, is this still necessary today? Or could the logic of rapi_startup() be moved to rapi_connect() ?

CC @hannes.

krlmlr commented 6 months ago

This looks much better in #124, I'll merge it today, binaries will be available on https://duckdb.r-universe.dev/duckdb# soon. Can you confirm?

krlmlr commented 6 months ago

Spoke too soon. This will remain a "feature": config and read_only are only valid when instantiating the database object, which can be during the initial duckdb() call, or during dbConnect() if the dbdir argument is different. This is because config and read_only are defined for the database object, which can host multiple connection objects; each DuckDB file can have at most one database object associated with it.

The cleanest way forward would be to deprecate dbConnect(dbdir, config, read_only) and allow these only in the duckdb() call: #126.

nicki-dese commented 5 months ago

hi @krlmlr - as requested in #56, I ran gc() before reconnecting to the connection with read_only = F, and that fixed the issue.

krlmlr commented 5 months ago

Thanks for the heads-up. I'd consider using gc() to achieve this behavior "off-label use". To truly fix this, please instantiate two driver objects with duckdb() and duckdb(read_only = TRUE), respectively. Now, one of those calls might not work until you call gc(), but it might as well work right away -- not sure.

nicki-dese commented 5 months ago

I'm not sure how likely this scenario will be - unless you decide half way through a session, 'oh - I do need to change that table!' - but it's still nice to know there's a reason for the behaviour and a work around.