Closed ywhcuhk closed 6 days ago
Thanks, good question.
I've started to add the following content to README.Rmd
but am not sure where it really belongs. A new FAQ section perhaps?
The goal of duckplyr is to provide a fully compatible drop-in replacement for dplyr that uses DuckDB as a backend for fast operation. DuckDB is an in-process OLAP database management system. Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way. If dplyr and duckplyr give different results on the same input data, this is considered a bug!
The operators, functions, and data types that currently cannot be translated to duckdb are executed in R. The set of supported operations is continuously expanding.
This high-level blog post is a good intro too: https://posit.co/blog/duckplyr-dplyr-powered-by-duckdb/ .
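For context, a minimal sketch of the drop-in usage (function names as of the duckplyr release current when this thread was written; check the package docs for your version):

```r
library(dplyr)
library(duckplyr)

# Opt in for a single data frame: subsequent dplyr verbs are
# executed by DuckDB where possible, with identical results.
res <- mtcars |>
  as_duckplyr_df() |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg))

# Alternatively, route all dplyr verbs through duckplyr for the
# whole session:
# methods_overwrite()
```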
Thanks a lot for the clarification. Just to check, when you say:
Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.
Does this mean that when duckplyr runs into a function that is not available in duckdb, the data will be collected and then processed in RAM? One major advantage of using duckdb is its "out-of-memory" capability. If the data gets collected during the query, one could run into memory issues.
True. You can set the DUCKPLYR_FORCE environment variable to "TRUE" to avoid this, but many operations will be unavailable then.
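Concretely (a sketch; the environment variable is consulted when duckplyr decides whether to fall back to R):

```r
# Disable the fallback: operations that cannot be translated to
# DuckDB will now error instead of materializing the data in R.
Sys.setenv(DUCKPLYR_FORCE = "TRUE")

# ... run the duckplyr pipeline here ...

# Restore the default behavior (silent fallback to dplyr):
Sys.unsetenv("DUCKPLYR_FORCE")
```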
One major advantage of using duckdb is the "out-of-memory" capabilities. If the data gets collected during the query, then one could run into the memory issue.
I came here to seek this specific point of clarification. My understanding (through intuition and reading the docs) is that duckplyr works strictly with an in-memory DuckDB database and does not support out-of-core operations.
Is that safe to say? So, if one is working with larger-than-memory data, they should consider using duckdb("path/to/db-dir") instead. Is that correct?
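For comparison, a sketch of that on-disk route via DBI and dbplyr (the paths and column name are placeholders):

```r
library(DBI)
library(dplyr)

# A file-backed database lets DuckDB spill larger-than-memory
# intermediate results to disk instead of keeping them in RAM.
con <- dbConnect(duckdb::duckdb(dbdir = "path/to/db.duckdb"))

tbl(con, "read_parquet('data/*.parquet')") |>
  count(some_column) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```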
Here's how we're connecting to duckdb:
duckplyr:::create_default_duckdb_connection
#> function() {
#> drv <- duckdb::duckdb()
#> con <- DBI::dbConnect(drv)
#>
#> DBI::dbExecute(con, "set memory_limit='1GB'")
#> DBI::dbExecute(con, paste0("pragma temp_directory='", tempdir(), "'"))
#>
#> duckdb$rapi_load_rfuns(drv@database_ref)
#>
#> for (i in seq_along(duckplyr_macros)) {
#> sql <- paste0('CREATE MACRO "', names(duckplyr_macros)[[i]], '"', duckplyr_macros[[i]])
#> DBI::dbExecute(con, sql)
#> }
#>
#> con
#> }
#> <bytecode: 0x1135cd480>
#> <environment: namespace:duckplyr>
Created on 2024-05-07 with reprex v2.1.0
The memory is limited, and we enable a temporary directory. We also support processing from and to files with duckplyr::*df_from_*() and duckplyr::df_to_parquet().
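For example (a sketch; function names follow the duckplyr release current at the time, and the Parquet paths and the amount column are placeholders):

```r
library(dplyr)
library(duckplyr)

# Read lazily from Parquet, transform, and write the result back
# to Parquet without materializing it as an R data frame.
duckplyr_df_from_parquet("input/*.parquet") |>
  filter(amount > 0) |>
  df_to_parquet("output.parquet")
```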
What do you mean by "support out-of-core operations"?
Thanks @krlmlr. I'm referring to this part of the DuckDB documentation.
My understanding is that by only being able to use the default :memory: driver, the entire database is stored in RAM, which does not allow for larger-than-memory workloads.
As I understand it, one of the main motivating points of DuckDB is this capability.
This is interesting; I am also playing with both packages. I have noticed some performance differences between the two libraries: I found that dbplyr was a bit faster than duckplyr when querying parquet files on S3.
Did I miss something when exploring duckplyr?
Thanks, Philippe, interesting. With current duckplyr, I'm seeing that the filter is not pushed down to Parquet. Could that play a role? What does the plan look like for dbplyr?
options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(DBI)
con <- duckplyr:::get_default_duckdb_connection()
dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result e60e0 connection=e5160 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(
con,
"SET s3_region='auto';SET s3_endpoint='';"
)
#> <duckdb_result e63e0 connection=e5160 statement='SET s3_region='auto';SET s3_endpoint='';'>
out <- duckplyr::duckplyr_df_from_file(
"s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
"read_parquet",
options = list(hive_partitioning = TRUE),
class = class(tibble())
) |>
filter(total_amount > 0L) |>
filter(!is.na(passenger_count)) |>
mutate(tip_pct = 100 * tip_amount / total_amount) |>
summarise(
avg_tip_pct = median(tip_pct),
n = n(),
.by = passenger_count
) |>
arrange(desc(passenger_count))
out |>
explain()
#> ┌───────────────────────────┐
#> │ ORDER_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ ORDERS: │
#> │ #3 ASC │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ avg_tip_pct │
#> │ n │
#> │ -(passenger_count) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ avg_tip_pct │
#> │ n │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ HASH_GROUP_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ #0 │
#> │ median(#1) │
#> │ count_star() │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_pct │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_pct │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_amount │
#> │ total_amount │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │ IS NULL) OR isnan(CAST │
#> │(passenger_count AS DO... │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ EC: 17623488 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ READ_PARQUET │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ total_amount │
#> │ passenger_count │
#> │ tip_amount │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ EC: 88117440 │
#> └───────────────────────────┘
Created on 2024-05-07 with reprex v2.1.0
Josiah: duckplyr operates directly on data frames; it never creates persistent tables in duckdb's table store, so the location of the database doesn't play much of a role. The DBI equivalents are perhaps duckdb::duckdb_register() (create a view of the data frame in the database) vs. DBI::dbWriteTable() (write a table to the table store). You can write the results of a duckplyr operation directly to Parquet with df_to_parquet() without materializing them as a data frame. However, as soon as duckplyr encounters an operation it can't translate, it will fall back to data frames. Set the DUCKPLYR_FORCE environment variable to avoid this.
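The contrast between the two DBI-level routes can be sketched as:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# duckdb_register(): creates a view; DuckDB reads the R data
# frame in place, nothing is copied into the table store.
duckdb::duckdb_register(con, "mtcars_view", mtcars)

# dbWriteTable(): copies the data into DuckDB's table store.
dbWriteTable(con, "mtcars_tbl", mtcars)

dbGetQuery(con, "SELECT count(*) AS n FROM mtcars_view")

dbDisconnect(con, shutdown = TRUE)
```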
Thank you @krlmlr for looking. This is what I have as plan:
library(duckdb)
#> Loading required package: DBI
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
con <- dbConnect(duckdb())
dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result a8340 connection=70be0 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(con, "SET s3_region='auto';SET s3_endpoint='';")
#> <duckdb_result 4dbc0 connection=70be0 statement='SET s3_region='auto';SET s3_endpoint='';'>
df <- tbl(
con,
"read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')"
)
out <- df |>
filter(total_amount > 0L) |>
filter(!is.na(passenger_count)) |>
mutate(tip_pct = 100 * tip_amount / total_amount) |>
summarise(
avg_tip_pct = median(tip_pct),
n = n(),
.by = passenger_count
) |>
arrange(desc(passenger_count))
explain(out)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> <SQL>
#> SELECT
#> passenger_count,
#> PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY tip_pct) AS avg_tip_pct,
#> COUNT(*) AS n
#> FROM (
#> SELECT q01.*, (100.0 * tip_amount) / total_amount AS tip_pct
#> FROM (FROM read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')) q01
#> WHERE (total_amount > 0) AND (NOT((passenger_count IS NULL)))
#> ) q01
#> GROUP BY passenger_count
#> ORDER BY passenger_count DESC
#>
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │ ORDER_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ ORDERS: │
#> │ q01.passenger_count DESC │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ HASH_GROUP_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ #0 │
#> │ quantile_cont(#1) │
#> │ count_star() │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_pct │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_pct │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ (NOT (passenger_count IS │
#> │ NULL)) │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ EC: 3524697 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ READ_PARQUET │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_amount │
#> │ total_amount │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Filters: total_amount>0.0 │
#> │ AND total_amount IS N... │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ EC: 17623488 │
#> └───────────────────────────┘
dbDisconnect(con)
Created on 2024-05-07 with reprex v2.1.0
I think you are correct about filter() not being pushed down to the parquet files.
Thanks, Philippe. Would you like to open a new issue with a reprex about the filter pushdown?
When documenting this, we need to mention that we never generate SQL: #132.
I find it quite convenient to use duckdb as a backend for dplyr (through dbplyr). All you need to do is specify a duckdb connection and read the data through duckdb's functions; then you can manipulate the data frame using dplyr. So what can duckplyr do that dbplyr can't?