duckdblabs / duckplyr

A drop-in replacement for dplyr, powered by DuckDB for performance.
https://duckdblabs.github.io/duckplyr/

difference between `duckplyr` and `dbplyr`? #145

Closed ywhcuhk closed 6 days ago

ywhcuhk commented 2 months ago

I find it quite convenient to use DuckDB as a backend for dplyr (through dbplyr). All you need to do is specify a duckdb connection and read the data through one of DuckDB's table functions; then you can manipulate the result with dplyr, e.g.:

library(duckdb)
library(dplyr)

# An in-memory DuckDB connection via DBI
con <- dbConnect(duckdb())

# A lazy table over a Parquet file; dplyr verbs are translated to SQL
df <- tbl(con, "read_parquet('yourdata.parquet')")

So what can duckplyr do that dbplyr can't?

krlmlr commented 2 months ago

Thanks, good question.

I've started to add the following content to README.Rmd but am not sure where it really belongs. A new FAQ section perhaps?

The goal of duckplyr is to provide a fully compatible drop-in replacement for dplyr that uses DuckDB as a backend for fast operation. DuckDB is an in-process OLAP database management system. Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way. If dplyr and duckplyr give different results on the same input data, this is considered a bug!

The operators, functions, and data types that currently cannot be translated to duckdb are executed in R. The set of supported operations is continuously expanding.
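
For illustration, a minimal sketch of what this drop-in usage looks like (assuming the `as_duckplyr_df()` entry point; `mtcars` is just a placeholder dataset):

library(duckplyr)

# Same dplyr verbs, but DuckDB executes whatever it can translate
out <- mtcars |>
  as_duckplyr_df() |>
  dplyr::filter(cyl == 6) |>
  dplyr::summarise(mean_mpg = mean(mpg))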

krlmlr commented 2 months ago

This high-level blog post is a good intro too: https://posit.co/blog/duckplyr-dplyr-powered-by-duckdb/ .

ywhcuhk commented 2 months ago

Thanks a lot for the clarification. Just to check, when you say:

Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.

Does this mean that when duckplyr runs into a function that is not available in DuckDB, the data will be collected and then processed in RAM? One major advantage of using DuckDB is its larger-than-memory ("out-of-core") capabilities. If the data gets collected during the query, one could run into memory issues.

krlmlr commented 2 months ago

True. You can set the DUCKPLYR_FORCE environment variable to "TRUE" to avoid this, but many operations will be unavailable then.
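
For example (a sketch; with this set, an untranslatable operation should error instead of silently falling back to R):

# Opt out of the data frame fallback for this session
Sys.setenv(DUCKPLYR_FORCE = "TRUE")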

JosiahParry commented 2 months ago

One major advantage of using DuckDB is its larger-than-memory ("out-of-core") capabilities. If the data gets collected during the query, one could run into memory issues.

I came here to seek this specific point of clarification. My understanding (through intuition and reading the docs) is that duckplyr is strictly for the in-memory DuckDB database and does not support out-of-core operations.

Is that safe to say? So, if one is working with larger-than-memory data, they should consider using duckdb("path/to/db-dir") instead. Is that correct?
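
(For concreteness, a sketch of that alternative with a hypothetical path, using duckdb's `dbdir` argument:)

library(duckdb)

# A persistent on-disk database instead of the default :memory: one
con <- dbConnect(duckdb(dbdir = "path/to/db.duckdb"))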

krlmlr commented 2 months ago

Here's how we're connecting to duckdb:

duckplyr:::create_default_duckdb_connection
#> function() {
#>   drv <- duckdb::duckdb()
#>   con <- DBI::dbConnect(drv)
#> 
#>   DBI::dbExecute(con, "set memory_limit='1GB'")
#>   DBI::dbExecute(con, paste0("pragma temp_directory='", tempdir(), "'"))
#> 
#>   duckdb$rapi_load_rfuns(drv@database_ref)
#> 
#>   for (i in seq_along(duckplyr_macros)) {
#>     sql <- paste0('CREATE MACRO "', names(duckplyr_macros)[[i]], '"', duckplyr_macros[[i]])
#>     DBI::dbExecute(con, sql)
#>   }
#> 
#>   con
#> }
#> <bytecode: 0x1135cd480>
#> <environment: namespace:duckplyr>

Created on 2024-05-07 with reprex v2.1.0

The memory is limited, and we set a temporary directory so that DuckDB can spill to disk. We also support processing from and to files with the `duckplyr::*df_from_*()` functions and `duckplyr::df_to_parquet()`.
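
A sketch of that file-to-file path (the file names and the `total` column are placeholders, and `df_to_parquet()` is assumed to take the data frame and an output path):

df <- duckplyr::duckplyr_df_from_file("input.parquet", "read_parquet")

# Results stream to Parquet without materializing as an R data frame
df |>
  dplyr::filter(total > 0) |>
  duckplyr::df_to_parquet("output.parquet")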

What do you mean by "support out-of-core operations"?

JosiahParry commented 2 months ago

Thanks @krlmlr. I'm referring to this part of the DuckDB documentation

(screenshot of the DuckDB documentation on in-memory vs. persistent databases)

My understanding is that, by only being able to use the default :memory: database, the entire database is stored in RAM, which does not allow for larger-than-memory workloads.

As I understand it, one of the main motivating points of DuckDB is this capability.

PMassicotte commented 2 months ago

This is interesting; I am also playing with both packages. I have noticed some performance differences between the two libraries: dbplyr was a bit faster than duckplyr when querying Parquet files on S3.

Did I miss something when exploring duckplyr?

krlmlr commented 2 months ago

Thanks, Philippe, interesting. With current duckplyr, I'm seeing that the filter is not pushed down to Parquet. Could that play a role? What does the plan look like for dbplyr?

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(DBI)

con <- duckplyr:::get_default_duckdb_connection()

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result e60e0 connection=e5160 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(
  con,
  "SET s3_region='auto';SET s3_endpoint='';"
)
#> <duckdb_result e63e0 connection=e5160 statement='SET s3_region='auto';SET s3_endpoint='';'>

out <- duckplyr::duckplyr_df_from_file(
  "s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
  "read_parquet",
  options = list(hive_partitioning = TRUE),
  class = class(tibble())
) |>
  filter(total_amount > 0L) |>
  filter(!is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    avg_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(desc(passenger_count))

out |>
  explain()
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │           #3 ASC          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> │     -(passenger_count)    │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │         median(#1)        │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │   IS NULL) OR isnan(CAST  │
#> │(passenger_count AS DO...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        total_amount       │
#> │      passenger_count      │
#> │         tip_amount        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 88117440       │
#> └───────────────────────────┘

Created on 2024-05-07 with reprex v2.1.0

krlmlr commented 2 months ago

Josiah: duckplyr operates directly on data frames; it never creates persistent tables in duckdb's table store, so the location of the database doesn't play much of a role. The DBI equivalents are perhaps duckdb::duckdb_register() (create a view of the data frame in the database) vs. DBI::dbWriteTable() (write a table to the table store).

You can write the results of a duckplyr operation directly to Parquet with df_to_parquet() without materializing them as a data frame. However, as soon as duckplyr encounters an operation it can't translate, it will fall back to data frames. Set the DUCKPLYR_FORCE environment variable to avoid this.
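
To illustrate the contrast (a minimal sketch using the two functions named above):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# duckdb_register(): registers a view of the R data frame; no copy,
# the data stays in R's memory and DuckDB reads it on demand
duckdb::duckdb_register(con, "mtcars_view", mtcars)

# dbWriteTable(): copies the data into duckdb's table store
DBI::dbWriteTable(con, "mtcars_table", mtcars)

dbDisconnect(con)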

PMassicotte commented 2 months ago

Thank you @krlmlr for looking into this. This is the plan I get:

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

con <- dbConnect(duckdb())

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result a8340 connection=70be0 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(con, "SET s3_region='auto';SET s3_endpoint='';")
#> <duckdb_result 4dbc0 connection=70be0 statement='SET s3_region='auto';SET s3_endpoint='';'>

df <- tbl(
  con,
  "read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')"
)

out <- df |>
  filter(total_amount > 0L) |>
  filter(!is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    avg_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(desc(passenger_count))

explain(out)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> <SQL>
#> SELECT
#>   passenger_count,
#>   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY tip_pct) AS avg_tip_pct,
#>   COUNT(*) AS n
#> FROM (
#>   SELECT q01.*, (100.0 * tip_amount) / total_amount AS tip_pct
#>   FROM (FROM read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')) q01
#>   WHERE (total_amount > 0) AND (NOT((passenger_count IS NULL)))
#> ) q01
#> GROUP BY passenger_count
#> ORDER BY passenger_count DESC
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │  q01.passenger_count DESC │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │     quantile_cont(#1)     │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │  (NOT (passenger_count IS │
#> │           NULL))          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 3524697        │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │ Filters: total_amount>0.0 │
#> │ AND total_amount IS N...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └───────────────────────────┘

dbDisconnect(con)

Created on 2024-05-07 with reprex v2.1.0

I think you are correct about filter() not being pushed down to the parquet files.

krlmlr commented 1 month ago

Thanks, Philippe. Would you like to open a new issue with a reprex about the filter pushdown?

krlmlr commented 2 weeks ago

When documenting, we need to mention that we never generate SQL: #132.