Add additional data wrangling methods

AdrianAntico commented 1 year ago

Thank you DuckDB team for keeping this benchmark going!!!

I see there are a lot of variations on group bys and joins, however, I think it would be highly beneficial to incorporate additional data wrangling methods. A few that come to mind, but others should add to this list, includes:

Unions
Subsetting data
Sampling data
Rolling joins (see data.table)
Pivots long and wide
Rolling / windowing operations by groups over time, such as lags and moving averages
Differencing data by groups based on a time column
Updating records in a data frame / table
Categorical encoding methods: target encoding, James-Stein encoding
Column type conversions

I believe a broader set of operations serves a several purposes. For one, I would like to know if a particular framework can actually do the operation. Secondly, I would like to see benchmarks on their performance. Lastly, I think it would a huge community benefit to see what the actual code ends up looking like to get the greatest performance, which isn't always available through documentation or stackoverflow.

Thanks in advance, Adrian

Tmonster commented 1 year ago

Hi Adrian,

Thank you for the suggestions. Similar suggestions have been mentioned before on the original repository https://github.com/h2oai/db-benchmark/issues/175 and also here https://github.com/duckdblabs/db-benchmark/issues/3. Those suggestions mostly cover pivot/unpivot/long-to-wide/wide-to-long.

I am happy to look into adding some of these benchmarks since they are repeatedly suggested, but since I have other obligations at DuckDBLabs, I cannot be confident about when I would eventually get to them. However, if you would like to open a PR and collaborate with previously mentioned issue authors, I will happily review the PR and merge it!

A WIP PR with just SQL queries for pivot/unpivot would be a great place to start I think. Since this repository isn't too cluttered and the idea seems to be popular, I imagine it would invite some collaboration. The SQL queries could then easily translate to all systems that support SQL (duckdb, clickhouse, spark)

AdrianAntico commented 1 year ago

@Tmonster I have a ton of data.table code assembled in various packages. I could put those together rather quickly. I recall putting something together for DuckDB in comment in the previous repo for lags and moving averages. I can also scrounge that up.

Tmonster commented 1 year ago

Hi Adrian,

It looks like the original author Jan Gorecki opened up a PR with rolling functions. Was this your comment from the previous repo?

Feel free to continue on this PR. I can then add another tab next to group by & join to show rolling functions. We can start with just a few queries as well. As interest grows slowly we can add a few more

jangorecki commented 1 year ago

I opened draft PR so we could work out the scope of that task. Each task needs to be carefully designed, so we need separate issue (PR draft even better as it can have code already) for each new task to discuss it's scope. I haven't received any feedback there (yet). I invite any party interested in this task to look into it, comment, or LGTM so we can proceed with design of it.

AdrianAntico commented 1 year ago

@Tmonster Yes, I was interested in this back then and still to this day. This request can make this repo more of a one stop shop for not only benchmarking but also to help others navigate between frameworks more easily and utilizing optimal code (versus what they may happen to find on stackoverflow).

era127 commented 1 year ago

I think adding a rolling or asof join, similar to this asof benchmark test, would be helpful for timeseries work. For duckdb, I believe the performance has significantly improved in this benchmark with the new asof join support.

start <- as.POSIXct("2020-01-01")
end <- as.POSIXct("2021-01-01")
# calendar with record every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 1e3
# table of data
data <- data.frame(date = start + as.difftime(seq(from=10000, to=NROW(calendar), length.out=N)  , units = "mins"), value = seq(0,1,length.out=N))

# xts
calendar.xts <- xts::xts(, order.by = calendar[["date"]])
data.xts <- xts::xts(data[["value"]], order.by = data[["date"]])

# data.table
data.table::setDTthreads(8)
data.dt <- data.table::data.table(data , key = 'date')
calendar.dt = data.table::data.table(calendar, key = 'date')

# duckdb
conn <- DBI::dbConnect(duckdb::duckdb( ))
DBI::dbWriteTable(conn, 'data', data)
DBI::dbWriteTable(conn, 'calendar', calendar)
# duckdb as virtual tables
duckdb::duckdb_register( conn, "data_v", data)
duckdb::duckdb_register( conn, "calendar_v", calendar)

microbenchmark::microbenchmark(
  # xts
  merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left"),
  # data.table
  data.dt[calendar.dt, on = "date", roll = TRUE],
  # basic dplyr
  dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))),
  # duckdb as native table
  DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data USING(date)"),
  # duckdb as virtual table
  DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data_v USING(date)")
)

> Unit: milliseconds
>                                                                                  expr       min        lq      mean    median       uq       max neval
>  merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)),      join = "left")  6.967171  7.381661  9.022994  7.522721 11.87473  15.49173   100
>                                        data.dt[calendar.dt, on = "date", roll = TRUE] 26.923634 27.432936 29.005822 27.736459 29.99972  35.15939   100
>     dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >=      date))) 36.935875 37.654380 50.074952 42.531821 44.32565 183.35196   100
>       DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data USING(date)") 13.386008 15.339514 17.653180 16.566275 19.68496  25.52381   100
>     DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data_v USING(date)") 12.720578 15.155957 18.751557 16.164065 19.65825 171.19706   100

# InMemoryDatasets (juliads)
using Dates, InMemoryDatasets, BenchmarkTools
start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2021-01-01T01:00:00")
d = start_:Dates.Minute(1):end_
# calendar every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
issorted!(calendar, :date)
N = Int(1e3)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))), value=range(0, 1, N))
issorted!(data, :date)
@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)

> BenchmarkTools.Trial: 283 samples with 1 evaluation.
>  Range (min … max):  17.076 ms …  19.199 ms  ┊ GC (min … max): 0.00% … 8.70%
>  Time  (median):     17.450 ms               ┊ GC (median):    0.00%
>  Time  (mean ± σ):   17.719 ms ± 608.662 μs  ┊ GC (mean ± σ):  2.06% ± 3.16%

jangorecki commented 1 year ago

@rdavis120 I think it is a good idea. IMO it make sense to create new issue and discuss scope of "rolling/asof join" task there. This issue is so generic that discussing details won't work well here. Just linking from here will be sufficient.

jangorecki commented 1 year ago

Looking for a feedback about rolling functions inside #9. Whoever is interested in this benchmark task please subscribe to that PR.

duckdblabs / db-benchmark

Add additional data wrangling methods #6