Recommended Benchmarking Arrangement

hkpeaks commented 1 year ago

I would like to thank the DuckDB team for keeping this benchmark going! It can be expensive to run benchmark tests for many software. As a single developer working on my current dataframe project, I encounter problems if I spend too much time learning how to configure other software to do benchmark.

However, I suggest that if each software developer writes their own script, the results can be more fair as different script settings will result in different performance. The software developers must be able to configure their script to get optimized performance.

It is unbelievable that none of the software of the benchmark can complete the JoinTable test.

I use a 67 GB csv fact table of 1 billion rows x 14 columns to join a master of 99696 rows, full join for 3-composit key, returning a 90GB csv file. Peaks can complete the Jointable using only 32GB memory. You can see https://github.com/hkpeaks/peaks-consolidation/tree/main/Polars-PeaksBenchmarking for more information.

You can use the Peaks pre-release to see if it can complete Jointabe on your much better machine. https://github.com/hkpeaks/peaks-consolidation/releases . I want to know the results as I do not have a Google Cloud account. I am exploring which cloud service can accept PayPal for prepayment. Using credit card for online payment is very risky for me.

I suggest that your coming benchmark can be tested on the following categories by selecting all records except filter test:

Select all records for Sorting test for csv and parquet (cover 3+ sorting columns)
Select all records for Groupby test for csv and parquet (cover 3+ GroupBy columns, 3+ Aggregate Column)
Select all records for Distinct test for csv and parquet (cover 3+ Distinct column)
Select all records for JoinTable test for csv and parquet (cover 3+ Composit Key)
Different compare operators filter/unmatch filter test for csv and parquet (cover at least filtered 50%+ volume from input data)

This way it can be more easily understood by business users.

jangorecki commented 1 year ago

Fact to dimensions joins are pretty different degree of complexity than big to big join, where join column (even after distinct) does not fit into memory, and algorithm has to do merge in batches and then reduce the batches to single results. I am not sure if join task stresses that well enough, but that's the aim of q5. Moreover, users dont query csv data, they load them once and then query loaded data multiple times, so having tasks designed for read csv and single query are rather uncommon.

hkpeaks commented 1 year ago

For a batch of query job, obviously users will configure to output reports as much as possible, it is not a problem. Normally they will request a web interface rather then Windows/Linux CLI. e.g. https://youtu.be/6hwbQmTXzMc

e.g. for a retail business group, there are a massive volumn of invoices, fact table contains item code, quantity..., fact table does not have unit price and exchange rate. When an accountant need to prepare ad-hoc reports based on specific condition, e.g. Select transactions which represent 80% of sales revenues, so the process must included joining fact table with unit price and exchange rate in order to get total amount for each invoice. I had worked for accountancy over 2 decades, so I understand the painfulness of accountants to deal with data. Every working day are doing routine and ad-hoc reports. Now I am working as a programmer, so I have solved the performance issues of full join no matter it is running in <=32GB desktop PC for billions of rows.

Tmonster commented 1 year ago

Hi @hkpeaks, Thank you for the suggestions.

If you have any questions about the scripts to run the benchmark, feel free to ask them with reproducible steps. As far as I know, the script in _utils/repro.sh should install the necessary tools to run this benchmark. You will, however, need to install each solution yourself following the {solution}/install.sh scripts provided.

As for every developer writing their own script to get the best performance, developers are encouraged to modify their groupby-{solution} and join-{solution} scripts to get the best performance.

For the Join Table test, yes it is interesting, but as Jan said big to big joins can be complex and stress the memory limits of a system, and it seems like every solution has trouble joining to 50GB files.

If you would like to open a PR to add the peaks solution to the repro, I would be happy to review it. You can use the work done in https://github.com/duckdblabs/db-benchmark/pull/12 and https://github.com/duckdblabs/db-benchmark/pull/11 as a template

hkpeaks commented 1 year ago

H2O script is very complex for me and users so I prefer to write simple script for benchmarking. Coming benchmarking I plan to compare 3,000 files (total 300 Million Rows) for software Polars, DuckDB, Peaks, R Data.Table. But I concern whehter the script I write for third-parties software is a optimized script for performance. So you can help to review the below DuckDB script whether it is optimal.

Visitors of this page can recommend more software to be included in the benchmarking given that they provide relevant scripts.

Test data: https://github.com/hkpeaks/peaks-consolidation/blob/main/Input/1.csv

My current test for DuckDB is only 22 second.
Spark (fail to run in my Windows 11 propertly, I suspect the JAVA VM conflict with Windows 11, and it is not resonable the app requires to open my Port - very high risk to attract hacker). So I remove Spark for comparison. And try to request experience Databrick user test it --- how many node it can be faster than number one of the benchmark?
Polars is planning bug fix. https://github.com/pola-rs/polars/issues/9201 Noticed it is fixed https://github.com/pola-rs/polars/pull/9251
Seem Data.Table does not support larger than memory becuase my current test trigger out of memory "Error: cannot allocate vector of size 1.1 Gb Execution halted"
Considering to include Tableau Hyper. But concern it does not supports script, must use GUI to click each function.
Pandas is not support streaming https://github.com/pandas-dev/pandas/issues/53249
My Peaks is extending its streaming engine to cover many files scenario (currently support single billion-row file), and also need to add new composit function Filter2GroupBy, it can save substaintial memory and CPU resources.

**** DuckDB ****

import duckdb

import time

s = time.time()

con = duckdb.connect()

con.execute("""copy (SELECT Ledger, Account, DC, Currency, SUM(Base_Amount) as Total_Base_Amount FROM read_csv_auto('input/3000Files/*.csv') WHERE Ledger>='L30' AND Ledger <='L70' GROUP BY Ledger, Account, DC, Currency) to 'output/DuckFilterGroupByCSV.csv' (format csv, header true);""")

e = time.time()

print("DuckDB FilterGroupBy CSV Time = {}".format(round(e-s,3)))

**** Polars ****

import polars as pl import time import pathlib s = time.time()

table1 = ( pl.scan_csv("Input/3000Files/*.csv") .filter((pl.col('Ledger') >= "L30") & (pl.col('Ledger') <= "L70"))
.groupby(by=["Ledger", "Account", "DC","Currency"]) .agg([
pl.sum('Base_Amount').alias('Total_Base_Amount'),
]))

path = "Output/PolarsFilterGroupByCSV.csv" table1.lazy().collect(streaming=True).write_csv(path)

e = time.time() print("Polars FilterGroupBy CSV Time = {}".format(round(e-s,3)))

**** Data.Table ****

library(data.table) library(readr)

s <- Sys.time()

setDTthreads(10)

temp <- list.files ("D:/Benchmark/Input/3000Files", full.names = TRUE, pattern = "\.csv$") DT <- rbindlist(lapply(temp, fread), fill = TRUE) DT <- DT[Ledger >= 'L30' & Ledger <= 'L70', .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)] fwrite(DT, 'output/DataTableGroupByCSV.csv')

e <- Sys.time() print(paste0("R-Data.Table FilterGroupBy CSV Time = ", round(e-s,3)))

**** Peaks Consolidation ****

Select{D:/Benchmark/Input/3000Files/*.csv | Ledger(L30..L70)} GroupBy{Ledger, Account, DC,Currency => Sum(Base_Amount) ~ PeaksResult.csv}

In fact I don't understand why many open source software requires users to fill-in excessive settings to do simple task. My designed script is only request users to fill-in essential information. Peaks may not able to win number one from this benchmark, but its script is simplex one. DuckDB recorded 22 seconds for 3000 files with total 300 million rows is very challenging for me.

Tmonster commented 1 year ago

DuckDB recorded 22 seconds for 3000 files with total 300 million rows is very challenging for me.

I'm not sure what you want me to do here. I can say that the duckdb code looks fine, and I am happy to hear that it finishes in a timely manner.

The scope of issues here is db-benchmark issues/feature requests/additional solutions etc. It seems like you would like to add peaks to the benchmark, and potentially add more benchmarking categories.

I don't plan on writing the infrastructure to add peaks to the benchmark like the project is not popular enough yet for me to invest the time and effort. For the additional benchmarks, similar to what Jan Gorecki said, the requested benchmark questions aren't common workloads, so I won't be investing the time to add them to the benchmark.

If there's anything else I can help you with that is an issue or feature request, let me know, otherwise I will close this issue as not planned

hkpeaks commented 12 months ago

I have tested the latest version of Polars 0.19.1 which can process 10 billion-row jointable with file size of 231GB (Output 389GB). Please consider using sink_csv function for your next benchmarking exercise.

import polars as pl from time import time from datetime import datetime

start_time = datetime.now()

master = pl.scan_csv("Inbox/Master.csv")

fact_table = pl.scan_csv("Inbox/10000M-Fact.csv")

result = fact_table.join(master, on=["Product","Style"], how="inner").with_columns(( pl.col("Quantity") * pl.col("Unit_Price")).alias("Amount"))

result.sink_csv("Outbox/PolarsJoinResult.csv")

elapsed = datetime.now() - start_time print(f"\nPolars InnerJoin Duration (in second): { elapsed.total_seconds():.3f}")

duckdblabs / db-benchmark

Recommended Benchmarking Arrangement #13