coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
34 stars 17 forks source link

Use the duckdb native storage format for tpch? #1194

Open Tmonster opened 1 year ago

Tmonster commented 1 year ago

Currently every solution seems to execute the queries on parquet files. In the benchmark video, it is noted that duckdb runs on a different instance type to match the number of cores a distributed system like Dask can use with multiple machines. In these two scenarios the same data format/storage is used, which is partitioned tpch parquet files. While duckdb has a good parquet file reader, our FileSystem manager isn't great at handling large batches of parquet files and splitting the work between them.

If the duckdb native storage format is used, I imagine Duckdb will have improved performance for the following two reasons.

  1. DuckDB can read its own native storage format much faster
  2. DuckDB can parallelize queries much better when executing queries on its own storage

Duckdb has a benchmark runner for tpch, which we run on parquet files and on our own storage format. Here are timing results on my MacBook M1 with 16GB ram in a noisy environment. In this scenario we are just using one parquet file per table.

using duckdb native storage

benchmark/tpch/sf1/q01.benchmark    1   0.036090
benchmark/tpch/sf1/q01.benchmark    2   0.034834
benchmark/tpch/sf1/q01.benchmark    3   0.036427
benchmark/tpch/sf1/q01.benchmark    4   0.035540
benchmark/tpch/sf1/q01.benchmark    5   0.035005
benchmark/tpch/sf1/q02.benchmark    1   0.011059
benchmark/tpch/sf1/q02.benchmark    2   0.019951
benchmark/tpch/sf1/q02.benchmark    3   0.016276
benchmark/tpch/sf1/q02.benchmark    4   0.009785
benchmark/tpch/sf1/q02.benchmark    5   0.013669
benchmark/tpch/sf1/q03.benchmark    1   0.019149
benchmark/tpch/sf1/q03.benchmark    2   0.020070
benchmark/tpch/sf1/q03.benchmark    3   0.021505
benchmark/tpch/sf1/q03.benchmark    4   0.022160
benchmark/tpch/sf1/q03.benchmark    5   0.021363
benchmark/tpch/sf1/q04.benchmark    1   0.040874
benchmark/tpch/sf1/q04.benchmark    2   0.042987
benchmark/tpch/sf1/q04.benchmark    3   0.044811
benchmark/tpch/sf1/q04.benchmark    4   0.048215
benchmark/tpch/sf1/q04.benchmark    5   0.041317
benchmark/tpch/sf1/q05.benchmark    1   0.023181
benchmark/tpch/sf1/q05.benchmark    2   0.020773
benchmark/tpch/sf1/q05.benchmark    3   0.021888
benchmark/tpch/sf1/q05.benchmark    4   0.028327
benchmark/tpch/sf1/q05.benchmark    5   0.022201
benchmark/tpch/sf1/q06.benchmark    1   0.007796
benchmark/tpch/sf1/q06.benchmark    2   0.008448
benchmark/tpch/sf1/q06.benchmark    3   0.009078
benchmark/tpch/sf1/q06.benchmark    4   0.010069
benchmark/tpch/sf1/q06.benchmark    5   0.009147
benchmark/tpch/sf1/q07.benchmark    1   0.059696
benchmark/tpch/sf1/q07.benchmark    2   0.061615
benchmark/tpch/sf1/q07.benchmark    3   0.055795
benchmark/tpch/sf1/q07.benchmark    4   0.055994
benchmark/tpch/sf1/q07.benchmark    5   0.055798

reading from parquet files

benchmark/tpch/sf1-parquet/q01.benchmark    1   0.111159
benchmark/tpch/sf1-parquet/q01.benchmark    2   0.105748
benchmark/tpch/sf1-parquet/q01.benchmark    3   0.107338
benchmark/tpch/sf1-parquet/q01.benchmark    4   0.109066
benchmark/tpch/sf1-parquet/q01.benchmark    5   0.112037
benchmark/tpch/sf1-parquet/q02.benchmark    1   0.024903
benchmark/tpch/sf1-parquet/q02.benchmark    2   0.026837
benchmark/tpch/sf1-parquet/q02.benchmark    3   0.026377
benchmark/tpch/sf1-parquet/q02.benchmark    4   0.026244
benchmark/tpch/sf1-parquet/q02.benchmark    5   0.019998
benchmark/tpch/sf1-parquet/q03.benchmark    1   0.075689
benchmark/tpch/sf1-parquet/q03.benchmark    2   0.071741
benchmark/tpch/sf1-parquet/q03.benchmark    3   0.076272
benchmark/tpch/sf1-parquet/q03.benchmark    4   0.076845
benchmark/tpch/sf1-parquet/q03.benchmark    5   0.080394
benchmark/tpch/sf1-parquet/q04.benchmark    1   0.075559
benchmark/tpch/sf1-parquet/q04.benchmark    2   0.074212
benchmark/tpch/sf1-parquet/q04.benchmark    3   0.076235
benchmark/tpch/sf1-parquet/q04.benchmark    4   0.085448
benchmark/tpch/sf1-parquet/q04.benchmark    5   0.080268
benchmark/tpch/sf1-parquet/q05.benchmark    1   0.067151
benchmark/tpch/sf1-parquet/q05.benchmark    2   0.068035
benchmark/tpch/sf1-parquet/q05.benchmark    3   0.072970
benchmark/tpch/sf1-parquet/q05.benchmark    4   0.071935
benchmark/tpch/sf1-parquet/q05.benchmark    5   0.070636
benchmark/tpch/sf1-parquet/q06.benchmark    1   0.069849
benchmark/tpch/sf1-parquet/q06.benchmark    2   0.071157
benchmark/tpch/sf1-parquet/q06.benchmark    3   0.068990
benchmark/tpch/sf1-parquet/q06.benchmark    4   0.070346
benchmark/tpch/sf1-parquet/q06.benchmark    5   0.067407
benchmark/tpch/sf1-parquet/q07.benchmark    1   0.114293
benchmark/tpch/sf1-parquet/q07.benchmark    2   0.110458
benchmark/tpch/sf1-parquet/q07.benchmark    3   0.110913
benchmark/tpch/sf1-parquet/q07.benchmark    4   0.109011
benchmark/tpch/sf1-parquet/q07.benchmark    5   0.107592

If it's not a problem with you, could I change the data generation script to generate a duckdb database with data along with the parquet files? I will also modify the test-duckdb.py script to automatically attach to the duckdb database for every query.

mrocklin commented 1 year ago

Yeah, I feel your pain here. After doing these benchmarks we found that we were often spending ~70% of our time reading parquet while on cloud with Dask. Our parquet reader was far slower than we expected. However, rather than switch to some more Python-bespoke-file-type, like Pickle files, we're choosing to focus our efforts on improving our parquet reading. I guess we could switch to pickle files, but we never see users use this in practice (everyone seems to use Parquet today) so this change would only be to make us look better, and not to help improve user experience.

Do most DuckDB users use the native storage format, or do most DuckDB users use Parquet? My guess is the latter, and if so I'm inclined to treat Parquet as just part of the benchmark. Otherwise this seems like over-optimization.

What are your thoughts?