Tmonster opened 1 year ago
Yeah, I feel your pain here. After running these benchmarks we found that we were often spending ~70% of our time reading parquet on the cloud with Dask; our parquet reader was far slower than we expected. However, rather than switch to a more Python-specific file type like pickle, we're choosing to focus our efforts on improving our parquet reading. We could switch to pickle files, but we never see users do this in practice (everyone seems to use Parquet today), so that change would only make us look better rather than improve the user experience.
Do most DuckDB users use the native storage format, or do most use Parquet? My guess is the latter, and if so I'm inclined to treat Parquet as just part of the benchmark. Otherwise this seems like over-optimization.
What are your thoughts?
Currently every solution seems to execute the queries on parquet files. In the benchmark video, it is noted that DuckDB runs on a different instance type so that it matches the number of cores a distributed system like Dask can use across multiple machines. In both scenarios the same data format/storage is used: partitioned TPC-H parquet files. While DuckDB has a good parquet file reader, our FileSystem manager isn't great at handling large batches of parquet files and splitting the work between them.
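For context, this is roughly what the parquet-based setup looks like: a minimal sketch, assuming a partitioned TPC-H dataset with one directory of parquet files per table (the paths here are hypothetical, not the benchmark's actual layout):

```python
import duckdb

# Hypothetical layout: one directory of partitioned parquet files per TPC-H table,
# e.g. ./tpch-parquet/lineitem/*.parquet, ./tpch-parquet/orders/*.parquet, ...
con = duckdb.connect()

# Expose each parquet directory as a view so the TPC-H queries can refer to the
# tables by name; every query then goes through the parquet reader and the
# file system manager, which has to fan out over many files.
for table in ["lineitem", "orders", "customer", "part",
              "partsupp", "supplier", "nation", "region"]:
    con.execute(
        f"CREATE OR REPLACE VIEW {table} AS "
        f"SELECT * FROM read_parquet('tpch-parquet/{table}/*.parquet')"
    )

# Sanity check: an aggregation running directly over the parquet files.
print(con.execute("SELECT count(*) FROM lineitem").fetchall())
```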
If the DuckDB native storage format is used, I imagine DuckDB will have improved performance for the following two reasons.
DuckDB has a benchmark runner for TPC-H, which we run both on parquet files and on our own storage format. Here are timing results on my MacBook M1 with 16 GB RAM in a noisy environment. In this scenario we are using just one parquet file per table.
Using DuckDB native storage:
Reading from parquet files:
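For reference, here is a minimal sketch of how such a side-by-side comparison could be timed locally. It assumes the data already exists both as a tpch.duckdb database file and as one parquet file per table, and it pulls the query text from DuckDB's tpch extension; the file names are assumptions, not the benchmark runner's actual setup:

```python
import time
import duckdb

def run_query(con, sql):
    # Time a single query end to end, materializing the result.
    start = time.perf_counter()
    con.execute(sql).fetchall()
    return time.perf_counter() - start

# Native storage: the tables live inside the DuckDB database file.
native = duckdb.connect("tpch.duckdb", read_only=True)

# Parquet: the same tables exposed as views over one parquet file per table.
parquet = duckdb.connect()
for table in ["lineitem", "orders", "customer", "part",
              "partsupp", "supplier", "nation", "region"]:
    parquet.execute(
        f"CREATE VIEW {table} AS "
        f"SELECT * FROM read_parquet('tpch-parquet/{table}.parquet')"
    )

# Fetch the official TPC-H query text from DuckDB's tpch extension.
parquet.execute("INSTALL tpch")
parquet.execute("LOAD tpch")
queries = parquet.execute("SELECT query_nr, query FROM tpch_queries()").fetchall()

for nr, sql in queries:
    print(f"Q{nr:02d}  native: {run_query(native, sql):6.2f}s"
          f"   parquet: {run_query(parquet, sql):6.2f}s")
```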
If it's not a problem for you, could I change the data generation script to generate a DuckDB database alongside the parquet files? I will also modify the test-duckdb.py script to automatically attach to the DuckDB database for every query.
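As a rough sketch of what I have in mind (function names and paths here are illustrative, not the actual contents of the scripts): the generation step would write a .duckdb file next to the parquet output, and the query side would ATTACH it so the TPC-H tables resolve to native storage:

```python
import duckdb

TABLES = ["lineitem", "orders", "customer", "part",
          "partsupp", "supplier", "nation", "region"]

def generate_duckdb_database(parquet_dir: str, db_path: str = "tpch.duckdb") -> None:
    # Data generation side: copy the already-generated parquet data into
    # DuckDB's native storage format so both layouts are available.
    con = duckdb.connect(db_path)
    for table in TABLES:
        con.execute(
            f"CREATE OR REPLACE TABLE {table} AS "
            f"SELECT * FROM read_parquet('{parquet_dir}/{table}.parquet')"
        )
    con.close()

def run_query_against_native(db_path: str, sql: str):
    # Query side (test-duckdb.py): attach the generated database for every
    # query so the tables come from native storage instead of parquet.
    con = duckdb.connect()
    con.execute(f"ATTACH '{db_path}' AS tpch (READ_ONLY)")
    con.execute("USE tpch")
    return con.execute(sql).fetchall()
```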