Evaluate DuckDB v0.6.0 experimental parallel CSV data reader and unordered insertion

RandomFractals / chicago-crimes

Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.

GNU Affero General Public License v3.0

38 stars, 4 forks

Evaluate DuckDB v0.6.0 experimental parallel CSV data reader and unordered insertion #26

Closed RandomFractals closed 1 year ago

RandomFractals commented 1 year ago

The goal is to speed up CSV data loading with DuckDB in the chicago-crimes-duckdb.ipynb example notebook created in #4.

See DuckDB v0.6.0 update notes: https://duckdb.org/2022/11/14/announcing-duckdb-060.html

Use SET preserve_insertion_order=false to enable unordered insertion.

Use SET experimental_parallel_csv=true for multi-threaded CSV data loading.
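
A minimal sketch of how the two settings above could be applied from Python before loading a CSV file. The helper name `load_csv_unordered_parallel`, the file name `crimes.csv`, and the table name `crimes` are hypothetical and not taken from the notebook. The setting names are the ones quoted above for DuckDB v0.6.0; other DuckDB versions may not accept the experimental flag.

```python
def load_csv_unordered_parallel(con, csv_path, table_name):
    """Load a CSV file into a DuckDB table with the two v0.6.0 settings enabled.

    `con` is expected to be a DuckDB connection, e.g. from duckdb.connect().
    """
    # Settings quoted in this issue; they target DuckDB v0.6.0.
    con.execute("SET preserve_insertion_order=false")
    con.execute("SET experimental_parallel_csv=true")

    # Escape quotes so the path and table name can be inlined into the SQL text.
    path_literal = csv_path.replace("'", "''")
    table_ident = table_name.replace('"', '""')

    # read_csv_auto infers the CSV dialect and column types.
    con.execute(
        f'CREATE OR REPLACE TABLE "{table_ident}" AS '
        f"SELECT * FROM read_csv_auto('{path_literal}')"
    )
    return con


# Usage (requires the duckdb package; file and table names are hypothetical):
#   import duckdb
#   con = duckdb.connect()
#   load_csv_unordered_parallel(con, "crimes.csv", "crimes")
#   print(con.execute("SELECT count(*) FROM crimes").fetchone())
```

With insertion order no longer preserved, rows in the resulting table may not follow the order of the CSV file, so any query that depends on row order should use an explicit ORDER BY.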

RandomFractals commented 1 year ago

Using those flags brings DuckDB CSV data loading on par with PyArrow and Polars. Without the multi-threaded CSV data reader, loading that data used to take about 16s (#4).

See the updated screenshot and the other sections with data loading timings in the docs:

https://github.com/RandomFractals/chicago-crimes#with-duckdb

[image: chicago-crimes-with-duckdb]