Closed davidsteinar closed 9 months ago
Hello, I guess this answer is a year late so it's not that relevant anymore... that said:
In general, if you 1) want to do interactive analytics with lots of iterations, 2) have enough storage, and 3) can wait for the full data set to load (which doesn't take very long, but still),
you should load the data into the database first. You'll get better statistics.
If you have 1) limited space or 2) queries that only touch a few columns of the data set,
then run the queries on Parquet directly. Parquet files still carry some statistics, so your query plans won't be too bad, and reading from Parquet is fast.
Hi, I was benchmarking DuckDB on a 3.5M-row dataset stored as 100 Parquet files totalling 4 GB.
My first instinct was to load the data into DuckDB and then run the query:
the resulting time is 4.52 seconds.
When running the same query directly against the Parquet files from a ":memory:" database,
the resulting time is slightly higher, 5.83 seconds.
What is the recommended approach? Should I always run
FROM read_parquet('citibikedata/*.parquet')
for every query, accepting slightly slower queries in exchange for not essentially doubling storage costs? The documentation does not mention this.