duckdb / duckdb-web

DuckDB website and documentation
https://duckdb.org

Load parquet files into table and then query vs. query from files directly - best practice? #411

Closed davidsteinar closed 9 months ago

davidsteinar commented 1 year ago

Hi, I was benchmarking DuckDB on a 3.5M-row dataset stored as 100 Parquet files totalling 4 GB.

My first instinct was to load the data into DuckDB and then run the query:

conn = duckdb.connect("citibike.duckdb")
conn.execute("CREATE TABLE citibike AS SELECT * FROM read_parquet('citibikedata/*.parquet')"
start = time.time()
query = """
            SELECT
                station_id
                , AVG(num_bikes_available)
            FROM citibike
            GROUP BY 1
            """
conn.execute(query)
res = conn.fetchall()
end = time.time()
print(end-start)
conn.close()

The resulting time is 4.52 seconds.

When doing the same operation against an in-memory database (":memory:"):

conn = duckdb.connect(":memory:")
start = time.time()
query = """
                SELECT
                    station_id
                    , AVG(num_bikes_available)
                FROM read_parquet('citibikedata/*.parquet')
                GROUP BY 1
                """
conn.execute(query)
res = conn.fetchall()
stop = time.time()
print(stop - start)
conn.close()

The resulting time is slightly higher: 5.83 seconds.

What is the recommended approach? Should I always use FROM read_parquet('citibikedata/*.parquet') in every query, accepting slightly slower queries in exchange for not essentially doubling the storage cost? The documentation does not mention this.

szarnyasg commented 9 months ago

Hello, I guess this answer is a year late so it's not that relevant anymore... that said:

In general, if you 1) want to do interactive analytics with lots of iterations, 2) have enough storage, and 3) can wait for the full data set to load (which doesn't take very long, but still),

then you should load the data into the database first. You'll have better statistics, so the optimizer can produce better plans.
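For example, here is a minimal sketch of the load-once-then-query workflow, reusing the file and column names from the snippet in the question:

import duckdb

# One-time load: materialize the Parquet data as a table in a persistent database file
conn = duckdb.connect("citibike.duckdb")
conn.execute("CREATE TABLE citibike AS SELECT * FROM read_parquet('citibikedata/*.parquet')")
conn.close()

# Later sessions reopen the database and query the stored table
# (and its statistics) without re-reading the Parquet files
conn = duckdb.connect("citibike.duckdb")
res = conn.execute("""
    SELECT station_id, AVG(num_bikes_available)
    FROM citibike
    GROUP BY 1
""").fetchall()
conn.close()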

If you have 1) limited space, or 2) queries that only target a few columns of the data set,

then run the queries on Parquet directly. Parquet files still carry some statistics, so your query plans will not be too bad, and reading from Parquet is fast.
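As a sketch of that approach, here is the same query run directly against the Parquet files; the file and column names are taken from the question, and the WHERE clause is added purely for illustration:

import duckdb

# Query the Parquet files in place: only the referenced columns are read,
# and the files' row-group statistics can help skip data for the filter.
conn = duckdb.connect(":memory:")
res = conn.execute("""
    SELECT station_id, AVG(num_bikes_available)
    FROM read_parquet('citibikedata/*.parquet')
    WHERE num_bikes_available > 0  -- illustrative filter, not from the original post
    GROUP BY 1
""").fetchall()
conn.close()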