duckdb / duckdb-web

DuckDB website and documentation
https://duckdb.org

Load parquet files into table and then query vs. query from files directly - best practice? #411

Closed davidsteinar closed 9 months ago

davidsteinar commented 1 year ago

Hi, I was benchmarking DuckDB on a 3.5M-row dataset stored as 100 Parquet files totalling 4 GB.

My first instinct was to load the data into DuckDB and then run the query:

conn = duckdb.connect("citibike.duckdb")
conn.execute("CREATE TABLE citibike AS SELECT * FROM read_parquet('citibikedata/*.parquet')"
start = time.time()
query = """
            SELECT
                station_id
                , AVG(num_bikes_available)
            FROM citibike
            GROUP BY 1
            """
conn.execute(query)
res = conn.fetchall()
end = time.time()
print(end-start)
conn.close()

The resulting time is 4.52 seconds.

When doing the same operation against an in-memory database (":memory:"):

conn = duckdb.connect(":memory:")
start = time.time()
query = """
                SELECT
                    station_id
                    , AVG(num_bikes_available)
                FROM read_parquet('citibikedata/*.parquet')
                GROUP BY 1
                """
conn.execute(query)
res = conn.fetchall()
stop = time.time()
print(stop - start)
conn.close()

The resulting time is slightly higher: 5.83 seconds.

What is the recommended approach? Should I always use FROM read_parquet('citibikedata/*.parquet') in every query, accepting slightly slower queries in exchange for not essentially doubling the storage cost? The documentation does not mention this.

szarnyasg commented 9 months ago

Hello, I guess this answer is a year late so it's not that relevant anymore... that said:

In general, if you 1) want to do interactive analytics with lots of iterations, 2) have enough storage, and 3) can wait for the full data set to load (which doesn't take very long, but still),

then you should load the data into the database first. You'll have better statistics, so the optimizer can produce better plans.
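For example, here is a minimal sketch of the load-once-then-query workflow, reusing the file and column names from the snippet in the question:

import duckdb

# One-time load: materialize the Parquet data as a table in a persistent database file
conn = duckdb.connect("citibike.duckdb")
conn.execute("CREATE TABLE citibike AS SELECT * FROM read_parquet('citibikedata/*.parquet')")
conn.close()

# Later sessions reopen the database and query the stored table
# (and its statistics) without re-reading the Parquet files
conn = duckdb.connect("citibike.duckdb")
res = conn.execute("""
    SELECT station_id, AVG(num_bikes_available)
    FROM citibike
    GROUP BY 1
""").fetchall()
conn.close()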

If you have 1) limited space, or 2) queries that only target a few columns of the data set,

then run the queries on Parquet directly. Parquet files still carry some statistics, so your query plans will not be too bad, and reading from Parquet is fast.
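As a sketch of that approach, here is the same query run directly against the Parquet files; the file and column names are taken from the question, and the WHERE clause is added purely for illustration:

import duckdb

# Query the Parquet files in place: only the referenced columns are read,
# and the files' row-group statistics can help skip data for the filter.
conn = duckdb.connect(":memory:")
res = conn.execute("""
    SELECT station_id, AVG(num_bikes_available)
    FROM read_parquet('citibikedata/*.parquet')
    WHERE num_bikes_available > 0  -- illustrative filter, not from the original post
    GROUP BY 1
""").fetchall()
conn.close()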