duckdb / postgres_scanner


Slow query performance on postgres table #159

Closed daoleno closed 5 months ago

daoleno commented 6 months ago

What happens?

I have a table that has 2M+ rows. It takes ~600 ms to execute the following SQL on a remote machine.

=> select * from blocks limit 10;
Time: 647.549 ms

Query Plan

=> explain analyze select * from blocks limit 10;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.89 rows=10 width=550) (actual time=0.008..0.010 rows=10 loops=1)
   ->  Seq Scan on blocks  (cost=0.00..240240.06 rows=2713106 width=550) (actual time=0.007..0.007 rows=10 loops=1)
 Planning Time: 0.082 ms
 Execution Time: 0.025 ms
(4 rows)

Time: 278.611 ms

But it takes ~1 min to query in DuckDB attach mode.

D select * from db.blocks limit 10;
...
Run Time (s): real 58.916 user 8.175093 sys 0.138111

And if I query with ORDER BY on a column that has an index in PostgreSQL, DuckDB takes so long (10+ minutes) that it effectively cannot return results at all.

To Reproduce

Following the documentation at https://duckdb.org/docs/archive/0.9.2/extensions/postgres, execute the same SQL query on a large table using both duckdb and psql.

OS:

MacBook Pro (M2)

PostgreSQL Version:

14.2

DuckDB Version:

0.9.2

DuckDB Client:

duckdb executable binary

Full Name:

daoleno

Affiliation:

daoleno

Have you tried this on the latest main branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Mytherin commented 5 months ago

Thanks for the report! Given that fetching just 10 rows takes 650 ms in Postgres, it seems your connection is very slow or the rows are very large. DuckDB does not operate on 10 rows at a time; it fetches much more data up-front, generally on the order of tens of thousands of rows. Even when executing a LIMIT 10, the limit is not pushed down into the Postgres scan, so a few thousand rows are loaded from Postgres regardless.

Similarly, running ORDER BY/LIMIT in DuckDB will not use the index that exists in Postgres; instead, the data is loaded into DuckDB and the ORDER BY/LIMIT is executed there.
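If the same Postgres table is queried repeatedly, one way to avoid re-fetching it over the network each time is to materialize it into a native DuckDB table first and run subsequent queries against the local copy. A sketch, assuming the attached database alias db and the blocks table from above; the ordering column number is a hypothetical example, not taken from the original report:

D CREATE TABLE blocks_local AS SELECT * FROM db.blocks;  -- one-time fetch from Postgres
D SELECT * FROM blocks_local ORDER BY number LIMIT 10;   -- runs entirely in DuckDB

The initial CREATE TABLE AS still transfers the full table once, but every query after that avoids the remote scan.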

If you want a LIMIT or an ORDER BY/LIMIT to execute inside Postgres itself, you can use the postgres_query function to run a query directly in Postgres and fetch only its result, e.g.:

select * from postgres_query('db', 'select * from blocks limit 10');
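The same approach applies to the ORDER BY case: pushing the whole query into Postgres lets its planner use the existing index and return only the requested rows. A sketch, where the ordering column number is an assumption for illustration, not from the original report:

select * from postgres_query('db', 'select * from blocks order by number limit 10');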