duckdb / postgres_scanner

How to speed up queries with DuckDB connected to PostgreSQL DB using postgres_scanner? #123

Open sidroopdaska opened 10 months ago

sidroopdaska commented 10 months ago

I took inspiration from https://duckdb.org/2022/09/30/postgres-scanner.html and set up a connection between DuckDB and our Supabase project. Simple queries, like the one listed below, take an average of 40s to complete. I believe this is quite slow compared to the published performance benchmarks.

import time

import duckdb
from duckdb import DuckDBPyConnection

DB_URL = "postgresql://postgres:<user>@<link>:<port>/<database_name>"

def connect_duckdb_supabase() -> DuckDBPyConnection:
    # Attach the Supabase Postgres database through the postgres_scanner extension
    con = duckdb.connect()
    con.execute("INSTALL postgres_scanner")
    con.execute("LOAD postgres_scanner")
    con.execute(f"CALL postgres_attach('{DB_URL}')")

    return con

if __name__ == "__main__":
    con = connect_duckdb_supabase()

    # Time the same point query five times
    for _ in range(5):
        start = time.time()
        res = con.execute("SELECT * FROM index WHERE label='fine'").fetchall()
        print(f"duration: {time.time() - start}")

I'm running DuckDB (v0.9.1) on a VM with 1TB RAM and 172 CPUs. Our Supabase PostgreSQL DB (v15.1.0.129) is backed by 1GB RAM, 2 vCPUs, and 8GB file storage. Both the VM and the DB are in the same region (us-east-1), and my VM has 10Gbps internet bandwidth.

The index table has 15M rows (4.3GB in size) with the following schema: id (primary key, text), label (text), classifier_softmax (float[])

My understanding is that DuckDB will automatically choose the maximum number of threads based on os.cpu_count.

Any help on how to speed up and optimise my analytic queries will be greatly appreciated!

Mytherin commented 10 months ago

For point queries you might want to enable filter pushdown in the attach, e.g.:

CALL postgres_attach('{DB_URL}', filter_pushdown=true);

Also note that the benchmarks in the blog post were run with both DuckDB and Postgres on the same machine, which means data transfer is significantly faster than it would be over a network connection.
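
Applied to the connection helper in the script above, that would look roughly like this (a sketch; it just adds the filter_pushdown flag to the same postgres_attach call and reuses the placeholder connection string):

import duckdb

DB_URL = "postgresql://postgres:<user>@<link>:<port>/<database_name>"

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
# Enable filter pushdown at attach time so WHERE clauses are evaluated in Postgres
con.execute(f"CALL postgres_attach('{DB_URL}', filter_pushdown=true)")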

sidroopdaska commented 10 months ago

I tried the filter_pushdown=true flag, but there was no noticeable difference.

"Also note that the benchmarks in the blog post were run with both DuckDB and Postgres on the same machine"

I see.

Besides this feature flag, are there any other optimisations I can make?

StephanGeorg commented 9 months ago

@sidroopdaska Have you found a solution/reason for the performance problem? I also have performance issues with a large table (118 million rows); even simple queries run indefinitely:

SELECT * FROM large_table WHERE id = 1234;

Mytherin commented 9 months ago

Have you already tried enabling SET pg_experimental_filter_pushdown=true?

StephanGeorg commented 9 months ago

Yes, just did that. Thank you for mentioning this. It was not clear to me that

SELECT * FROM large_table WHERE id = 1234;

is converted into

COPY (
  SELECT "id" FROM "public"."large_table" 
  WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid)
TO STDOUT (FORMAT binary);

in my case

SET pg_experimental_filter_pushdown=TRUE;

led to

COPY (
  SELECT "id" 
  FROM "public"."large_table" 
  WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid AND ("id" = '1234' AND "id" IS NOT NULL)
  ) TO STDOUT (FORMAT binary);

and also increasing

SET pg_pages_per_task = 100000;

helped to speed up the query a lot.
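
For anyone wiring this into the Python script from the original post, the combination would look roughly like this (a sketch; it assumes a connection set up as in that script, and large_table and the id value are just the placeholders from my example):

# Assumes `con` is a DuckDB connection with postgres_scanner loaded and the
# database attached, as in the script at the top of this issue.
con.execute("SET pg_experimental_filter_pushdown=true")  # push the WHERE clause to Postgres
con.execute("SET pg_pages_per_task=100000")              # scan more Postgres pages per task
rows = con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()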

davidgovea commented 6 months ago

Similar issue here. I think my ctids are pretty fragmented due to heavy writes to the table. We're using RDS with Postgres 15.

I'm fetching about 5 million rows in 12 seconds when I use a single batch (setting pg_pages_per_task above 5 million), but as soon as it starts paging, the process takes much longer.

Setting it to 3 million (for 2 batches) takes about 3x as long.

It would be nice to be able to turn off paging altogether, or to specify a different numerical partition key (connector-x allows this).

j-carson commented 4 months ago

Yup, you can add a fourth user reporting this problem.

teaguesterling commented 4 months ago

And a fifth. I was not aware of the experimental filter pushdown flag, but the extra overhead from paging made the connector far too slow to use for day-to-day queries on large tables.

davidgovea commented 4 months ago

I think that this new option might resolve this issue:

SET pg_use_ctid_scan=false;


FWIW we have been using this workaround in prod for a while: SET pg_pages_per_task=9876543; -- silly huge number that effectively disables paging

but we'll move to the new option as soon as it's available (not sure if it's landed in a release yet, and we're not on latest 10.x)

teaguesterling commented 4 months ago

I tested this in a sandbox and SET pg_use_ctid_scan=false along with SET pg_experimental_filter_pushdown=true worked well for WHERE clauses. Aggregates didn't push down, but that was not expected.
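
For reference, a rough sketch of the combination I tested (the table name and filter value are illustrative, and the connection string placeholder is from the original post):

import duckdb

DB_URL = "postgresql://postgres:<user>@<link>:<port>/<database_name>"

con = duckdb.connect()
con.execute("INSTALL postgres_scanner")
con.execute("LOAD postgres_scanner")
con.execute(f"CALL postgres_attach('{DB_URL}')")

# Skip the ctid-range paging and push WHERE clauses down to Postgres
con.execute("SET pg_use_ctid_scan=false")
con.execute("SET pg_experimental_filter_pushdown=true")

res = con.execute("SELECT * FROM large_table WHERE id = 1234").fetchall()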