sidroopdaska opened this issue 1 year ago
For point queries you might want to enable filter pushdown in the attach, e.g.:
CALL postgres_attach('{DB_URL}', filter_pushdown=true);
Also note that the benchmarks run in the blog post have both DuckDB and Postgres running on the same machine which means the data transfer is significantly faster than it would be if they were connected over a network connection.
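For reference, a minimal sketch of what that looks like end to end (the connection string, table, and column names here are placeholders, and this assumes the postgres_scanner extension from the blog post):

INSTALL postgres_scanner;
LOAD postgres_scanner;

-- Attach with filter pushdown so predicates can be sent to Postgres
CALL postgres_attach('host=localhost dbname=mydb user=me', filter_pushdown=true);

-- Point query whose WHERE clause can now be evaluated on the Postgres side
SELECT * FROM large_table WHERE id = 1234;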
I tried the filter_pushdown=true flag, but there was no noticeable difference.
Also note that the benchmarks run in the blog post have both DuckDB and Postgres running on the same machine
I see
Besides this feature flag, are there any other optimisations I can make?
@sidroopdaska Have you found a solution/reason for the performance problem? I also have performance issues with my large table (118 million rows) which runs indefinitely even with simple queries.
SELECT * FROM large_table WHERE id = 1234;
Have you already tried enabling SET pg_experimental_filter_pushdown=true?
Yes, just did that. Thank you for mentioning this. It wasn't clear to me that
SELECT * FROM large_table WHERE id = 1234;
is converted into
COPY (
SELECT "id" FROM "public"."large_table"
WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid)
TO STDOUT (FORMAT binary);
In my case,
SET pg_experimental_filter_pushdown=TRUE;
led to
COPY (
SELECT "id"
FROM "public"."large_table"
WHERE ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid AND ("id" = '1234' AND "id" IS NOT NULL)
) TO STDOUT (FORMAT binary);
Also, increasing
SET pg_pages_per_task = 100000;
helped speed up the query a lot.
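Putting those two settings together, my session now looks roughly like this (the exact values are just what worked for my table, not universal defaults):

SET pg_experimental_filter_pushdown=true;  -- push the WHERE clause into the Postgres-side COPY
SET pg_pages_per_task=100000;              -- larger ctid ranges, far fewer COPY round trips
SELECT * FROM large_table WHERE id = 1234;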
Similar issue here - I think my ctids are pretty fragmented due to high writes to the table. We're using RDS with postgres 15.
I'm fetching about 5mil rows in 12sec when I use a single batch (setting pg_pages_per_task above 5mil), but as soon as it starts paging, the process takes much longer.
Setting it to 3mil (for 2 batches) takes about 3x as long.
It would be nice to be able to turn off paging altogether, or to specify a different numeric partition key (connector-x allows this).
Yup, you can add a fourth user reporting this problem.
And a fifth. I was not aware of the experimental filter pushdown flag, but the extra overhead from paging made the connector far too slow to use for day-to-day queries on large tables.
I think that this new option might resolve this issue:
SET pg_use_ctid_scan=false;
FWIW we have been using this workaround in prod for a while:
SET pg_pages_per_task=9876543; -- silly huge number that effectively disables paging
but we'll move to the new option as soon as it's available (not sure if it's landed in a release yet, and we're not on latest 10.x)
I tested this in a sandbox, and SET pg_use_ctid_scan=false along with SET pg_experimental_filter_pushdown=true worked well for WHERE clauses. Aggregates didn't push down, but that was not expected.
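Concretely, the sandbox test boiled down to something like this (table and column names are illustrative, not my real schema):

SET pg_use_ctid_scan=false;                -- skip the ctid BETWEEN paging entirely
SET pg_experimental_filter_pushdown=true;  -- send the WHERE clause to Postgres

SELECT * FROM large_table WHERE id = 1234;          -- predicate is pushed down
SELECT count(*) FROM large_table WHERE id > 1000;   -- the aggregate itself still runs in DuckDB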
I took inspiration from https://duckdb.org/2022/09/30/postgres-scanner.html and set up a connection between DuckDB and our Supabase project. Simple queries, like the one listed below, take an average of 40s to complete. I believe this is quite slow compared to the performance benchmarks.
I'm running DuckDB (v0.9.1) on a VM with 1TB RAM and 172 CPUs. Our Supabase PostgreSQL DB (v15.1.0.129) is backed by 1GB RAM, 2 vCPUs, and 8GB file storage. Both the VM and the DB are in the same region (us-east-1), and my VM has 10Gbps of internet bandwidth.
The index table has 15M rows (4.3GB in size) with the following schema: id (primary key, text), label (text), classifier_softmax (float[]).
My understanding is that DuckDB will automatically choose the maximum number of threads based on os.cpu_count.
Any help on how to speed up and optimise my analytic queries will be greatly appreciated!
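For reference, the connection is set up roughly as in the blog post (connection string redacted; the point query shown here is only an example of the shape of queries we run, not the exact one):

INSTALL postgres_scanner;
LOAD postgres_scanner;
CALL postgres_attach('host=db.<project>.supabase.co port=5432 dbname=postgres user=postgres password=***');

-- DuckDB picks the thread count from the machine's core count by default; it can also be set explicitly
SET threads=172;

-- index table schema: id (primary key, text), label (text), classifier_softmax (float[])
SELECT label, classifier_softmax FROM "index" WHERE id = 'some-id';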