Closed · tomtaylor closed this 5 months ago
Thanks for the report!
In principle all Postgres threads should run in the same transaction snapshot, as explained here. However, not all Postgres servers support this, and the feature is disabled for e.g. AWS Aurora. If you repeat the run with `SET pg_debug_show_queries=true;` you should see statements like `SET TRANSACTION SNAPSHOT '00000004-0000B303-1'`. If those are missing, you might be running on a version that does not support transaction snapshots, which could explain the duplicate rows. You could run the query

```sql
SELECT pg_is_in_recovery(), pg_export_snapshot(), (SELECT count(*) FROM pg_stat_wal_receiver);
```

which is the one DuckDB runs to obtain snapshot information; if that doesn't work then this might not be supported.

As a work-around you can disable multi-threading (using e.g. `SET threads=1` or `SET pg_connection_limit=1`).
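As an illustration, a minimal sketch of that workaround applied to a nightly Parquet export (the table and file names here are placeholders, not taken from this report):

```sql
-- Limit the scan to a single thread / a single Postgres connection so the
-- whole table is read inside one transaction (and therefore one snapshot).
SET threads = 1;                 -- or: SET pg_connection_limit = 1;

ATTACH 'dbname=database' AS pg (TYPE postgres, READ_ONLY);

-- Hypothetical table/output names, for illustration only.
COPY (SELECT * FROM pg.my_table) TO 'my_table.parquet' (FORMAT PARQUET);
```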
Thanks - we're running on regular Postgres (hosted by Crunchybridge) and I can see those statements executing when I turn on the debug option.

```sql
SELECT pg_is_in_recovery(), pg_export_snapshot(), (select count(*) from pg_stat_wal_receiver);
```

```
 pg_is_in_recovery | pg_export_snapshot  | count
-------------------+---------------------+-------
 f                 | 00000058-01D28CC8-1 |     0
(1 row)
```

This job runs nightly and possibly overlaps with a full logical backup. Is there anything that could interact there?
Interestingly, I just took last night's dump which was a single threaded operation and outputted a single Parquet file for each table and found the same problem:
```sql
SELECT id, ROW_NUMBER() OVER (PARTITION BY id)
FROM read_parquet('table.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY id) > 1;
```
```
┌──────────┬─────────────────────────────────────┐
│    id    │ row_number() OVER (PARTITION BY id) │
│  int64   │                int64                │
├──────────┼─────────────────────────────────────┤
│ 61122606 │                                   2 │
│ 61150142 │                                   2 │
│ 61009086 │                                   2 │
│ 58830824 │                                   2 │
│ 60938241 │                                   2 │
│ 60944502 │                                   2 │
│ 61145921 │                                   2 │
│ 61181739 │                                   2 │
│ 60944526 │                                   2 │
│ 61131057 │                                   2 │
│ 60944530 │                                   2 │
│ 59597315 │                                   2 │
│ 61135999 │                                   2 │
│ 61150647 │                                   2 │
│ 61150686 │                                   2 │
│ 60944567 │                                   2 │
│ 61157702 │                                   2 │
│ 61181778 │                                   2 │
│ 60922176 │                                   2 │
│ 60944457 │                                   2 │
│ 60921690 │                                   2 │
│ 61129287 │                                   2 │
│ 61150691 │                                   2 │
│ 60370107 │                                   2 │
│ 61132247 │                                   2 │
│ 60366945 │                                   2 │
│ 60938573 │                                   2 │
│ 60944543 │                                   2 │
│ 59268777 │                                   2 │
│ 60938539 │                                   2 │
│ 60938545 │                                   2 │
│ 60923318 │                                   2 │
│ 60938808 │                                   2 │
│ 61130376 │                                   2 │
│ 60938569 │                                   2 │
│ 61181722 │                                   2 │
│ 61161559 │                                   2 │
│ 60944418 │                                   2 │
├──────────┴─────────────────────────────────────┤
│ 38 rows                              2 columns │
└────────────────────────────────────────────────┘
```
Performing a similar query against production produces zero results.
```sql
SELECT id, row_number
FROM (
    SELECT id,
           row_number() OVER (PARTITION BY id) AS row_number
    FROM table
) t
WHERE row_number > 1;
```
That's interesting. How many rows are in the table, and what kind of operations are running over the system in parallel (only insertions, or also updates/deletes)?
For sanity - could you perhaps try writing to a DuckDB table and checking if that produces the same problem to ensure this is not perhaps triggering an issue in the Parquet writer?
Sure thing - what's the best way to execute that to achieve the same effect? I know how to do
```sql
FORCE INSTALL postgres_scanner FROM 'http://nightly-extensions.duckdb.org';
ATTACH 'dbname=database' AS pg (TYPE postgres, READ_ONLY);
CREATE TABLE table AS SELECT * FROM pg.table;
```
But I'm not sure if that tests the same path?
Yes, that tests the same path from the Postgres reader's point of view, so that would be great. Perhaps try running with a single thread again to isolate as much as possible?
I just ran that end to end twice into a DuckDB database and dumped 5,037,494 rows, followed by 5,044,745 rows, with no duplicates on either run. To confirm, I downloaded last night's single-threaded (by setting the PG connections to 1) Parquet file, and it contains 2 duplicate rows.

The data is moving constantly with inserts/deletes/updates; I estimate 20% of rows change on a daily basis. But it sounds like maybe the Parquet writer might be at fault here?
That is possible. It could also be some connection between the Postgres scanner and the Parquet writer. Could you try dumping the contents of the DuckDB file to Parquet, and seeing if there are still no duplicate rows?
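For reference, a sketch of that re-export step, assuming the data was already materialized into a DuckDB table (names here are placeholders):

```sql
-- Export the DuckDB-side copy back to Parquet; the Postgres scanner is no
-- longer involved, so this isolates the Parquet writer.
COPY my_table TO 'my_table_from_duckdb.parquet' (FORMAT PARQUET);

-- Then re-run the duplicate check against the new file.
SELECT id
FROM read_parquet('my_table_from_duckdb.parquet')
QUALIFY row_number() OVER (PARTITION BY id) > 1;
```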
I had to `SET preserve_insertion_order = false` to prevent out-of-memory issues, but there are no duplicates in the exported Parquet file.
@Mytherin any more thoughts on this? 🙏
Could you try `SET preserve_insertion_order = false` when doing the copy to Parquet from Postgres directly and check if there are still duplicates? `SET preserve_insertion_order = false` writes data in a substantially different manner, so it's possible that there's a problem in the insertion-order-preserving Parquet write and that disabling it circumvented the bug.
These have been running with `SET preserve_insertion_order = false` for quite a while now, as we needed that to avoid hitting memory limits on the instance we run this job on.
I don't know if it's useful, but I'd be happy to send you one of the sample Parquet files privately.
Unfortunately this is still present in DuckDB 0.10.0.
Could you send over one of the broken Parquet files (mark@duckdblabs.com)?
Another thing I can think of - could you perhaps export the result of `postgres_query('SELECT * FROM tbl')` to Parquet instead of querying the table using the Postgres scanner? There might be something funky going on with the ctids we are using (e.g. we might not be grabbing the correct locks to prevent rows from being moved around, perhaps causing duplicate rows to appear in that manner).
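A sketch of what that export could look like, assuming the database is attached as `pg` (recent builds of the extension take the attached database name as the first argument to `postgres_query`; the table and file names are placeholders):

```sql
-- Run the query on the Postgres side in a single statement and export the
-- result, bypassing DuckDB's parallel ctid-range scan entirely.
ATTACH 'dbname=database' AS pg (TYPE postgres, READ_ONLY);

COPY (
    SELECT * FROM postgres_query('pg', 'SELECT * FROM tbl')
) TO 'tbl.parquet' (FORMAT PARQUET);
```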
I have this issue as well. From what I can tell it is only an issue with rows that were updated during the COPY process. I debugged the queries that DuckDB executes on Postgres and found this:

When multiple connections are used, the first connection starts a transaction at the default READ COMMITTED isolation level:

```sql
BEGIN TRANSACTION READ ONLY;
SELECT pg_is_in_recovery(), pg_export_snapshot(), (select count(*) from pg_stat_wal_receiver);
```

All other connections use:

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
SET TRANSACTION SNAPSHOT 'some-snapshot-id';
```

A READ COMMITTED transaction can see updates committed by other transactions, so I think this is what is causing the issue. Furthermore, this is a problem even if we limit threads, and also if we limit the number of Postgres connections with `SET pg_connection_limit=1`, since that then only uses the first READ COMMITTED transaction. If all the connections used REPEATABLE READ isolation, I assume this wouldn't be a problem anymore, and multiple connections would share the same transaction snapshot.
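In other words, the consistent behaviour would be for every connection to follow the same snapshot-sharing protocol, sketched here (the snapshot id is an example value in the format `pg_export_snapshot()` returns):

```sql
-- Connection 1: use REPEATABLE READ so this connection's snapshot is frozen
-- too, then export it for the other connections to adopt.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
SELECT pg_export_snapshot();  -- e.g. '00000058-01D28CC8-1'

-- Connections 2..N: adopt the exported snapshot so every ctid-range scan
-- sees exactly the same version of the table.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
SET TRANSACTION SNAPSHOT '00000058-01D28CC8-1';
```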
That's interesting - thanks for investigating! Considering the problem also occurred in single-threaded operation, the isolation level is definitely a likely culprit. I've pushed a PR that switches the main transaction to use repeatable read here - https://github.com/duckdb/postgres_scanner/pull/207. Perhaps you could give that a shot and see if it resolves the issue?
Nice @Mytherin, it's in my interest to get a potential fix out asap, so I'm happy to test it out. What's the easiest way to test your PR? Or did you mean waiting for the nightly build?
You should be able to fetch the artifacts from the PR itself - see here https://github.com/duckdb/postgres_scanner/actions/runs/8601970047?pr=207 (scroll down to "Artifacts")
Thanks a lot @Mytherin, I was able to load the unsigned extension and try this out.
I ran a single connection without any threading to maximize the likelihood of changes being applied between the different COPY statements with ctid ranges, and concurrently ran a script to update a bunch of rows in Postgres. I was able to verify that with the current extension in 0.10.1, duplicates from the different transactions are created, but with the new extension from the PR this was not the case.
Let me know if you need anything else here, I'd really like to help push this out asap.
Thanks so much for spotting the underlying issue here @noppaz!
Thanks - I’ve merged the PR. I can push out a new version of the extension tomorrow after the nightly builds.
I've published the nightly - `FORCE INSTALL postgres_scanner` should get you the new version in v0.10.1 (potentially pending some Cloudflare caches being cleared).
### What happens?

We run a nightly process to dump some Postgres tables into Parquet files. Sometimes we see a handful of rows duplicated in the output. In last night's example, we saw 33 rows out of 5,396,400 with a duplicate copy. This might be unavoidable with `PER_THREAD_OUTPUT` enabled against a moving data set, but it might be worth documenting this. Does it also mean some rows might be missing?
### To Reproduce
I think this might be very difficult to reproduce, but our script looks something like this:
Then:
**OS:** Linux

**PostgreSQL Version:** 14.7

**DuckDB Version:** 0.9.2

**DuckDB Client:** CLI

**Full Name:** Tom Taylor

**Affiliation:** Breakroom

**Have you tried this on the latest `main` branch?**

**Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?**