citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.36k stars 658 forks source link

INSERT ... RETURNING mixes order of rows #5770

Open crabhi opened 2 years ago

crabhi commented 2 years ago

When inserting multiple rows with e. g. autogenerating ids from a sequence, the returned ids don't match the order of inserted values.

DB setup:

CREATE TABLE t1(id SERIAL PRIMARY KEY, val INT);
SELECT create_distributed_table('t1', 'id');

Observing this:

test_db=# INSERT INTO t1 (val) VALUES (1), (2), (3) RETURNING id, val;
 id | val 
----+-----
  1 |   1
  3 |   3
  2 |   2
(3 rows)

Expected this (PostgreSQL behaviour):

test_db=# INSERT INTO t1 (val) VALUES (1), (2), (3) RETURNING id, val;
 id | val 
----+-----
  1 |   1
  2 |   2
  3 |   3
(3 rows)

Why we need it

The Django ORM has a method called bulk_insert that takes objects without id, inserts them into a database and assigns them back the ids returned by the database. We then insert related objects into a second table. This leads to incorrect foreign keys.

onderkalaci commented 2 years ago

I don't think Postgres guarantees the order of returning, see this: https://www.postgresql.org/message-id/flat/CAMsr%2BYEn5TOuhv_tTwY70S1zXf8jCQ-uixU8aOs4OQs7kojf6Q%40mail.gmail.com

And, given that Postgres does not guarantee, Citus also does not. But, with Citus, given that the rows might come from different nodes, the order is more likely to get in a random order.

To get ordered results, you can use ctes:

WITH results AS ( INSERT INTO t1 (val) VALUES (1), (2), (3) RETURNING id, val) SELECT * FROM results ORDER BY id, val;

This might add some tiny additional latency, but mostly should be fine.

crabhi commented 2 years ago

Hi, thanks for looking into the issue!

I don't think Postgres guarantees the order of returning

In the second message in the thread, Craig suggests mentioning in the docs that the ordering is preserved. However, it didn't get any attention so hard to say.

Thanks for the suggestion with CTE. Would that solve it really, though? I mean -- is it guaranteed that the IDs will get assigned in the same order as the rows appear in the insert? I send a list of objects for persistence and I need to know which of them got which ID.

For now, I have solved it by removing the DEFAULT value and assigning IDs client-side (at the moment sequential w/ locking but considering UUIDs). But it's a bit of a hassle when migrating an existing project that uses an ORM.

crabhi commented 2 years ago

Testing it on one server, the ids get assigned in the correct order so ORDER BY would work. I haven't tried it yet on a real distributed setup.

test_db=# INSERT INTO t1 (val) SELECT generate_series(4, 5000) RETURNING id, val;
INSERT 0 4997
test_db=# SELECT * FROM t1 WHERE id <> val;
 id | val 
----+-----
(0 rows)