> However, the CrateDB dialect does not do that, so that inserting >750k records will blow up the server with OOM errors.

With the patch from crate/crate-python#539, this scenario will start working flawlessly, even with higher numbers in `sqlalchemy_efficient_inserts.py`. That's how it should be.

```python
INSERT_RECORDS = 2_750_000
BATCHED_PAGE_SIZE = 20_000
```
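For orientation, a batched insert in the spirit of that script might look like the following sketch. The table definition and connection URL are illustrative assumptions, not taken from `sqlalchemy_efficient_inserts.py`.

```python
import sqlalchemy as sa

INSERT_RECORDS = 2_750_000
BATCHED_PAGE_SIZE = 20_000

# Illustrative engine and table; adjust the URL and schema to your setup.
engine = sa.create_engine("crate://localhost:4200")
metadata = sa.MetaData()
table = sa.Table(
    "testdrive",
    metadata,
    sa.Column("id", sa.Integer),
    sa.Column("name", sa.String),
)
metadata.create_all(engine)

records = [{"id": i, "name": f"item-{i}"} for i in range(INSERT_RECORDS)]

# Submit the records in explicit chunks, so no single statement has to
# carry all 2.75 million rows at once.
with engine.begin() as conn:
    for offset in range(0, len(records), BATCHED_PAGE_SIZE):
        conn.execute(table.insert(), records[offset:offset + BATCHED_PAGE_SIZE])
```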
With 74e82c0383e, the `insertmanyvalues` feature has been unlocked for the CrateDB dialect. The corresponding test case demonstrates its use with SQLAlchemy Core.
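As a rough illustration of what such a Core-level usage exercises, the pattern below hands one large parameter list to `Connection.execute()` and lets SQLAlchemy's `insertmanyvalues` machinery split it into pages. The URL, table, and page size are assumptions, not copied from the test case.

```python
import sqlalchemy as sa

# insertmanyvalues_page_size caps how many rows SQLAlchemy packs into a
# single multi-row INSERT when the dialect uses the insertmanyvalues feature.
engine = sa.create_engine("crate://localhost:4200", insertmanyvalues_page_size=5_000)

metadata = sa.MetaData()
table = sa.Table(
    "demo",
    metadata,
    sa.Column("id", sa.Integer),
    sa.Column("name", sa.String),
)
metadata.create_all(engine)

rows = [{"id": i, "name": f"row-{i}"} for i in range(100_000)]

with engine.begin() as conn:
    # One executemany-style call; the paging happens inside SQLAlchemy.
    conn.execute(table.insert(), rows)
```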
At https://github.com/crate/crate-python/pull/539#issuecomment-1470842449, we outlined that 8073178cc was needed to add a modernized version of the `test_bulk_save` test case for the SQLAlchemy 2.0 ORM, now using `session.add_all()` instead of the legacy `session.bulk_save_objects()`, as suggested.
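For context, the modernized ORM pattern looks roughly like this; the model, URL, and record count are illustrative and not taken from the test case.

```python
from sqlalchemy import Integer, String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)


engine = create_engine("crate://localhost:4200")  # illustrative URL
Base.metadata.create_all(engine)

with Session(engine) as session:
    # session.add_all() replaces the legacy session.bulk_save_objects();
    # on flush, the SQLAlchemy 2.0 ORM batches the resulting INSERTs.
    session.add_all([Item(id=i, name=f"item-{i}") for i in range(10_000)])
    session.commit()
```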
> We have been looking into getting performance optimizations from `bulk_save()` to be inherently part of `add_all()`.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-1233465
> The 1.4 version of the "ORM bulk insert" methods are really not very efficient anyway and don't grant that much of a performance bump vs. regular ORM `session.add()`, provided in both cases the objects you provide already have their primary key values assigned. SQLAlchemy 2.0 made a much more comprehensive change to how this all works as well so that all INSERT methods are essentially extremely fast now, relative to the 1.x series.
>
> As is illustrated in the performance examples, you can run INSERT statements directly using the `insert()` construct using `connection.execute()`, that will give you the best performance of all while still using DML (that is, SQL statements and not special APIs such as COPY). If I have a lot of rows to insert and I'm trying to optimize the performance, that's what I'd use.
>
> Real "bulk" INSERTs, which usually seem to be a PostgreSQL thing for people, will always be dramatically faster if you use PG COPY directly.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-4789701
> If you want the absolutely highest "bulk save" performance for PostgreSQL drivers including asyncpg I would look at ORM "upsert" statements.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-1233465
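A minimal sketch of such an ORM-enabled upsert, using the PostgreSQL dialect's `insert()` construct and reusing the hypothetical `Item` model and `engine` from the sketch above; CrateDB's own ON CONFLICT support may behave differently.

```python
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.orm import Session

rows = [{"id": i, "name": f"item-{i}"} for i in range(10_000)]

with Session(engine) as session:
    stmt = pg_insert(Item).values(rows)
    # On primary-key collisions, update the existing row instead of failing.
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={"name": stmt.excluded["name"]},
    )
    session.execute(stmt)
    session.commit()
```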
Looks to me like this issue is solved; otherwise, please re-open with some context.
Problem
There is an indication that the SQLAlchemy CrateDB dialect currently implements only the `bulk_save_objects()` method for bulk inserts (see bulk_test.py#L70). The SQLAlchemy documentation describes this as a legacy method.
Background
Bulk inserts have been optimized in SQLAlchemy 2.0. For more background, see:
Focus
Here, we are specifically looking at controlling the batch size.
Analysis
The program `sqlalchemy_efficient_inserts.py` exercises two different bulk transfer options with SQLite, PostgreSQL, and CrateDB. The outcome is that the PostgreSQL dialect will happily chunk the inserted records, correctly respecting the `insertmanyvalues_page_size` option. However, the CrateDB dialect does not do that, so inserting >~500k records will blow up the server with OOM errors.
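To make the batch-size control concrete: with SQLAlchemy 2.0, the page size can be set engine-wide or overridden per connection via execution options. A minimal sketch, assuming the `engine`, `table`, and `records` from the sketches further above:

```python
with engine.connect() as conn:
    # Override the engine-wide default for this connection only; qualifying
    # executemany INSERTs are then split into pages of at most 10,000 rows.
    conn = conn.execution_options(insertmanyvalues_page_size=10_000)
    conn.execute(table.insert(), records)
    conn.commit()
```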