> However, the CrateDB dialect does not do that, so that inserting >750k records will blow up the server with OOM errors.

With the patch from crate/crate-python#539, this scenario will start working flawlessly, even with higher numbers in `sqlalchemy_efficient_inserts.py`. That's how it should be.

```python
INSERT_RECORDS = 2_750_000
BATCHED_PAGE_SIZE = 20_000
```
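For orientation, a batched insert in the spirit of that script might look like the following sketch. The table definition and connection URL are illustrative assumptions, not taken from `sqlalchemy_efficient_inserts.py`.

```python
import sqlalchemy as sa

INSERT_RECORDS = 2_750_000
BATCHED_PAGE_SIZE = 20_000

# Illustrative engine and table; adjust the URL and schema to your setup.
engine = sa.create_engine("crate://localhost:4200")
metadata = sa.MetaData()
table = sa.Table(
    "testdrive",
    metadata,
    sa.Column("id", sa.Integer),
    sa.Column("name", sa.String),
)
metadata.create_all(engine)

records = [{"id": i, "name": f"item-{i}"} for i in range(INSERT_RECORDS)]

# Submit the records in explicit chunks, so no single statement has to
# carry all 2.75 million rows at once.
with engine.begin() as conn:
    for offset in range(0, len(records), BATCHED_PAGE_SIZE):
        conn.execute(table.insert(), records[offset:offset + BATCHED_PAGE_SIZE])
```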
With 74e82c0383e, the `insertmanyvalues` feature has been unlocked for the CrateDB dialect. The corresponding test case demonstrates its use with SQLAlchemy Core.
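As a rough illustration of what such a Core-level usage exercises, the pattern below hands one large parameter list to `Connection.execute()` and lets SQLAlchemy's `insertmanyvalues` machinery split it into pages. The URL, table, and page size are assumptions, not copied from the test case.

```python
import sqlalchemy as sa

# insertmanyvalues_page_size caps how many rows SQLAlchemy packs into a
# single multi-row INSERT when the dialect uses the insertmanyvalues feature.
engine = sa.create_engine("crate://localhost:4200", insertmanyvalues_page_size=5_000)

metadata = sa.MetaData()
table = sa.Table(
    "demo",
    metadata,
    sa.Column("id", sa.Integer),
    sa.Column("name", sa.String),
)
metadata.create_all(engine)

rows = [{"id": i, "name": f"row-{i}"} for i in range(100_000)]

with engine.begin() as conn:
    # One executemany-style call; the paging happens inside SQLAlchemy.
    conn.execute(table.insert(), rows)
```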
At https://github.com/crate/crate-python/pull/539#issuecomment-1470842449, we outlined that 8073178cc was needed to add a modernized version of the `test_bulk_save` test case for the SQLAlchemy 2.0 ORM, now using `session.add_all()` instead of the legacy `session.bulk_save_objects()`, as suggested.
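For context, the modernized ORM pattern looks roughly like this; the model, URL, and record count are illustrative and not taken from the test case.

```python
from sqlalchemy import Integer, String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)


engine = create_engine("crate://localhost:4200")  # illustrative URL
Base.metadata.create_all(engine)

with Session(engine) as session:
    # session.add_all() replaces the legacy session.bulk_save_objects();
    # on flush, the SQLAlchemy 2.0 ORM batches the resulting INSERTs.
    session.add_all([Item(id=i, name=f"item-{i}") for i in range(10_000)])
    session.commit()
```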
> We have been looking into getting performance optimizations from `bulk_save()` to be inherently part of `add_all()`.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-1233465
> The 1.4 version of the "ORM bulk insert" methods are really not very efficient anyway and don't grant that much of a performance bump vs. regular ORM `session.add()`, provided in both cases the objects you provide already have their primary key values assigned. SQLAlchemy 2.0 made a much more comprehensive change to how this all works as well so that all INSERT methods are essentially extremely fast now, relative to the 1.x series.
>
> As is illustrated in the performance examples, you can run INSERT statements directly using the `insert()` construct using `connection.execute()`, that will give you the best performance of all while still using DML (that is, SQL statements and not special APIs such as COPY). If I have a lot of rows to insert and I'm trying to optimize the performance, that's what I'd use.
>
> Real "bulk" INSERTs, which usually seem to be a PostgreSQL thing for people, will always be dramatically faster if you use PG COPY directly.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-4789701
> If you want the absolutely highest "bulk save" performance for PostgreSQL drivers including asyncpg I would look at ORM "upsert" statements.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-1233465
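A minimal sketch of such an ORM-enabled upsert, using the PostgreSQL dialect's `insert()` construct and reusing the hypothetical `Item` model and `engine` from the sketch above; CrateDB's own ON CONFLICT support may behave differently.

```python
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.orm import Session

rows = [{"id": i, "name": f"item-{i}"} for i in range(10_000)]

with Session(engine) as session:
    stmt = pg_insert(Item).values(rows)
    # On primary-key collisions, update the existing row instead of failing.
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={"name": stmt.excluded["name"]},
    )
    session.execute(stmt)
    session.commit()
```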
Looks to me like this issue is solved; otherwise, please re-open with some context.
Problem
There is an indication that the SQLAlchemy CrateDB dialect currently implements only the `bulk_save_objects()` method for bulk inserts (see bulk_test.py#L70). The SQLAlchemy documentation describes this as a legacy method.
Background
Bulk inserts have been optimized in SQLAlchemy 2.0. For more background, see:
Focus
Here, we are specifically looking at controlling the batch size.
Analysis
The program `sqlalchemy_efficient_inserts.py` exercises two different bulk transfer options with SQLite, PostgreSQL, and CrateDB. The outcome is that the PostgreSQL dialect will happily chunk the inserted records, correctly respecting the `insertmanyvalues_page_size` option. However, the CrateDB dialect does not do that, so inserting >~500k records will blow up the server with OOM errors.
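To make the batch-size control concrete: with SQLAlchemy 2.0, the page size can be set engine-wide or overridden per connection via execution options. A minimal sketch, assuming the `engine`, `table`, and `records` from the sketches further above:

```python
with engine.connect() as conn:
    # Override the engine-wide default for this connection only; qualifying
    # executemany INSERTs are then split into pages of at most 10,000 rows.
    conn = conn.execution_options(insertmanyvalues_page_size=10_000)
    conn.execute(table.insert(), records)
    conn.commit()
```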