amotl closed this 1 year ago.
This patch apparently breaks other tests. I will have to investigate.
> This patch apparently breaks other tests.
Indeed. Enabling `use_insertmanyvalues_wo_returning = True` makes `SqlAlchemyBulkTest.test_bulk_save` fail. Because this area, which covers how bulk inserts were handled before, now seems to be declared "legacy", it may make sense to remove it altogether.
> Enabling `use_insertmanyvalues_wo_returning = True` makes `SqlAlchemyBulkTest.test_bulk_save` fail.

> Problem: There is an indication that the SQLAlchemy CrateDB dialect currently only implements the `bulk_save_objects` method for bulk inserts [...]. This method is a legacy feature as of the 2.0 series of SQLAlchemy [...].

-- crate/sqlalchemy-cratedb#97
Commit 8073178cc adds a modernized version of this test case for SA20, using `session.add_all()` instead of the legacy `session.bulk_save_objects()`, as suggested by the SQLAlchemy documentation. More details are provided in the sections below and at https://github.com/crate/sqlalchemy-cratedb/issues/97.
> `bulk_save_objects`: Perform a bulk save of the given list of objects.
>
> This method is a legacy feature as of the 2.0 series of SQLAlchemy. For modern bulk INSERT and UPDATE, see the sections ORM Bulk INSERT Statements and ORM Bulk UPDATE by Primary Key.

-- https://docs.sqlalchemy.org/orm/session_api.html#sqlalchemy.orm.Session.bulk_save_objects
> The Session includes legacy methods for performing "bulk" INSERT and UPDATE statements. These methods share implementations with the SQLAlchemy 2.0 versions of these features, described at ORM Bulk INSERT Statements and ORM Bulk UPDATE by Primary Key, however lack many features, namely RETURNING support as well as support for session-synchronization.

-- https://docs.sqlalchemy.org/orm/queryguide/dml.html#legacy-session-bulk-insert-methods
> The 1.4 version of the "ORM bulk insert" methods are really not very efficient and don't grant that much of a performance bump vs. regular ORM `session.add()`, provided in both cases the objects you provide already have their primary key values assigned. SQLAlchemy 2.0 made a much more comprehensive change to how this all works as well so that all INSERT methods are essentially extremely fast now, relative to the 1.x series.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-4789701
> A list of parameter dictionaries sent to the `Session.execute.params` parameter, separate from the Insert object itself, will invoke bulk INSERT mode for the statement, which essentially means the operation will optimize as much as possible for many rows.

-- https://docs.sqlalchemy.org/orm/queryguide/dml.html#orm-queryguide-bulk-insert
> We have been looking into getting performance optimizations from `bulk_save()` to be inherently part of `add_all()`.

-- https://github.com/sqlalchemy/sqlalchemy/discussions/6935#discussioncomment-1233465
> The remaining performance limitation, that the `cursor.executemany()` DBAPI method does not allow for rows to be fetched, is resolved for most backends by foregoing the use of `executemany()` and instead restructuring individual INSERT statements to each accommodate a large number of rows in a single statement that is invoked using `cursor.execute()`. This approach originates from the `psycopg2` fast execution helpers feature of the `psycopg2` DBAPI, which SQLAlchemy incrementally added more and more support towards in recent release series.

-- https://docs.sqlalchemy.org/core/connections.html#engine-insertmanyvalues
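At the DBAPI level, the restructuring described above amounts to one multi-row `INSERT ... VALUES (?, ?), (?, ?), ...` with a flattened parameter list, executed via `cursor.execute()` instead of `cursor.executemany()`. A hand-rolled sketch using only the standard library `sqlite3` module (not SQLAlchemy itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (name TEXT, age INT)")

rows = [("Arthur", 35), ("Banshee", 26), ("Callisto", 37)]

# One VALUES group per row, all bound in a single statement...
placeholders = ", ".join(["(?, ?)"] * len(rows))
stmt = f"INSERT INTO characters (name, age) VALUES {placeholders}"

# ...with the per-row parameter tuples flattened into one sequence.
flat_params = [value for row in rows for value in row]

conn.execute(stmt, flat_params)
count = conn.execute("SELECT count(*) FROM characters").fetchone()[0]
```

This is exactly the shape the updated test asserts: a single statement with three `(?, ?)` groups and flat bind parameters, rather than `executemany()` with one parameter tuple per row.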
For your information, this diff outlines the differences between the legacy and the modern variant. Because `test_bulk_save_modern` was added as a new test case method, those differences would otherwise not be visible when looking at the patch.
```diff
-    def test_bulk_save_legacy(self):
+    def test_bulk_save_modern(self):
         """
-        Verify legacy SQLAlchemy bulk INSERT mode.
+        Verify modern SQLAlchemy bulk INSERT mode.
         """
+
+        # Adjust diff output on errors.
+        self.maxDiff = None
+
         chars = [
             self.character(name='Arthur', age=35),
             self.character(name='Banshee', age=26),
@@ -12,20 +16,21 @@
         fake_cursor.description = ()
         fake_cursor.rowcount = len(chars)
-        fake_cursor.executemany.return_value = [
+        fake_cursor.execute.return_value = [
             {'rowcount': 1},
             {'rowcount': 1},
             {'rowcount': 1},
         ]
-        self.session.bulk_save_objects(chars)
-        (stmt, bulk_args), _ = fake_cursor.executemany.call_args
+        self.session.add_all(chars)
+        self.session.commit()
+        (stmt, bulk_args), _ = fake_cursor.execute.call_args
-        expected_stmt = "INSERT INTO characters (name, age) VALUES (?, ?)"
+        expected_stmt = "INSERT INTO characters (name, age) VALUES (?, ?), (?, ?), (?, ?)"
         self.assertEqual(expected_stmt, stmt)
         expected_bulk_args = (
-            ('Arthur', 35),
-            ('Banshee', 26),
-            ('Callisto', 37)
+            'Arthur', 35,
+            'Banshee', 26,
+            'Callisto', 37,
         )
```
@mfussenegger and @matriv: I needed to add more changes to satisfy the `SqlAlchemyBulkTest`, and I would like you to review this once more, also to learn about how SQLAlchemy is evolving in this spot. Thank you already!
Thanks for the review, @matriv. I hope your comments have been addressed well. I will let it sit until @mfussenegger can have another look.
About:

> SQLAlchemy's `insertmanyvalues` feature lets you control the batch size of `INSERT` operations using the `insertmanyvalues_page_size` option, which is applicable to all of the engine, connection, and statement objects.

-- https://docs.sqlalchemy.org/core/connections.html#controlling-the-batch-size
References
The patch will resolve this issue.
/cc @SStorm, @WalBeh, @seut, @hlcianfagna