18F / rdbms-subsetter

Generates a subset of a relational database that respects foreign key constraints
Creative Commons Zero v1.0 Universal

Use bulk inserts. #18

Closed: jmcarp closed this 9 years ago

jmcarp commented 9 years ago

The subsetter spends a significant chunk of its overall running time on single-row inserts into the destination tables. This patch accumulates rows in the pending attribute on each table, then flushes them to the database in bulk whenever the number of accumulated rows exceeds the buffer argument. Depending on the value of buffer, this change reduced running time by 20-30% on a sample database.
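For readers without the diff open, here is a minimal sketch of the buffering idea described above. It is not the code from this patch: the BufferedTable class, its method names, and the use of the standard-library sqlite3 module are assumptions made for illustration. Only the pending list and the buffer threshold come from the description.

```python
import sqlite3


class BufferedTable:
    """Hypothetical helper: collects rows in `pending`, inserts them in bulk."""

    def __init__(self, conn, name, columns, buffer=1000):
        self.conn = conn
        self.name = name
        self.columns = columns
        self.buffer = buffer      # flush once more than this many rows are pending
        self.pending = []         # rows waiting to be written
        self.flush_count = 0      # number of bulk inserts issued so far

    def add(self, row):
        """Queue one row; flush when the queue exceeds the buffer size."""
        self.pending.append(row)
        if len(self.pending) > self.buffer:
            self.flush()

    def flush(self):
        """Write all pending rows with a single executemany call."""
        if not self.pending:
            return
        placeholders = ", ".join("?" for _ in self.columns)
        column_list = ", ".join(self.columns)
        sql = f"INSERT INTO {self.name} ({column_list}) VALUES ({placeholders})"
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(sql, self.pending)
        self.pending = []
        self.flush_count += 1


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, label TEXT)")
    table = BufferedTable(conn, "item", ["id", "label"], buffer=3)
    for i in range(10):
        table.add((i, f"row {i}"))
    table.flush()  # write whatever is still pending at the end
    count = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
    return count, table.flush_count
```

The final flush() call matters: without it, any rows still below the threshold when the run ends would never reach the database.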

This change assumes that rdbms-subsetter is run against an empty destination database. If this isn't the case, the existence check on line 242 can be incorrect, since it doesn't actually query the database (which also saves a lot of time!). I can revise the patch to load existing primary keys from the destination database before inserting, which I think would take care of this potential issue.
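One way the proposed revision could look, as a rough sketch only: read the destination table's primary keys into a set once, then run the existence check against that set instead of the database. The function load_existing_keys, the table layout, and the use of sqlite3 are hypothetical and do not come from the patch.

```python
import sqlite3


def load_existing_keys(conn, table, pk_column):
    """Hypothetical helper: fetch the primary keys already in the destination."""
    rows = conn.execute(f"SELECT {pk_column} FROM {table}")
    return {row[0] for row in rows}


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, label TEXT)")
    # Simulate a non-empty destination database.
    conn.execute("INSERT INTO item VALUES (1, 'already there')")
    conn.commit()

    seen = load_existing_keys(conn, "item", "id")  # one query up front
    pending = []
    candidates = [(1, "dup"), (2, "new"), (2, "dup in batch"), (3, "new")]
    for row in candidates:
        if row[0] in seen:   # in-memory existence check, no database call
            continue
        seen.add(row[0])     # also guards against duplicates within the batch
        pending.append(row)

    with conn:
        conn.executemany("INSERT INTO item (id, label) VALUES (?, ?)", pending)
    return sorted(r[0] for r in conn.execute("SELECT id FROM item"))
```

The trade-off is one extra query and some memory per table at startup, in exchange for keeping the per-row check off the database.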

catherinedevlin commented 9 years ago

The docs make no promises that pushing to a non-empty target database will work, so I'm OK with that ability disappearing.