erezsh / reladiff

High-performance diffing of large datasets across databases
https://reladiff.readthedocs.io/en/latest/index.html#

Connection Pooling for comparing many tables #53

Open alex-mirkin opened 1 week ago

alex-mirkin commented 1 week ago

Is your feature request related to a problem? Please describe.

When comparing hundreds of tables in parallel, a connection is opened and closed for every table, and this per-table connection overhead adds up.

Describe the solution you'd like

It would be great to have some way to use connection pooling when comparing many tables. One option is to let the connect() method accept an existing connection object, such as the result of mysql_pool.get_connection() in the MySQL case, instead of credentials.
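To make the request concrete, here is a minimal, library-agnostic sketch of the pooling pattern in plain Python. `ConnectionPool`, `compare_many`, `factory`, and `compare_one` are hypothetical names used for illustration; none of them is part of reladiff, and the sketch does not assume that connect() can accept a pooled connection today.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool, not part of reladiff."""

    def __init__(self, factory, size):
        # `factory` is any zero-argument callable that opens one connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it


def compare_many(pool, table_names, compare_one, workers):
    """Call compare_one(conn, name) for every table, reusing pooled connections."""

    def job(name):
        with pool.connection() as conn:
            return name, compare_one(conn, name)

    with ThreadPoolExecutor(max_workers=workers) as executor:
        return dict(executor.map(job, table_names))
```

With this shape, the number of connections opened equals the pool size regardless of how many tables are compared. A real pool would also need to detect and replace broken connections, which this sketch leaves out.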

We can take this one step further and think about a method similar to diff_tables() that can accept a list of TableSegments and manage the threads / subprocesses and the connection pooling internally.

In this case, the goal would be to keep the database constantly saturated with a fixed number of concurrent connections, whether through many small tables (one thread each) or a few big tables (several threads each). The max_threadpool_size could be set dynamically per table, calculated from the table's data length. This should minimize the total time it takes to compare the tables.
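The scheduling idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `threads_for_table`, `run_with_budget`, `bytes_per_thread`, and the `compare(name, n_threads)` callable are hypothetical, and the callable stands in for whatever performs the actual diff with a thread pool of size `n_threads`. Each table gets a thread count derived from its data length, and a shared budget keeps the total number of threads in use at or below `total_connections`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def threads_for_table(data_length, bytes_per_thread, max_per_table):
    """One thread per `bytes_per_thread` of data, clamped to [1, max_per_table]."""
    wanted = -(-data_length // bytes_per_thread)  # ceiling division
    return max(1, min(max_per_table, wanted))


def run_with_budget(tables, compare, total_connections, bytes_per_thread=1 << 30):
    """Run compare(name, n_threads) for each table within a shared budget.

    tables: dict mapping table name -> data length in bytes (assumed known).
    """
    cond = threading.Condition()
    available = total_connections

    def worker(name, n):
        nonlocal available
        with cond:
            # Wait until enough of the budget is free for this table.
            cond.wait_for(lambda: available >= n)
            available -= n
        try:
            return name, compare(name, n)
        finally:
            with cond:
                available += n
                cond.notify_all()

    # Submit big tables first so they are not left to run alone at the end.
    jobs = sorted(tables.items(), key=lambda kv: kv[1], reverse=True)
    with ThreadPoolExecutor(max_workers=total_connections) as executor:
        futures = [
            executor.submit(
                worker,
                name,
                threads_for_table(size, bytes_per_thread, total_connections),
            )
            for name, size in jobs
        ]
        return dict(f.result() for f in futures)
```

Because no table is ever given more threads than the whole budget, every waiting job can eventually start once the running ones finish. The one-thread-per-gigabyte default is an arbitrary placeholder and would need tuning against a real database.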

Describe alternatives you've considered

If the user were responsible for creating the database connection, rather than having to go through the connect() method, they could implement connection pooling themselves to improve performance and reduce potential connection errors. Currently, however, the connection must be created with connect().

erezsh commented 1 week ago

We should be able to run several diff_tables() on the same connections. If that doesn't work right now, it's a bug that we need to fix.

Can you explain what changes are necessary beyond that, and why?