Snowflake-Labs / schemachange

A Database Change Management tool for Snowflake
Apache License 2.0
481 stars 219 forks source link

Add support to execute queries with concurrent futures #199

Closed anthonywainer closed 3 months ago

anthonywainer commented 9 months ago

The proposed PR aims to enhance the functionality of the existing codebase by adding support to execute queries with concurrent futures. This feature will allow multiple queries to be executed simultaneously, which can significantly improve the performance of the application.

sfc-gh-tmathew commented 8 months ago

@anthonywainer

Can you provide one or two examples where parallel DDL executions will be useful?

Also, how do we handle exceptions when one or more SQL files have errors during concurrent execution? What is the expected behavior ?

Additionally, If we end up leveraging this feature, I would request renaming the switch and update the documentation because the end user would not comprehend "futures" unless they are familiar with python programming.

instead of --with-futures call it --execute-concurrently and -ec.

Also, I would imagine Concurrent executions should not execute Versioned SQL files and that it should only be applied after the versioned SQLs are executed.

Thoughts?

rgriffogoes commented 3 months ago

I'm not OP but I see value on this feature as well.

One example is a situation where the project is using lots of Repeatable scripts, then applying lots of files with same prefix could be done in parallel and reduce overall time.

I've recently deployed about 200 scripts and took about 10 minutes. I'm quite confident it could be drastically reduced with some parallel execution.

Error handling is tricky, would the transaction rollback feature work if all future threads share the same session/connection?

One thing I noticed that is not handled in this PR is that the order of scripts are not respected (e.g.: we can't run R_00001_xx.sql and R_0002_xx.sql in parallel). The orchestration of parallel runs should take care of different groups, running groups sequentially, but groups contents in parallel.

Other things is that it is trying to create a ThreadPoolExecutor with as many workers as scripts to run. That doesn't seem to be a good idea. My suggestion would be to have the number of max_workers to be actually part of the cmdline arguments.

sfc-gh-tmathew commented 3 months ago

The gain in performance, anecdotally does not, justify the complexity of managing exceptions. We will not be including this change into the main branch.

Thank you for your contributions