bikeshedder / tusker

PostgreSQL migration management tool
The Unlicense

Is there any way to manage execution order of schema files? #20

Open QtRoS opened 2 years ago

QtRoS commented 2 years ago

I have a bunch of SQL files, each named exactly like the table it creates. Some of these tables have foreign keys pointing to "other tables". Tusker sorts the schema files, so at some point those "other tables" simply don't exist yet. Is there any way to specify the execution order?

If not, I could implement some kind of clever execution logic: split the schema files into statements, execute them one by one, remember the failed ones and retry them later. What do you think about this idea?

bikeshedder commented 2 years ago

The execution order is determined by the file name. In my own projects I prefixed all files with a number:

(screenshot: 14 schema files, each prefixed with a number)

The retry logic you're describing would result in very long runtimes. If any file contains an unrecoverable error you would need to try out all possible combinations. In the screenshot above you can see that I have 14 SQL files. That's already up to 87,178,291,200 possible combinations (14 factorial). Trying those out isn't feasible at all.

When writing the first version of Tusker I had the following idea in mind: annotations that describe the dependencies between files:

--> depends: a.sql, b.sql
CREATE TABLE c (
    -- ...
);

The algorithm for this kind of dependency resolution isn't hard to write, and reading that annotation from the migration files isn't either. I just didn't feel the need for this kind of feature yet.
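A minimal sketch of what such a resolver could look like, assuming a hypothetical "depends:" comment near the top of each file as in the example above (none of these helpers exist in Tusker today):

import re
from pathlib import Path

# Matches the annotation from the example above, e.g. "--> depends: a.sql, b.sql"
DEPENDS_RE = re.compile(r"^\s*--+>?\s*depends:\s*(.+)$", re.MULTILINE)

def read_dependencies(path: Path) -> set[str]:
    """Return the file names listed in the file's 'depends:' annotation, if any."""
    match = DEPENDS_RE.search(path.read_text())
    if not match:
        return set()
    return {name.strip() for name in match.group(1).split(",")}

def resolve_order(files: list[Path]) -> list[Path]:
    """Topologically sort schema files so that dependencies are executed first."""
    by_name = {f.name: f for f in files}
    deps = {name: read_dependencies(f) for name, f in by_name.items()}
    ordered: list[Path] = []
    done: set[str] = set()

    def visit(name: str, stack: tuple[str, ...] = ()) -> None:
        if name in done:
            return
        if name in stack:
            raise ValueError(f"circular dependency involving {name}")
        for dep in sorted(deps.get(name, set())):
            if dep in by_name:          # ignore references to unknown files
                visit(dep, stack + (name,))
        done.add(name)
        ordered.append(by_name[name])

    for name in sorted(by_name):
        visit(name)
    return ordered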

In the case of circular dependencies between a and b you still need to take care of this manually: you need a third SQL file, depending on a.sql and b.sql, that adds the circular FKs to a and b. If you want to go one step further and make this automatic as well, the only available solutions I know of are custom DSLs (such as SQLAlchemy or Django). I've heard people use those tools with great success just for migration management.

QtRoS commented 2 years ago

That's already up to 87,178,291,200 possible combinations (14 factorial). Trying those out isn't feasible at all.

Actually, it is not as bad as you described. We just need to think at the statement level, not the file level. Imagine 100 files, 10 of which contain statements that reference objects from other files (let's say tables, for simplicity). The algorithm is:

  1. Parse these 100 files and extract their statements (for example 300).
  2. Put the statements into a queue (a deque if we speak about Python).
  3. Loop: execute statement by statement until the queue is empty.
  4. On failure, skip the error, re-add the failed statements (for example 10) to the end of the queue, and go to 3.

After a while the queue length will drop to zero (shrinking on every success, staying the same on failure). So the algorithm converges (on solvable inputs). Of course we need some kind of check for circular dependencies, which is quite easy to implement: if the number of failures in a row is bigger than the queue length, it's time to break (unsolvable).
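A rough sketch of that loop, assuming the statements have already been split out of the files and that execute() runs a single statement in its own transaction (both are placeholders, not existing Tusker APIs):

from collections import deque

def run_with_retries(statements, execute):
    """Execute statements, re-queueing failed ones until the queue is empty
    or no statement can make progress anymore (unsolvable dependencies)."""
    queue = deque(statements)
    failures_in_a_row = 0
    while queue:
        stmt = queue.popleft()
        try:
            execute(stmt)            # e.g. run the statement in its own transaction
            failures_in_a_row = 0    # progress was made, reset the counter
        except Exception:
            queue.append(stmt)       # retry later, after everything else
            failures_in_a_row += 1
            if failures_in_a_row > len(queue):
                raise RuntimeError("no remaining statement can be executed; "
                                   "probably a circular or unsatisfiable dependency")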

It's kinda brute force, but it works in practice. I already have a draft of such an implementation and can share it with you.

It's not a crazy big change, but still a big change for the project. Let me know if you'd rather keep it simple.

BTW, you may be wondering what kind of problem I'm solving. My folder structure looks like:

db\
    public\
        tables\
            table_a.sql
            table_b.sql
            table_c.sql
        views\
            view_a.sql
            view_b.sql
        procedures\
            proc_a.sql
            proc_b.sql
    schema1\
        tables\
            table_x.sql
            table_y.sql
            table_z.sql
    schema2\
        tables\
            table_m.sql
            table_n.sql
            table_o.sql

Every SQL file is named after the SQL object it creates. Every file can contain statements that depend on other file(s). Such a structure is quite handy to manage, but it's very hard to use with tools other than SQLPackage, which handles Visual Studio database projects in a very elegant way. Tusker with Postgres is my only hope these days for something similar to SQLPackage for MS SQL.

bikeshedder commented 2 years ago

It's true that it's not an O(n!) case but rather O(n^2), as we just want to run every statement exactly once and can remove it from the queue once it has been executed successfully.

It is, however, possible to construct a schema where a and b can be reordered but c will only succeed if a is executed before b. Off the top of my head I can think of several ALTER TABLE statements which don't fail regardless of the order they're executed in, but where the resulting schema differs depending on the execution order. Any code that introspects the current schema and makes changes to it could also be problematic.

It's arguably bad style but I imagine there are other statements that could cause subtle changes to the schema depending on the execution order.

Don't get me wrong. I like the idea. I'm just not 100% convinced yet that it's actually a safe thing to do. The configuration that enables this behavior needs to be marked as "dangerous" at least.

bikeshedder commented 2 years ago

I really like this feature, but I wouldn't make it something that happens automatically every time you call tusker diff. It should be some kind of tusker deps command that tries to execute the schema files and outputs a file containing the order in which they need to be executed.

QtRoS commented 2 years ago

@bikeshedder

I wouldn't make it something that happens automatically every time you call tusker diff

Absolutely, agreed. What do you think about implementing this as an alternative way of doing the diff, controlled by... let's say an option in the [schema] block of the TOML file?

Just for information: reordering files was my very first idea. It looks nice and easy at first, but gets harder to implement the more you think about it. Working with statements makes life much easier. I remember the old days of working with MS SQL without SQLPackage; back then we ended up working with statements too. In MS SQL they are separated by the "GO" keyword, not by a semicolon token like in Postgres. So we just needed to split the input file on the "GO" separators, execute the parts one by one, and roll back in case of errors. The same approach works fine in Tusker so far. It's a bit slower because you need to commit a transaction after every statement (or just reopen the cursor). For 6k+ statements (273 files) it takes about 20 seconds to get a diff. Not great, not terrible.

bikeshedder commented 2 years ago

How about we treat each file as one execution unit that must succeed as a whole? That limits the number of transactions and reorderings and doesn't require parsing the PostgreSQL grammar.

Just splitting by ; is not enough, as semicolons can appear in comments, string literals, function bodies, etc. Splitting the SQL files into individual statements is far from trivial and would be worth its own library on PyPI.
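A small illustration of why a naive split on ';' falls apart (the function below is just an example):

sql = """
CREATE FUNCTION touch() RETURNS trigger AS $$
BEGIN
    NEW.updated_at := now();  -- semicolons inside the function body
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

COMMENT ON FUNCTION touch() IS 'a string with a ; in it';
"""

# Naive splitting cuts the function body and the string literal apart:
fragments = [part.strip() for part in sql.split(";") if part.strip()]
print(len(fragments))  # 6 fragments instead of the 2 actual statements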

Executing files as a whole also eliminates the need to track individual statements within the files. We can just create an order in which the files need to be executed, which could be as simple as a text file listing the schema file names in that order.

This file should also contain a checksum of each schema file so the code can detect changes and let the user know that a reordering might be necessary if the list doesn't match the current files on disk.
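A minimal sketch of what writing and checking such an order file could look like (the file name and format are made up for illustration):

import hashlib
from pathlib import Path

ORDER_FILE = Path("schema_order.txt")  # hypothetical name

def write_order(ordered_files: list[Path]) -> None:
    """Record the resolved execution order together with a content hash per file."""
    lines = [f"{hashlib.sha256(f.read_bytes()).hexdigest()}  {f.name}"
             for f in ordered_files]
    ORDER_FILE.write_text("\n".join(lines) + "\n")

def order_is_current(files: list[Path]) -> bool:
    """Return True if the recorded order still matches the schema files on disk."""
    if not ORDER_FILE.exists():
        return False
    recorded = {}
    for line in ORDER_FILE.read_text().splitlines():
        if line.strip():
            digest, name = line.split(None, 1)
            recorded[name] = digest
    current = {f.name: hashlib.sha256(f.read_bytes()).hexdigest() for f in files}
    return recorded == current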

For circular FKs and more complex things the user is forced to create separate files, moving those statements into their own execution unit. But that's not necessarily a bad thing, as it makes those relationships very obvious and makes the execution of migrations less magical.

QtRoS commented 2 years ago

How about we treat each file as one execution unit that must succeed as a whole

That can be unsolvable for the file structure I use. Imagine a file which creates a table and adds some comments, constraints, and foreign keys. The tables those foreign keys reference may not exist yet at that moment.

Splitting the SQL files into individual statements is far from trivial and would be worth its own library on PyPI.

Yeah, such things would need a proper language parser to be 100% correct. For now I'm proposing an approximation that works well in practice.

This file should also contain a checksum of each schema file so the code can detect changes and let the user know that a reordering might be necessary if the list doesn't match the current files on disk.

That's a step back towards tools like Liquibase (which tracks executed migrations). The power of Tusker, as I see it, is in doing everything automatically (almost magically).

You know, if this is too breaking a change for now, I can hard-fork Tusker. Either way, if you're interested I'll share the implementation. What do you think?

bikeshedder commented 2 years ago

That can be unsolvable for the file structure I use. Imagine a file which creates a table and adds some comments, constraints, and foreign keys. The tables those foreign keys reference may not exist yet at that moment.

That's exactly what I meant by "create separate files moving those statements into their own execution unit". I currently see little value in the very error-prone extra work of implementing a parser capable of handling every SQL file that PostgreSQL understands.

With those circular FKs you can't create the FK constraint as part of the CREATE TABLE statement anyway, so I see little value in keeping it in the same file. Personally I'd be just as happy with something like:

- my_table.sql
- my_table_fks.sql

If you really want a single-file solution, may I propose a special comment that is very hard to include by accident, e.g. something like -- SNIP --, which is very unlikely to ever appear on its own line unless someone is trying to write a multi-line default value containing that exact text.
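Splitting on such a marker stays trivial precisely because no SQL parsing is involved; a rough sketch:

def split_execution_units(sql: str, marker: str = "-- SNIP --") -> list[str]:
    """Split a schema file into execution units at lines consisting solely of the marker."""
    units, current = [], []
    for line in sql.splitlines():
        if line.strip() == marker:
            units.append("\n".join(current))
            current = []
        else:
            current.append(line)
    units.append("\n".join(current))
    return [unit.strip() for unit in units if unit.strip()]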

If you implement that automatic dependency resolver I'd be happy to use it myself and ship it as part of the next Tusker release.

QtRoS commented 2 years ago

If you implement that automatic dependency resolver

I will, but at the statement level :) So for now it's better for me to do a hard fork. I'll share the implementation anyway.

Thank you for the conversation!

haydenbr commented 2 years ago

Just wanted to shed some light here. I've only ever used SQL Server, and despite all the Visual Studio and Microsoft cruft, it has a really nice set of tools called SQL Server Data Tools (SSDT). One of the major benefits of SSDT is that you can declaratively describe your database schema in SQL scripts, and SSDT will look at those scripts, look at a target database, and generate a script (and optionally run it) to bring your database in line with the desired schema. And SSDT handles the execution order of the schema scripts. I was looking for something similar for Postgres and stumbled on Tusker (and other similar migra-based tools, but I think Tusker would be my preferred option at this point).

I was kind of bummed to realize that with Tusker I would have to manage the order of the schema scripts. But I get it. If I understand correctly, it's a limitation of splitting the schema into multiple scripts and of how migra works. I had a really simple example that I was using to evaluate tools: a users table and a cars table with a foreign key constraint referencing users. Splitting these into users.sql and cars.sql, cars.sql ran first and complained that users didn't exist. Prepending a 1 to the name of users.sql fixed the issue. I assume the error happens when building the "clean" or "canonical" database that's used to diff against the target database.

That being said, I feel like a per-file solution should be pretty straightforward to implement and would be a pretty good first step. Like, maybe a per-statement solution is the 100% correct thing to do, but a per-file solution would be much less effort and would offer a big improvement in terms of managing schema files.

I wouldn't mind taking a stab at it sometime in the next few weeks. Rather than a separate command like tusker deps, I wonder if it would be better as a command-line flag, something like tusker diff --resolve-deps? For my use case at least, I can't imagine a scenario where I would want to run diff without it resolving the execution order for me, but I'm willing to accept that my opinion may not represent the majority.

As a next step, one could also imagine caching the resolved dependencies so that future runs would be quicker.

Vija02 commented 1 year ago

Personally, I think having one file as one execution unit is a good compromise. Annotating a priority instead of dependencies would also avoid a lot of potential issues. It's sort of a standard migration setup, except that it reads the annotated priority instead of sorting by file name, allowing you to use whatever file/folder structure you want.
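As a sketch of that idea (the "priority:" annotation format is invented here, not something Tusker supports): each file carries a numeric priority and the tool simply sorts by it, falling back to the file name for ties:

import re
from pathlib import Path

PRIORITY_RE = re.compile(r"^\s*--\s*priority:\s*(\d+)", re.MULTILINE)

def sort_by_priority(files: list[Path]) -> list[Path]:
    """Order schema files by an annotated '-- priority: N' (lower numbers run first)."""
    def key(f: Path) -> tuple[int, str]:
        match = PRIORITY_RE.search(f.read_text())
        priority = int(match.group(1)) if match else 1_000_000  # unannotated files run last
        return (priority, f.name)
    return sorted(files, key=key)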

But I also think this is probably better implemented as a separate package/software.