cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
327 stars 93 forks source link

Use a database migration library to handle database updates #3533

Open kinow opened 4 years ago

kinow commented 4 years ago

Describe exactly what you would like to see in an upcoming release

From what I understand, at the moment we have Python code running with the Scheduler to perform database migrations.

Whenever a change requires changes to the database such as new fields, deleting a field, adding indexes, etc, developers must write a function in Python that prepares SQL commands to perform the migration.

Number of migrations

With this approach, we may have multiple functions after a while (I think we have one or two right now, but I only remembered it now because I read in some issue about adding something to the DB). If this number grows to something like 50 functions, we will probably need to move it to its own module.

Maintaining the code of the migrations can become hard too. Eventually we may decide to add the code to undo migrations too in case someone needs to downgrade. Existing libraries and tools allow users to keep migrations already in a separated place.

Database support

Related to #3360. At the moment those migrations do not work on other database implementations, as there is code specific to SQLite. I had to update the function a bit when experimenting with SqlAlchemy some time ago.

Alembic uses SqlAlchemy too, which can be database agnostic (Java that says it's multi-platform, you can write native code in Java that becomes platform dependent... the same goes for SqlAlchemy, there are ways to write code that is independent of database, or find a way to include things specific to Oracle, SQL, etc).

Rollback

Some libraries support this out of the box, while others (I think Flyway if I recall correctly) say that migrations should be forward only. But in anyway, this is a feature that if requested by users can be easily done with something like Alembic's downgrade.

Performance

From what I understand, at the moment each database upgrade function performs some simple check to confirm whether it needs to run or not.

One function may check whether a field exists in the tasks table. Another function may check if a table exists, and if not create it. And so it goes.

This is not super expensive - or at least it shouldn't be for a database provider under normal load.

I would need more time to confirm Alembic does the same as Flyway and other Java libraries I used for DB migrations. But normally, they use a table called something like "migrations". This table simply keeps a log of the migrations applied to the database.

The latest log shows the database/schema "version" or "migration". Using this, the Scheduler could simply perform this 3 steps operation:

  1. fetch the current version/migration in the database
  2. fetch the latest migration that we have on disk (i.e. in our Python code, what is the latest migration in - say - cylc.flow.database.migrations module or in some folder?)
  3. if these values do not match, then run each migration in sequence and update the current version/migration

There are no accesses to the database, no possible interruptions to any transaction anywhere, etc.

Reporting

Not super important, but I found it useful when trying SqlAlchemy as it has a built-in logging module, where you can easily do things like:

  1. Turn on logging of all SQL queries executed (independent of database)
  2. See the parameters for the SQL queries
  3. Inspect the model and plot diagrams or reverse-engineer from the code (even though I used SqlAlchemy core, and not ORM, we could still use the metadata about the tables and database programmatically)

And of course you can also use the same approach to log the mutations, and report exactly what was changed between database schemas when you upgraded Cylc (good for auditing I guess).

Additional context

The only Python library I know or have used is alembic, part of SqlAlchemy. This is the library used by JupyterHub. I think Django has its own database migration, but I never used that.

There are also tools that are built to run database migrations like Flyway.

By the way, when upgrading JupyterHub from 0.8 to 1.0, for example, I believe one of the steps to be performed by system administrators is to run a command that I think is something like jupyterhub --upgrade-db. This command just checks the latest migration on DB and on Disk (as explained above) and if the DB is out of date, then applies the migrations from disk. The idea of this issue would be to perform the same step within the Scheduler main loop, but without going through multiple functions, and using a library (DRY).

Pull requests welcome!

kinow commented 4 years ago

Speculative, no milestone set, no plans of working on it. Just reported it in case someone else works on #3360 someday, or if we have issues with the existing approach :+1: