apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

[SIP-142] Improving Database Migration Management #29546

Open mistercrunch opened 4 months ago

mistercrunch commented 4 months ago

Motivation

Managing database migrations in Superset currently poses challenges such as deadlocks and downtime during DDL changes. We aim to improve the process to ensure smooth blue/green deployments, reduce downtime, and handle complex migrations effectively.

Proposed Change

  1. Impose Blue/Green-Safe Migrations as part of minor releases:

    • Blue/green migrations assume that the previous version of the app can ALWAYS run against the newer version of the database. This means a PR that contains a database migration cannot include destructive operations.
    • "theoritical but highly unlikely lossiness" can be accepted but understood and mitigated. For example if there's a short period of time during the migration where the old app might be updating a field that's effectively getting deprecated as part of the migration, and as a result that update/transaction might be lost in the shuffle, we can assess the likelihood of such operation and its critically, and accept it
  2. Functional Migrations: functional programming enforces idempotency and limits side effects. For Superset migrations, we want to apply these ideas in a way that limits mutations and allows the same migration to run twice safely. For example, when adding a new column, we probably want to check whether the column already exists, and delete/recreate it if necessary. More intricately, if we want to change the Dashboard.json_metadata schema, we probably want to avoid mutating it in place, and instead create a new Dashboard.json_metadata_v2 populated by a deterministic function that takes the original value and transforms it as needed before landing it in the new field (see the sketch after this list). In this case, rolling back is more a matter of pointing back to the original field than applying a reverse migration that tries to recreate the original state. Note that this implies race conditions around the flickering inherent to blue/green deployments and/or around rolling back. Philosophically, we prefer this set of tradeoffs to the ones that come with bidirectional mutations. In cases where lossiness is unacceptable, the solution may be to maintain both versions of the state for a time, meaning dashboard mutations after the migration would keep updating Dashboard.json_metadata AND Dashboard.json_metadata_v2 for a time.

  3. Accumulated Cleanup Migrations: Defer non-critical cleanup migrations (e.g., column deletions) to major version upgrades, collected in a file (superset/migration/future.py) that imposes a structure compatible with the alembic-based approach (one possible shape is sketched after this list).

  4. Downgrade handling: just as upgrades should be non-destructive and "functional", downgrades should follow the same principles. Where we are creating a new column, the associated downgrade should be a no-op. In many cases where we apply functional migration principles, downgrades become unnecessary, since we simply point back to the unaffected, previously existing state. With the example of mutating Dashboard.json_metadata into Dashboard.json_metadata_v2, the downgrade would simply revert to pointing at the original state, and re-executing the migration from the v1 to the v2 version of the field should be safe. As stated before, in cases where lossiness is unacceptable across migrations, we would need to enforce maintaining both fields/states for the duration of the grace window during which downgrades/re-upgrades are safe.

  5. Complex Migrations During Release Windows: Hold back complex migrations that can't comply with the requirements stated in this SIP until specific major release windows. Back to the Dashboard.json_metadata example: if lossiness is deemed critical and handling two states is too complicated, we can hold this migration back for a major release window. In that case, the release manager takes a database backup, schedules downtime, applies the migration and validates things before moving forward. For those releases, a forward-only approach is fine.

  6. Considerations for large environments: some migrations that touch potentially large models like Query or KeyValue require special attention. We should have considerations around locking, transaction atomicity/size, and migration duration in large environments. For example, a migration that adds a catalog column to the query table - and populates it with custom logic - requires special consideration, especially if we're targeting merging it outside of a major release. The main considerations are the potential for locking (indicating a need for smaller, more atomic transactions) and overall migration duration (ideally not taking hours to complete in a large environment). A long migration might not be a blocker for a minor-release merge, but UPDATING.md notes would be required (something like: "the catalog feature updates each and every query in the Query history table; this migration is expected to take roughly 1 hour per million queries - make sure you're aware of that before upgrading").

  7. Major release migration bundling: On major version releases, the "future" migrations are bundled into a chain of alembic migrations, and it's well understood that these migrations require downtime. PRs that were held up because their migrations require downtime are also merged. Major versions also flip the default feature flags to true and/or remove related codepaths.

  8. Enforce through a new CODEOWNERS policy specific to migrations: have a few committers familiar with this SIP act as code owners wherever migrations are concerned. The goal of the reviewer, in most cases, would be to show a path forward for minor-release approval, pointing to the SIP or documentation on how to make the PR blue/green safe.
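
To make items 2 and 4 concrete, here is a minimal, hypothetical sketch of what a blue/green-safe, idempotent migration could look like. The column name (json_metadata_v2), the transform function, and the revision ids are illustrative placeholders, not an actual Superset migration:

```python
"""Hypothetical blue/green-safe migration following items 2 and 4 (sketch only)."""
import json
from typing import Optional

import sqlalchemy as sa
from alembic import op

revision = "xxxx"       # placeholder
down_revision = "yyyy"  # placeholder


def upgrade_json_metadata(old: Optional[str]) -> Optional[str]:
    """Deterministic, pure transform from the v1 schema to the v2 schema."""
    if old is None:
        return None
    payload = json.loads(old)
    # ... reshape the payload for the new schema here ...
    return json.dumps(payload)


def upgrade():
    bind = op.get_bind()
    columns = {col["name"] for col in sa.inspect(bind).get_columns("dashboards")}

    # Idempotent: only add the new column if it is not already there, so the
    # migration can safely be re-run.
    if "json_metadata_v2" not in columns:
        op.add_column(
            "dashboards",
            sa.Column("json_metadata_v2", sa.Text(), nullable=True),
        )

    # Backfill the new column from the old one. The old column is left
    # untouched so the previous app version keeps working (blue/green-safe).
    dashboards = sa.table(
        "dashboards",
        sa.column("id", sa.Integer),
        sa.column("json_metadata", sa.Text),
        sa.column("json_metadata_v2", sa.Text),
    )
    for pk, old in bind.execute(
        sa.select(dashboards.c.id, dashboards.c.json_metadata)
    ):
        bind.execute(
            dashboards.update()
            .where(dashboards.c.id == pk)
            .values(json_metadata_v2=upgrade_json_metadata(old))
        )


def downgrade():
    # No-op: json_metadata was never touched, so rolling back is just the app
    # pointing at it again. Dropping json_metadata_v2 is deferred to an
    # accumulated cleanup migration at the next major release (item 3).
    pass
```

For a large table, the row-by-row backfill above would need to run in batches, as discussed in item 6 and in the comments below.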

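And one possible shape for the deferred-cleanup file mentioned in item 3, purely as a sketch; the structure and names below are assumptions rather than an existing Superset API:

```python
# Hypothetical sketch of superset/migration/future.py (item 3): destructive
# cleanup steps accumulate here and are only turned into real alembic
# migrations at the next major release. Structure and names are assumptions.
from dataclasses import dataclass
from typing import Callable, List

from alembic import op


@dataclass
class DeferredCleanup:
    """A destructive step that is only safe once the previous app version is retired."""

    description: str
    run: Callable[[], None]  # executed inside the bundled major-release migration


DEFERRED_CLEANUPS: List[DeferredCleanup] = [
    DeferredCleanup(
        description="Drop dashboards.json_metadata once json_metadata_v2 is canonical",
        run=lambda: op.drop_column("dashboards", "json_metadata"),
    ),
]
```
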
Labeling

Proposing a new set of migration-related labels

Examples

New or Changed Public Interfaces

New Dependencies

Migration Plan and Compatibility

Rejected Alternatives

mistercrunch commented 3 months ago

Relevant: https://github.com/apache/superset/issues/29801#issuecomment-2263052989 - posting this here and editing the SIP to tackle this type of issue

mistercrunch commented 3 months ago

@michael-s-molina I added the section "Considerations for large environments:". I'm removing the DRAFT tag and will open a PR setting us up as CODEOWNERS

rusackas commented 3 months ago

> @michael-s-molina I added the section "Considerations for large environments:". I'm removing the DRAFT tag and will open a PR setting us up as CODEOWNERS

There's also a "risk:db-migrations" label or one like it. I think this is automatically added by supersetbot, but we could make it effectively block PRs, if that's helpful.

mistercrunch commented 3 months ago

Oh I just found prior art here https://github.com/apache/superset/issues/13351

> PRs introducing database migrations must include runtime estimates and downtime expectations.

That one can be pretty tricky, though I think specialists might have a pretty decent sense for it and be able to craft a decent message for UPDATING.md

michael-s-molina commented 3 months ago

> PRs introducing database migrations must include runtime estimates and downtime expectations.

@mistercrunch that's what I'm currently doing while testing 4.1. If you code the migrations correctly, you generally do them in batches, so we can easily estimate the necessary time by computing how much time a batch takes and extrapolating it to the number of affected rows. We can even add helpers to do the estimation automatically while looping through batches of data.
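
As a purely illustrative sketch of that kind of helper (the function name, signature, and parameters are hypothetical, not an existing Superset utility):

```python
# Hypothetical helper: time the first few batches of a batched migration and
# extrapolate an ETA for the full table. Names are illustrative only.
import time


def run_in_batches(total_rows, batch_size, process_batch, estimate_after=3):
    """Run process_batch(offset, batch_size) over all rows, logging an ETA
    extrapolated from the batches measured so far."""
    elapsed = 0.0
    batches_done = 0
    total_batches = (total_rows + batch_size - 1) // batch_size

    for offset in range(0, total_rows, batch_size):
        start = time.monotonic()
        process_batch(offset, batch_size)  # each batch commits its own transaction
        elapsed += time.monotonic() - start
        batches_done += 1

        if batches_done == estimate_after:
            eta = (elapsed / batches_done) * (total_batches - batches_done)
            print(f"~{eta / 60:.0f} minutes remaining for {total_rows} rows")
```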

rusackas commented 2 months ago

@mistercrunch what's the next step on this one? Think it's ready for discussion?

mistercrunch commented 2 months ago

I think it's simply to have the newly set up CODEOWNERS apply the rules/guidelines stated here

mistercrunch commented 2 months ago

Do we need to vote? Can we just assume consensus and move on?

rusackas commented 1 month ago

Well, it's a SIP right now. If we're not going to vote, we should find some form of consensus around parts of it to move forward.

I don't think the CODEOWNERS change needs a SIP... that's just something we do (and might already be done?).

The rest of it, if we're setting up a best practice, would be to retract this as a SIP, convert it to a Discussion thread (of the Ideas variety) and then submit the Discussion thread to the Dev list for lazy consensus, seeking to publish the ideas as a best practice on the Superset Wiki.

Examples: https://github.com/apache/superset/discussions/27769 https://github.com/apache/superset/discussions/21914 https://github.com/apache/superset/discussions/19304

Holler if you want help converting it to a discussion and opening the Lazy Consensus thread.