mistercrunch opened 4 months ago
Relevant: https://github.com/apache/superset/issues/29801#issuecomment-2263052989 - posting this here and editing the SIP to tackle this type of issue
@michael-s-molina I added the section "Considerations for large environments:". I'm removing the DRAFT tag and will open a PR setting us up as CODEOWNERS
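For context, CODEOWNERS routes required reviews by path. A hypothetical entry covering the migrations tree could look like the snippet below (the path and handles are illustrative only, not the actual PR):

```
# Hypothetical CODEOWNERS entry: require a review from migration
# specialists on any change under the Alembic migrations tree.
/superset/migrations/ @mistercrunch @michael-s-molina
```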
There's also a "risk:db-migrations" label or one like it. I think this is automatically added by supersetbot, but we could make it effectively block PRs, if that's helpful.
Oh I just found prior art here https://github.com/apache/superset/issues/13351
> PRs introducing database migrations must include runtime estimates and downtime expectations.
That one can be pretty tricky, though I think specialists might have a pretty decent sense for it and be able to craft a decent message for `UPDATING.md`
> PRs introducing database migrations must include runtime estimates and downtime expectations.
@mistercrunch that's what I'm currently doing while testing 4.1. If you correctly code the migrations, you generally do them in batches, so we can easily estimate the necessary time by computing how much time a batch takes and extrapolate it to the number of affected rows. We can even add helpers to do the estimation automatically while looping through batches of data.
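A minimal sketch of what such a helper could look like (the name and signature are hypothetical, not an existing Superset utility): time the first few batches, then extrapolate to the total affected row count.

```python
import time


def estimate_migration_duration(total_rows, batch_size, process_batch, sample_batches=3):
    """Run the first few batches, time them, and extrapolate the total
    migration duration from the average observed batch duration.

    `process_batch(offset, limit)` is the caller-supplied function that
    mutates one batch of rows.
    """
    elapsed = 0.0
    processed = 0
    for i in range(sample_batches):
        start = time.monotonic()
        process_batch(i * batch_size, batch_size)
        elapsed += time.monotonic() - start
        processed += batch_size
        if processed >= total_rows:
            return elapsed  # migration finished during sampling
    seconds_per_row = elapsed / processed
    return seconds_per_row * total_rows
```

A migration looping over `Query` rows could log this estimate after the first few batches so operators know what to expect before committing to the full run.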
@mistercrunch what's the next step on this one? Think it's ready for discussion?
I think it's simply to have the newly set up CODEOWNERS apply the rules/guidelines stated here
Do we need to vote? Can we just assume consensus and move on?
Well, it's a SIP right now. If we're not going to vote, we should find some form of consensus around parts of it to move forward.
I don't think the CODEOWNERS change needs a SIP... that's just something we do (and might already be done?).
The rest of it, if we're setting up a best practice, would be to retract this as a SIP, convert it to a Discussion thread (of the Ideas variety) and then submit the Discussion thread to the Dev list for lazy consensus, seeking to publish the ideas as a best practice on the Superset Wiki.
Examples: https://github.com/apache/superset/discussions/27769 https://github.com/apache/superset/discussions/21914 https://github.com/apache/superset/discussions/19304
Holler if you want help converting it to a discussion and opening the Lazy Consensus thread.
Motivation
Managing database migrations in Superset currently poses challenges such as deadlocks and downtime during DDL changes. We aim to improve the process to ensure smooth blue/green deployments, reduce downtime, and handle complex migrations effectively.
Proposed Change
Impose Blue/Green-Safe Migrations as part of minor releases:
- **Functional Migrations**: functional programming enforces concepts around idempotency of tasks that limit side effects. For Superset migrations, we want to enforce these ideas in a way that limits mutations and allows running the same mutation twice. For example, when adding a new column, we probably want to check whether the new column exists, and delete/recreate it if necessary. More intricately, if we want to change the `Dashboard.json_metadata` schema, we probably want to avoid mutating it in place, and instead create a new `Dashboard.json_metadata_v2` and have a deterministic function that takes the original and transforms it as needed before landing it in this new field. In this case, rolling back is more a matter of pointing back to the original field than applying a reverse migration to try to recreate the original state from the new one. Note that this implies race conditions around the flickering that blue/green implies and/or around rolling back. Philosophically, we prefer this set of tradeoffs to the ones that come with bidirectional mutations. In cases where lossiness is unacceptable, the solution may be to maintain both versions of the state for a time, meaning dashboard mutations post-migration would keep updating `Dashboard.json_metadata` AND `Dashboard.json_metadata_v2` for a time.
- **Accumulated Cleanup Migrations**: Defer non-critical cleanup migrations (e.g., column deletions) to major version upgrades, in a file (`superset/migration/future.py`) imposing a certain structure that's compatible with the alembic-based approach.
- **Downgrade handling**: the same way that upgrades should be non-destructive and "functional", downgrades should follow the same principles. In the case where we're creating a new column, the associated downgrade should be a no-op. In many cases where we apply functional migration principles, downgrades will be unnecessary, as we would simply point back to the unaffected, previously existing state. With the example of migrating `Dashboard.json_metadata` into `Dashboard.json_metadata_v2`, the downgrade would simply revert to pointing to the original state, and re-executing the migration from the v1 to the v2 version of the field should be safe. As stated before, in cases where lossiness is unacceptable across migrations, we would need to enforce maintaining both fields/states for the duration of the grace window during which downgrades/re-upgrades are safe.
- **Complex Migrations During Release Windows**: Hold back complex migrations that can't comply with the requirements stated in this SIP for specific major release windows. Back to the `Dashboard.json_metadata` example: if lossiness is deemed critical and handling two states is too complicated, we can save this migration for a major release window. In this case, the release manager takes a database backup, schedules downtime, applies the migration, and validates things before moving forward. For those releases, a forward-only approach is fine.
- **Considerations for large environments**: Some migrations that touch potentially large models like `Query` or `KeyValue` require special attention. We should have considerations related to locking and transaction atomicity or size, as well as migration duration in large environments. For example, a migration that would add a `catalog` column to the query table - and populate it with custom logic - requires special considerations, especially if we're targeting merging it outside of a major release. Main considerations would be around the potential for locking (indicating a need for more atomic transactions) and overall migration duration (ideally not taking hours to complete in a large environment). A long migration might not be a blocker for a minor merge, but `UPDATING.md` notes would be required (something like: "the catalog feature updates each and every query in the Query history table; this migration is expected to take roughly 1 hour per million queries - make sure you're aware of that before upgrading").
- **Major release migration bundling**: On major version releases, the "future" migrations are bundled into a chain of alembic migrations, and it's well understood that these migrations require downtime. PRs that were held up for migrations that require downtime are also merged. Major versions also flip the default feature flags to true and/or remove related codepaths.
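To illustrate the check-before-mutate idempotency idea above: an upgrade step can inspect the live schema before acting, so re-running the same migration after a partial failure is safe, and the downgrade is a no-op. The sketch below uses stdlib `sqlite3` for portability; an actual Superset migration would be an Alembic revision, and the table/column names are just the example from this SIP.

```python
import sqlite3


def has_column(conn, table_name, column_name):
    """Inspect the live schema; PRAGMA table_info lists existing columns."""
    cols = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return any(row[1] == column_name for row in cols)  # row[1] is the column name


def upgrade(conn):
    # Idempotent: only add the column if a previous (possibly partial)
    # run hasn't already created it, so running twice is safe.
    if not has_column(conn, "dashboards", "json_metadata_v2"):
        conn.execute("ALTER TABLE dashboards ADD COLUMN json_metadata_v2 TEXT")


def downgrade(conn):
    # Non-destructive: leave json_metadata_v2 in place so a re-upgrade
    # doesn't have to recompute it; actual removal is deferred to an
    # accumulated cleanup migration at the next major release.
    pass
```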
- **Enforcement through `CODEOWNERS`**: Enforce through a new `CODEOWNERS` policy specific to migrations. Have a few committers familiar with this SIP act as code owners where migrations are concerned. The goal of the reviewer, in most cases, would be to show a path forward for minor release approval, pointing to the SIP or documentation as to how to make the PR blue/green safe.

Labeling
Proposing a new set of migration-related labels
- `hold:database-migration:needs-review`: apply a hold label (GitHub won't let people merge if there's a `hold.*` label)
- `hold:database-migration:next-major`: upon review, if the PR doesn't comply with the SIP requirements for minor releases, apply this hold label, indicating that it's been reviewed and held back until the next major. This implies that the person in charge of the major release merges these PRs during that window
- `database-migration:approved-for-minor`: as we release the hold above, we'd set this, effectively stamping the migration-related logic in the PR as approved for a minor release
- `database-migration:approved-for-next-major`: same as above, but approving for the next major

Examples
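As an example of the `hold.*` semantics above, a CI merge gate could fail whenever any `hold:`-prefixed label is present on the PR. This is a hypothetical sketch of the check itself (supersetbot's actual implementation may differ):

```python
def blocking_labels(labels):
    """Return the subset of PR labels that should block merging.

    Per the proposed convention, any label starting with "hold:"
    (e.g. "hold:database-migration:needs-review") is a merge blocker.
    """
    return sorted(label for label in labels if label.startswith("hold:"))


def merge_allowed(labels):
    """A CI check would fail (exit non-zero) when this returns False."""
    return not blocking_labels(labels)
```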
New or Changed Public Interfaces
New Dependencies
Migration Plan and Compatibility
Rejected Alternatives