clockworklabs / SpacetimeDB

Multiplayer at the speed of light
https://spacetimedb.com

Actually implement a way for clients to migrate their tables on update #661

Closed cloutiertyler closed 1 month ago

cloutiertyler commented 7 months ago

Kim:

It seems there are different expectations about how this should work.

The “obvious” evolution of the existing code is to run the migrate reducer after the schema has been updated according to the type definitions in the module (obvious because it is similar to init, just at a different point in the lifecycle). It means that module authors essentially only get the opportunity to move some data from the old schema into the new one, but what they can actually do depends on the API exposed to a module.
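
To make that first variant concrete, a rough sketch; the #[spacetimedb(migrate)] attribute, the Player table, and the row accessors used below are hypothetical and only illustrate the shape such a reducer could take, assuming the schema is already in its new form when it runs:

#[spacetimedb(table)]
pub struct Player {
    pub id: u64,
    pub name: String,
    // Newly added column; under this model it already exists when migrate runs.
    pub level: u32,
}

// Hypothetical attribute: the reducer runs once, after the host has applied
// the type-directed schema update, analogous to init but later in the lifecycle.
#[spacetimedb(migrate)]
pub fn migrate() {
    // All the author can do is move data around within the already-updated
    // schema, e.g. backfill the new column. Player::iter() and
    // Player::update_by_id() stand in for whatever row access the module API exposes.
    for old in Player::iter() {
        let id = old.id;
        let updated = Player { level: 1, ..old };
        Player::update_by_id(id, updated);
    }
}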

@Mario's comments suggest that we want migrate to be invoked instead of updating the schema in a type-directed way, so that the module author is fully responsible for writing ALTER TABLE-style schema transformations (the result has to match the type definitions, but that is for the author to ensure). If no migrate is defined in the module, we'd proceed as before, of course.
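
For contrast, a sketch of the second variant, assuming some schema-manipulation API were exposed to the module; MigrationContext, MigrationError, and the ctx methods below do not exist and are purely illustrative:

// Hypothetical: invoked *instead of* the automatic, type-directed schema update.
// The author is responsible for making the stored schema match the new type definitions.
#[spacetimedb(migrate)]
pub fn migrate(ctx: &mut MigrationContext) -> Result<(), MigrationError> {
    // ALTER TABLE-style operations; illustrative only.
    ctx.add_column("player", "level", ColumnType::U32, Some(ColumnDefault::U32(1)))?;
    ctx.drop_column("player", "deprecated_flag")?;
    Ok(())
}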

I don't really have a preference; I think it is more a question of how to explain it to users than of making migrations “safe” in any way. How do we agree on which route to take? /cc @Tyler Cloutier @John Detter

Kim:

Based on my current understanding of things, I think this could work as follows:

A database (and its schema) in SpacetimeDB is essentially immutable, as its identity is the hash of the program which defines it. Changing the program creates a new (empty) database.
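
A minimal sketch of that identity rule, not the actual SpacetimeDB implementation: if the identity is just a content hash over the compiled module, any change to the program necessarily produces a different identity and hence a fresh database.

// Illustrative only: identity as a content hash of the module bytes.
use sha2::{Digest, Sha256};

fn database_identity(module_wasm: &[u8]) -> Vec<u8> {
    Sha256::digest(module_wasm).to_vec()
}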

IIUC, the idea so far was to allow for a function like the following to be defined:

#[spacetimedb(migration)]
fn migrate(old: MyOldTable) -> Result<MyNewTable, MigrationError>

When publishing a new version of the program, this function would get invoked for each row in the old table, writing the output to the new table (this only works if a named db is used, of course). While that does the job in principle, it has a number of problems: migrating a large table may take a long time, generates a lot of I/O, and requires the old table to become read-only until the migration commits. The data would also need to be copied even if the change is a no-op (e.g. adding a new optional field, or one with a default value).
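
A sketch of what the host side of this eager approach amounts to (the function below is generic and illustrative, not an existing API); it makes the cost visible: one read, one call, and one write per row, with the old table frozen for the duration of the loop.

// Eagerly rewrite every row of the old table into the new table.
fn run_eager_migration<Old, New, E>(
    old_rows: impl Iterator<Item = Old>,
    migrate: impl Fn(Old) -> Result<New, E>,
    mut insert_new: impl FnMut(New),
) -> Result<(), E> {
    for old in old_rows {
        // The old table has to stay read-only while this loop runs.
        insert_new(migrate(old)?);
    }
    Ok(())
}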

However, if we start a new message log upon creating a new database version, such that the first record references the tip of the old log, nothing would actually need to be rewritten until values from before the schema change are read. That is, the function would become a pure mapping from A → B which gets invoked for each row read from A.
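
In that lazy model, the mapping only runs on the read path. A sketch under the same illustrative assumptions as above:

// Rows persisted before the schema change stay in the old shape until read.
enum StoredRow<Old, New> {
    PreMigration(Old),
    Current(New),
}

// Apply the pure mapping A -> B lazily, at read time.
fn read_row<Old, New>(stored: StoredRow<Old, New>, migrate: impl Fn(Old) -> New) -> New {
    match stored {
        StoredRow::PreMigration(old) => migrate(old),
        StoredRow::Current(row) => row,
    }
}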

The property of being able to “fork” databases in non-linear ways might actually be interesting.

The remaining question for me would be whether this mapping function should allow arbitrary code to be executed, or whether a more restricted language should be used (see for example https://www.inkandswitch.com/cambria/).

cc @Tyler Cloutier @John Detter

kim commented 7 months ago

I am not sure exactly how this ended up here, but I'll note that I don't think this will turn out any different from "traditional" migration tools. We'll need the whole suite of DDL operations for it, so I think @joshua-spacetime and @mamcx are probably better assignees.

I can provide more details if needed.