Versioning and migration for `GufeTokenizable`s

dwhswenson commented 1 year ago

[x] Migration for adding keys
[x] Migration for removing keys
[x] Migration for renaming keys
[x] Migration for reorganizing keys within nested structures
[x] Full tests for nested structure migrations functions
[x] Try to preserve existing gufe keys (if possible, ensure that existing keys are not broken by this PR)
[x] Add GufeTokenizableTestMixin to tests of migration processes

We decided that GufeTokenizables need to be versioned, and that we need to provide a migration path to load old serialized representations of the GufeTokenizable. This PR provides the core functionality to do that.

Changes to `GufeTokenizable`

classmethod-property _version: Should return a sortable marking the serialization version.
classmethod serialization_migration: Takes a dict (possibly from another version of serialization, but otherwise suitable to be ingested by GufeTokenizable._from_dict) and a version, and converts the dict to ensure that it can be handled by the current version of the code.

Changes to the serialized representation of `GufeTokenizable`s

Now serialized versions of GufeTokenizables (described as a Python dict/JSON object) will carry the additional key ":version:", which describes the version. Serialized representations without that key will be treated as if its value was 1. Like __module__ and __qualname__, this is added/popped without intervention from contributors.

Labelling versions

I'm proposing we use versions that are simple increasing integers. In principle, users could use anything. However, I think that semantic versioning has no meaning here (the only purpose would be to distinguish breaks that can no longer be handled by the code, which is probably as easily handled with a simple version < oldest_supported_version). Tying to the version of the package also makes it more tedious to handle updates: first, this means that it requires more careful attention to the version each new thing was added in. Second, this is a nightmare for dev installs, because they lie about version number.

On reorganizing keys within nested structures

Although we identified this as target, I don't think this will be easily done with general GufeTokenizables. The issue is that, in deserialization, we can use gufe keys to shortcut through the tree of nested object, gaining some performance. However, if we want to rearrange the tree of the nested structure, we need to reconstruct the entire structure, and the make the modifications. This is a significantly different process than the one we use now, and may be slower.

That said, moving an attribute within a digraph is probably an indication that the original design was significantly flawed, which is not something I think we need to worry about with GufeTokenizables. For example, I have a hard time imagining that we would want to transfer Transformation.protocol.settings to Transformation.chemical_systems.protocol_settings or transfer a hypothetical AlchemicalNetworkCreator.mapper.timeout to AlchemicalNetworkCreator.timeout.

Where this will matter is in Settings, where we expect rapid flux in the near future. However, Settings are not GufeTokenizables. Therefore, their "key-encoded" representation is still a nested dict, and the parent GufeTokenizable (the Protocol) can manage migrations. As of opening this PR, tools to facilitate that haven't been added; that will be a future step, if this PR isn't nixed.

With this PR, changes in settings will require changes in all parent protocols. We could additionally create similar versioning of Settings, which would duck-type the same as the approach here, but build on tools within the JSON codecs instead of the core GufeTokenizable. This can be done, but would be part of a separate PR. That said, a change in organization (moving an attribute from cousin to cousin) will always require that the correction be handled at a common ancestor (grandparent). Purely local changes are not possible for a change that is fundamentally nonlocal.

Sequential migrations

This isn't illustrated here, but the pattern to use is actually something like this:

if version <= 1:
    dct = ... # do stuff for v1 -> v2
if version <= 2:
    dct = ... # do stuff for v2 -> v3
if version <= 3:
    # etc

This does a series of sequential migrations, and continues to support version 1, 2, and 3 with minimal code. If we find that the chain of versions gets too long, it might be worth creating some skips in there, but I kind of doubt it will ever be worth going to the "correct" solution of creating the directed network and finding the minimal path.

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (f87e236) 99.18% compared to head (42565aa) 99.20%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #227 +/- ## ========================================== + Coverage 99.18% 99.20% +0.02% ========================================== Files 36 36 Lines 1843 1892 +49 ========================================== + Hits 1828 1877 +49 Misses 15 15 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

dwhswenson commented 1 year ago

I agree wrt the rearrangement case, it sounds like this can be done by the tokenizable that holds Settings?

Yes; there's a test that serves as example in 0161938. This is also the downside of this approach: Any change we make to common settings will need to need to be changed in all protocols. And since it is the version of that parent object that matters, there's not much we can do within this approach to make it easier (a settings change that happens in between v2 and v3 OpenFE RBFE protocol could affect v1 to v2 in a Perses non-eq protocol; so the version number isn't the same and can't be used in for migration in a superclass).

Basically, think about the tree structure we have of objects nested in other objects: Protocol, Settings, Subsettings. For a reorganizational change, the version has to be changed somewhere on the tree that knows about both the "move from" and "move to" components -- Protocol knows everything, but earlier parts of the settings tree also might know both.

Now color each of the nodes in that tree by who owns them. Because we don't have a strict separation of ownership at some level of the tree (some settings/subsettings are gufe, some are protocol dev; Protocol is owned by protocol dev) what happens is that any reorganization change we make could affect versioning of something that we don't own.

In general, we've discussed the need to redesign settings in the future. I think we're going to have to settle for an imperfect solution here, include these issues in our settings redesign.

richardjgowers commented 1 year ago

@dwhswenson give me a ping when this is ready for review

dwhswenson commented 1 year ago

@richardjgowers : This should be ready for review.

Additional notes on changes here:

The ':version:' key is considered as a "default value" for serialization purposes. Since it is constant within a version of the code, this makes sense to me. This also made it very easy to preserve gufe keys for most objects.
Gufe keys on LigandNetworks are not preserved. The reason for this is that the ligand network _to_dict explicitly includes the full serialization of the ligands. This includes the version number for the SMC serialization. We could preserve the gufe key now by not including the version (doing that was -- you guessed it -- about 4 lines of code) but the downside of that approach was that a change to the serialization format of SmallMoleculeComponent would then need us to also change the version of the LigandNetwork serialization. With this approach, it should be easier to just use the SMC serialization versions to deserialize.
I walked back the classmethod-property -- apparently that's deprecated starting in Python 3.11. Doing it as just a classmethod (needs to be called from another classmethod, so can't be an instance method.)

pep8speaks commented 1 year ago

Hello @dwhswenson! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file gufe/tests/test_serialization_migration.py:

Line 25:23: E203 whitespace before ':' Line 29:1: E302 expected 2 blank lines, found 1 Line 38:1: E302 expected 2 blank lines, found 1 Line 39:20: E203 whitespace before ':' Line 52:1: E302 expected 2 blank lines, found 1 Line 84:26: E261 at least two spaces before inline comment Line 175:1: E302 expected 2 blank lines, found 1

In the file gufe/tokenization.py:

Line 159:1: E302 expected 2 blank lines, found 1 Line 192:1: E302 expected 2 blank lines, found 1 Line 217:1: E302 expected 2 blank lines, found 1 Line 230:51: W605 invalid escape sequence '.' Line 230:54: W605 invalid escape sequence '[' Line 230:57: W605 invalid escape sequence ']' Line 235:1: E302 expected 2 blank lines, found 1 Line 247:1: E302 expected 2 blank lines, found 1 Line 259:1: E302 expected 2 blank lines, found 1

Comment last updated at 2023-11-16 18:00:36 UTC

dwhswenson commented 11 months ago

@richardjgowers : This is ready for another look. Changes you requested were all docstrings/typing, so this should be good.

dwhswenson commented 11 months ago

Yeah, maybe schema_version? (I would normally prefer serialization_version, but since spelling disagreements might lead to confusion....) 😆

richardjgowers commented 11 months ago

Yeah schema version works for me

On Fri, 3 Nov 2023 at 17:28, David W.H. Swenson @.***> wrote:

Yeah, maybe schema_version? (I would normally prefer serialization_version, but since spelling disagreements might lead to confusion....) 😆

— Reply to this email directly, view it on GitHub https://github.com/OpenFreeEnergy/gufe/pull/227#issuecomment-1792852842, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4DJZSBJOQ5HTSN453YCUSTLAVCNFSM6AAAAAA4MUGK2CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJSHA2TEOBUGI . You are receiving this because you were mentioned.Message ID: @.***>

richardjgowers commented 11 months ago

@dwhswenson looks like we have consensus on _schema_version moving forwards

OpenFreeEnergy / gufe