Trait versioning design sketch

feltech commented 4 months ago

Closes #88. As due diligence before going ahead with an implementation, we need to sketch out what a versioned trait package would look like, and how hosts and managers would interact with it.

So add a contrived trait package with two versions. The "generated" view classes have been manually created to reflect what the traitgen output should be, once implemented.

These view classes are then used in a tutorial notebook that describes how one would go about working with versioned traits.

This work may be abandoned or adapted in favour of more concrete documentation as the feature is developed.

themissingcow commented 4 months ago

Great work folks - quick question - I wondered if you had considered exposing the version of the traits themselves in the class hierarchy, vs at the package level? So you could see more easily in code how the versions work. Eg (or similar):

openassetio_example.traits.example.Unchanged_v1

openassetio_example.traits.example.Updated  # Could be added as a automatic/versionless wrapper later
openassetio_example.traits.example.Updated_v1
openassetio_example.traits.example.Updated_v2

So it is a little more explicit. It might be counterintuitive that openassetio_example.v2.traits.example.Unchanged is actually v1 and you'd have to inspect the Id string to know it was the same as openassetio_example.v1.traits.example.Unchanged? Particularly when there are large numbers of defined traits/specs.

You could leave multiple version declarations in the YAML and deprecate them as-and when, which would give any specific package useful information about the history for migrations etc.

[edit] I think this leaves the logic described largely unchanged, and could also be used to version specifications more explicitly at the code level so they behave consistently with traits - which is one less thing to explain.

   Updated:
     - version: 1
        description: An example.
        usage:
          - entity
          - locale
          - relationship
        properties:
          propertyToKeep:
            type: string
            description: A property that is unchanged between versions.
          propertyToRename:
            type: boolean
            description: >
              A property that has an inappropriate name and should be renamed in the
              next version.
          propertyToRemove:
            type: boolean
            description: A defunct property that should be removed in the next version.
    -  version: 2
        description: An updated example.
        ...
        properties:
          propertyToKeep:
            type: string
            description: A property that is unchanged between versions.
          propertyThatWasRenamed:
            type: boolean
            description: A property that has been renamed.
          propertyThatWasAdded:
            type: float
            description: A new property added in the latest version.

feltech commented 4 months ago

Great work folks - quick question -

Thanks for the feedback! I think its worth at least considering this alternate formulation. Will spend some time sketching it out. Some initial thoughts:

I wondered if you had considered exposing the version of the traits themselves in the class hierarchy, vs at the package level? So you could see more easily in code how the versions work.
So it is a little more explicit. It might be counterintuitive that `openassetio_example.v2.traits.example.Unchanged` is actually v1 and you'd have to inspect the Id string to know it was the same as `openassetio_example.v1.traits.example.Unchanged`? Particularly when there are large numbers of defined traits/specs.

As a general point, not addressing your comment, but worth bearing in mind - I don't think there's any way around having to manually discover changes between versions (in the release notes/library). At some point a human is going to have to look at the docs/code and explicitly write bespoke handler code to work with multiple versions.

That said, putting the version in the class name could help users discover that there are multiple versions available, e.g. via IDE code completion. Would perhaps make the code slightly more readable too, though I'd imagine any code being conditional on trait versions would be pretty clear.

One advantage of versioning at the namespace level is that multiple subpackages can coexist, potentially from multiple sources. Not really explored in the sketch, but v1 and v2 could be separate PyPI (namespace) packages.

Removing the need for a schema version is alluring, though without further changes it removes the ability to have/use two different Specification versions in the same global symbol space (relates to your final point, below).

There's nothing stopping us doing both. We can keep the top-level schema namespace and add duplicates/aliases of lower schema trait views with _vX suffixes.

You could leave multiple version declarations in the YAML and deprecate them as-and when, which would give any specific package useful information about the history for migrations etc.

This is something we have vaguely considered, but haven't properly sketched out.

The example presented vaguely assumes multiple YAML files are passed to traitgen, one for each schema version, but the traitgen interface was deliberately left vague because we need to consider this more thoroughly.

An infinitely growing schema feels like a bad idea, but we could mark deprecations and cull old versions periodically.

There is a small danger that changes could be made to an old version (by accident), making it incompatible with a previous release.

[edit] I think this leaves the logic described largely unchanged, and could also be used to version specifications more explicitly at the code level so they behave consistently with traits - which is one less thing to explain.

Would the idea here be to have e.g. ImageSpecification_v2 in a similar way to traits? I think that introduces a bit more ambiguity to specifications. A specification as conceived currently in this notebook, i.e. versioned by the schema it operates in, makes it clear that it is composed of traits from that schema version (not really enforced, but seems reasonable to assume); whereas an independently versioned specification doesn't tell you anything about what trait versions it composes.

feltech commented 4 months ago

^ (1fbaee9) updated the sketch to add support for versioned suffixes on trait view classes, e.g. UpdatedTrait_v2. This assumes we continue to have per-schema namespaces, but bundle into the namespace all trait versions in that schema version and below - hoping to have the best of both worlds. Still a bit WIP, but some observations

How should removed traits be treated? This is an artifact of the have-your-cake-and-eat-it solution. I.e. its solved if we have a single uber-schema (deprecate and remove eventually), or if we don't duplicate older versioned traits across schema versions.
For traits, it allows you to use just one schema version - i.e. the latest version. Specifications still need to be referenced under a schema version namespace. Specifications could also have an independent version suffix in the same way as traits, but my previous argument against that (above) is that then you can't tell at a glance which schema its constituent traits belong to.
Versioned suffixes allows a terser syntax when all you care about is the "latest" trait version, or where you know the trait hasn't changed. I.e. you can omit both the _vX suffix and the schema version namespace.
The loss of visibility on which schema a particular trait comes from makes side-by-side usage with Specifications a bit less obvious and more bug prone.

themissingcow commented 4 months ago

Thanks so much for fleshing that out @feltech. Sorry a little short on time so a quick reply, please forgive any brevity.

That said, putting the version in the class name could help users discover that there are multiple versions available, e.g. via IDE code completion. Would perhaps make the code slightly more readable too, though I'd imagine any code being conditional on trait versions would be pretty clear.

Yeah - there would still be need to read the docs, but as you say - least it's visible in practice in the code. For me the programming time experience is much clearer/readable if the versions are in the classes not the package. Probably the starkest example would be that you don't have to import x as y to have them coexist or to be clear at the call site which the expected version it is. Eg:

from openassetio_mediacreation.traits.content import LocatableContent_v1, LocatableContent_v2

vs

from openassetio_mediacreation.v1.traits.content import LocatableContent as LocatableContent_v1
from openassetio_mediacreation.v2.traits.content import LocatableContent as LocatableContent_Two

Then when you use them at the call site you're using the actual classes vs aliases which may be defined inconsistently across a code base/projects.

The other situation being if someone has done the following, which makes it less clear in the subsequent code:

from openassetio_mediacreation.v1.traits.content import LoctableContent

Granted you'd have more places to change if there is a compatible but differently versioned update, but if they have a consistent class name hopefully thats easier, and when the version changes it needs consideration anyway?

One advantage of versioning at the namespace level is that multiple subpackages can coexist, potentially from multiple sources. Not really explored in the sketch, but v1 and v2 could be separate PyPI (namespace) packages.

What's the use case for that one? Feels like you say it would need to be in the actual PyPi package namespace vs the modules to allow different release versions to be installed side-by-side?

There's nothing stopping us doing both. We can keep the top-level schema namespace and add duplicates/aliases of lower schema trait views with _vX suffixes.

Could that potentially ambiguous as to whether LoctableContent_v1 is different as it's under a v2 namespace or not? ie is it v2 of v1? Might be complexity worth avoiding?

You could leave multiple version declarations in the YAML and deprecate them as-and when, which would give any specific package useful information about the history for migrations etc.

This is something we have vaguely considered, but haven't properly sketched out.

The example presented vaguely assumes multiple YAML files are passed to traitgen, one for each schema version, but the traitgen interface was deliberately left vague because we need to consider this more thoroughly.

An infinitely growing schema feels like a bad idea, but we could mark deprecations and cull old versions periodically.

Yeah, definitely - you'd probably want good management policy (lol), is one benefit though, that if say, you have 10 version where you just add traits, you don't have to worry about 10 copies of the 99 existing traits in 10 files? Feels a little more manageable.

There is a small danger that changes could be made to an old version (by accident), making it incompatible with a previous release.

Any more than any other change in the rest of the code base we self-regulate with semver?

Would the idea here be to have e.g. ImageSpecification_v2 in a similar way to traits? I think that introduces a bit more ambiguity to specifications. A specification as conceived currently in this notebook, i.e. versioned by the schema it operates in, makes it clear that it is composed of traits from that schema version (not really enforced, but seems reasonable to assume); whereas an independently versioned specification doesn't tell you anything about what trait versions it composes.

Yeah - was just highlighting that some concerns were mentioned that was is no outwardly visible versioning of specifications - the same mechanism as for traits could be used for them. Let's say you changed a trait in ImageSpecification, a host could then say they're using ImageSpecification_v1 or ..._v2 as a way to communicate which one they are using. This could be specially important where the version of "media creation" itself is somewhat lost as a transient dependency inside other packages.

updated the sketch to add support for versioned suffixes on trait view classes, e.g. UpdatedTrait_v2. This assumes we continue to have per-schema namespaces, but bundle into the namespace all trait versions in that schema version and below - hoping to have the best of both worlds. Still a bit WIP, but some observations

Nice one - thanks! I guess my only question is - what is the real concrete benefit of having the two version numbers that outweighs the maintenance complexity and potential ambiguity?

How should removed traits be treated? This is an artifact of the have-your-cake-and-eat-it solution. I.e. its solved if we have a single uber-schema (deprecate and remove eventually),

I'd prefer that approach (single file with multiple versions) as it feels much easier to manager at source and makes it easier to catch if you have inadvertently changed a pervious version. as the git diff will show it, vs you missing eyeballing a difference against a new file and an old one, that wouldn't show up in the diff unless you manually generate it.

Ultimately - what I like about that the most is that it allows you to be explicit about the life-cycles of these traits, as it is very clear in each "release" of trait/spec package, which ones are considered current, and/or deprecated.

It would also allow you to more easily add a deprecation warning mechanism to the older traits, without having to go back and update many many files.

Specifications still need to be referenced under a schema version namespace. Specifications could also have an independent version suffix in the same way as traits, but my previous argument against that (above) is that then you can't tell at a glance which schema its constituent traits belong to.

The loss of visibility on which schema a particular trait comes from makes side-by-side usage with Specifications a bit less obvious and more bug prone.

I guess I'm struggling to follow the relevance of a schema version, if all that matters at the data layer is the traits and their properties - which are now individually versioned. Where do you see the schema version being used? The main thing that matters with specs is which traits it is composing, and the schema version wouldn't tell you that as it doesn't tell you if they have changed or not. Having the version number as per traits gets you the same thing at a per-spec level? And would only change when the actual traits of the spec change, which make it clearer. Eg, ImageSpec may be identical in schemas v1-27, which isn't apparent, unlike ImageSpec_v1 -> ImageSpec_v2 in v24 of the package.

For side-by-side visibility, having it in the class name seems the most concrete? Otherwise you have to use full paths, and/or aliased imports? As per the trait example above?

elliotcmorris commented 4 months ago

I agree with Tom on the naming conversation, having the version in the class name itself is preferred.

Nice one - thanks! I guess my only question is - what is the real concrete benefit of having the two version numbers that outweighs the maintenance complexity and potential ambiguity?

I echo this, having read the examples. I understand there are potential benefits to having a schema version as well as a trait specific version number, but I'm thinking now that the complexity we bring on far outweighs them. At this stage of play, i'd rather versioning be difficult than complex.

How should removed traits be treated? This is an artifact of the have-your-cake-and-eat-it solution. I.e. its solved if we have a single uber-schema (deprecate and remove eventually),

Maybe that is the best approach, on reflection. A similar motive around trying not to be complicated about things.

I guess I'm struggling to follow the relevance of a schema version, if all that matters at the data layer is the traits and their properties - which are now individually versioned.

Echoing this as well. Now that we're putting data in the trait itself, the need to a version outside of it eludes me, save for some speculative compatibility patterns, unless I'm missing something? Is the primary motivation to tie specific trait versions to specifications?

feltech commented 4 months ago

Sorry a little short on time so a quick reply, please forgive any brevity.

Thanks, all good points. Definitely worth updating the sketch to remove any concept of a schema version, to see what shakes out.

One advantage of versioning at the namespace level is that multiple subpackages can coexist, potentially from multiple sources. Not really explored in the sketch, but v1 and v2 could be separate PyPI (namespace) packages.

What's the use case for that one? Feels like you say it would need to be in the actual PyPi package namespace vs the modules to allow different release versions to be installed side-by-side?

Python namespace packages allow you to install two PyPI packages that both have the same top-level namespace. So we could have openassetio-mediacreation-v1 and openassetio-mediacreation-v2 packages that both install under an openassetio_mediacreation top-level package (namespace) name, then further under v1 and v2 subpackages. In addition, we'd likely have an openassetio-mediacreation meta-package that installs all (or just the latest) -vX packages as dependencies, and wires up top-level aliases to the latest -vX schema version.

That way, a host could depend on openassetio_mediacreation and a manager plugin could depend on openassetio-mediacreation-v2, such that if the openassetio_mediacreation meta-package was too old for v2, then v2 could be installed alongside without conflict (the aliases would still point to v1).

It's a very Python-specific packaging technicality, so isn't the most important consideration, but worth noting that removing a top-level schema namespace would prevent us from making use of Python namespace packages in this way in the future.

Specifications still need to be referenced under a schema version namespace. Specifications could also have an independent version suffix in the same way as traits, but my previous argument against that (above) is that then you can't tell at a glance which schema its constituent traits belong to.

The loss of visibility on which schema a particular trait comes from makes side-by-side usage with Specifications a bit less obvious and more bug prone.

I guess I'm struggling to follow the relevance of a schema version, if all that matters at the data layer is the traits and their properties - which are now individually versioned. Where do you see the schema version being used? The main thing that matters with specs is which traits it is composing, and the schema version wouldn't tell you that as it doesn't tell you if they have changed or not. Having the version number as per traits gets you the same thing at a per-spec level? And would only change when the actual traits of the spec change, which make it clearer.

The Host sketch shows that managementPolicy provides a way for the host to detect a schema version that the manager understands.

Specifically in the sketch a preferred_schema_version is detected, which is then used in two places

Subtracting the appropriate version of a Specification from a trait set
Imbuing a trait on some data to publish, choosing the trait version we know the manager understands.

However, the relevant logic could probably be rewritten to make better use of Specifications instead.

Eg, ImageSpec may be identical in schemas v1-27, which isn't apparent, unlike ImageSpec_v1 -> ImageSpec_v2 in v24 of the package.

Yeah, versioning Specifications is a headache... should a Specification's version really change when a constituent trait version changes? I argued in the Notebook that, conceptually, Specifications are trait version agnostic, but practically, the Specification classes won't work unless you use/expect the trait versions that the Specification class was built against. Having a top-level schema version seems to solve that conflict between the conceptual and practical, since the Specification version and the versions of its constituent traits are all given by the schema version (notwithstanding cross-schema Specifications).

For side-by-side visibility, having it in the class name seems the most concrete? Otherwise you have to use full paths, and/or aliased imports? As per the trait example above?

Yeah full paths or aliases would be needed. Having a version tag in the Specification class name instead could be easier to recognize.

The problem is if Specifications are independently versioned, rather than linked to the schema version. Losing the schema namespace/tag means you lose the ability to know at-a-glance what schema version a Specfication's constituent traits come from. If you detect a trait set satisfies a particular Specification, then knowing the schema version of that Specification allows you to use the trait view classes of its constituent traits, without having to look it up in the YAML/code. This all relies on Specifications being kept up to date with the trait IDs in its schema version, which I think is reasonable.

feltech commented 4 months ago

^ (b1d9925b)

Removed the concept of a top-level schema version.
Hence combined all YAML and .py into a single package.
Hence introduced a deprecated attribute.
Minor tweak for version component in ID strings - added zero-padding, e.g. .v002, for future proof string parsing.

I quite like it.

My main concern with removing a schema version was for Specifications.

We'll need to independently version specifications, and bump the version whenever a constituent trait is updated, which goes against the philosophy that Specifications should be trait version agnostic.
We also "just have to know" which trait versions make up a particular Specification version, rather than knowing at-a-glance based on a top-level schema version.

However, in the sketch these issues aren't so bad. For the latter issue in particular, making proper use of the Specification view class ameliorates the problem of "just knowing" the correct trait view class to use.

Removing the top-level schema does of course simplify things, as reflected in the amount of text cut from the sketch.

Adding multi-version support to the YAML means a fairly large breaking change to our YAML (JSON-) schema. But so be it.

The tweak to ensure fixed-width version tags in trait IDs means that a future isImbuedToIgnoringVersion() function becomes trivial.

feltech commented 3 months ago

^ added a WIP DR to summarise the journey through this sketch. It's still in progress, no need to review yet.

Also not sure where such a DR should go.

This sketch that led to it is in the MediaCreation repo.
The prior related DR is in the core OpenAssetIO repo.
All actual changes required will be in the TraitGen repo.

I reasoned, without much conviction, that the applicability is to the ecosystem as a whole, and we've previously discussed that MediaCreation is best placed as a "landing page" for the OpenAssetIO ecosystem :man_shrugging:

feltech commented 3 months ago

^ After discussion, we've determined that we really want to avoid a breaking change in MediaCreation. So I've updated the sketch to say that, rather than the non-suffixed classes aliasing the "latest" version, they instead alias version 1, and in addition flag a deprecation warning.

This means MediaCreation should be entirely source compatible, even once we introduce versioning support.

feltech commented 3 months ago

DR is ready to review. Still not sure if it's in the correct repo, or if it should be pulled out to a separate PR?

themissingcow commented 3 months ago

DR is ready to review. Still not sure if it's in the correct repo, or if it should be pulled out to a separate PR?

Sorry not had time to catch up on all the changes - might TraitGen make sense? As ultimately, if I understand correctly, this is all an emergent phenomenon from there? And isn’t actually a core thing?

elliotcmorris commented 3 months ago

Sorry not had time to catch up on all the changes - might TraitGen make sense? As ultimately, if I understand correctly, this is all an emergent phenomenon from there? And isn’t actually a core thing?

You're technically correct, but I can't imagine anyone going to TraitGen to find project planning information. I say we put it here, this is where people will probably come.

feltech commented 3 months ago

I think even if the DR is put in this repo, we should separate it from this notebook, since we can't merge this notebook until the implementation is done. Whereas, ideally, we should merge the DR before embarking on the implementation.

As for which repo to put the DR in...

Perhaps, if we had the resource, we'd split things up better and set up CI to run Jupyter in all repos, move DRs and notebooks to the repos that are most associated with them, and have a central ecosystem landing page/repo that links to them all.

Alas, in the meantime, I'd also propose to put the DR in MediaCreation, since its not a core library concern, there is precedent for DRs in MediaCreation, and it keeps the DR next to the design notebook.

There is perhaps something in the point that the DR will affect the YAML structure, which is a concern for MediaCreation, since it's a document as much as an input to codegen.

I wonder if we should move the previous DR on trait versioning (in-data vs. out-of-data) into MediaCreation, since they're obviously closely related? - I know there's an argument that the previous DR might have led to core library changes, so its reasonable to leave where it is.

feltech commented 3 months ago

Dropped the DR from this PR and created a new PR in this repo: https://github.com/OpenAssetIO/OpenAssetIO-MediaCreation/pull/95

feltech commented 3 months ago

since we can't merge this notebook until the implementation is done.

Hence set this PR back to Draft status, to prevent accidental merging.

feltech commented 3 months ago

Merging this is blocked on https://github.com/OpenAssetIO/OpenAssetIO-TraitGen/issues/80

OpenAssetIO / OpenAssetIO-MediaCreation

Trait versioning design sketch #90