caracal-pipeline / stimela

Stimela 2.0
GNU General Public License v2.0

add version specifiers to schemas (and potentially cabs and recipes) #306

Open o-smirnov opened 6 months ago

o-smirnov commented 6 months ago

Related to discussions with @JSKenyon... how do we deal with evolving CLIs? Multiple versions of, say, wsclean are already supported via image: version, but there's no way to tell Stimela that a particular parameter is only available with e.g. version 3.1 and up (or, conversely, has been deprecated).
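To make the problem concrete, here is a minimal sketch of a per-parameter version bound. Everything in it is an assumption for illustration: the `since` and `removed_in` fields, the `ParamSpec` class and the parameter names are hypothetical and are not part of Stimela or of any real cab, and versions are assumed to be plain dotted numbers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '3.1' into (3, 1).

    Trailing zeros are dropped so that '3' and '3.0' compare equal.
    """
    parts = [int(part) for part in text.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


@dataclass
class ParamSpec:
    # Hypothetical fields: the first version that has the parameter, and the
    # first version that no longer accepts it (None means "no bound").
    since: Optional[str] = None
    removed_in: Optional[str] = None

    def available_in(self, version: str) -> bool:
        v = parse_version(version)
        if self.since is not None and v < parse_version(self.since):
            return False
        if self.removed_in is not None and v >= parse_version(self.removed_in):
            return False
        return True


# Made-up parameter names, purely for illustration.
schema = {
    "new_option": ParamSpec(since="3.1"),
    "old_option": ParamSpec(removed_in="3.0"),
}
```

With this layout, `schema["new_option"].available_in("3.0")` is false and `schema["new_option"].available_in("3.1")` is true. A real implementation would more likely lean on an existing version-specifier library than on hand-rolled tuple comparison.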

Proposal:

Possible more advanced features:

Thoughts @sjperkins @SpheMakh @landmanbester?

sjperkins commented 6 months ago

I'll take a closer look at this during the week but, for the moment, it's probably worth mentioning that Schema Evolution is an entire topic in its own right: there's probably a good deal of prior knowledge that can be drawn on.

One example that springs to mind is Google Protocol Buffers, which defines message schemas for use by Remote Procedure Calls (gRPC). These schemas can evolve over time and, for example, "Modifying gRPC services over time" suggests some best practices in this context.

JSKenyon commented 6 months ago

I am obviously on board for this but I think we could be even more ambitious/user-friendly. I think I have mentioned my reservations about coupling image versions to the cult-cargo version. Specifically, this may eventually make installation and reproducibility very difficult. I think that @o-smirnov's points above are part of the solution but I think that the real change needs to happen in cult-cargo.

Specifically, I think that we should consider turning cult-cargo into a queryable database of images and schemas i.e. all schemas and images persist in this database. This would make it possible to completely decouple image versioning from package versioning and simultaneously (partially) solves the schema problem as every image could be associated with a schema. This also means that there is no need to expose the schema parameter to the average user, which will spare us some headaches. Obviously, this interface should support pulling in schemas at runtime i.e. as part of validation.

> • If we really want to be user friendly, we don't just delete a deactivated parameter from the schema, we leave a stub entry so that stimela can tell the user they've specified a parameter from a wrong version of the cab.

I think that it is completely acceptable to bail out if a user attempts to use a parameter which is not part of the associated schema, and being too clever on this point may lead to future pain. We can easily minimize schema duplication as having multiple images share a schema would not be difficult. As an aside, this also means that new images could be added without requiring a package release of cult-cargo which in turn alleviates developer burden and makes using dev/branch images much simpler.
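As a rough sketch of the idea (not a design), the "queryable database" could be as little as two lookup tables: one from image to schema id, one from schema id to accepted parameters. The schema ids and parameter names below are made up, the image tags are the hypothetical ones used elsewhere in this thread, and a real version would presumably live behind some service or file store rather than in module-level dicts.

```python
from typing import Dict, Iterable, Set

# Hypothetical schema store: schema id -> accepted parameter names.
SCHEMAS: Dict[str, Set[str]] = {
    "quartical-schema-a": {"input_ms", "output_dir"},
    "quartical-schema-b": {"input_ms", "output_dir", "extra_option"},
}

# Hypothetical image index: several images may point at the same schema,
# which is how duplication is avoided.
IMAGE_TO_SCHEMA: Dict[str, str] = {
    "quartical:1.0.0": "quartical-schema-a",
    "quartical:1.0.1": "quartical-schema-a",
    "quartical:1.1.0": "quartical-schema-b",
}


def schema_for(image: str) -> Set[str]:
    """Look up the parameter schema associated with an image."""
    try:
        return SCHEMAS[IMAGE_TO_SCHEMA[image]]
    except KeyError:
        raise LookupError(f"no schema registered for image {image!r}") from None


def validate(image: str, params: Iterable[str]) -> None:
    """Bail out if any parameter is not part of the image's schema."""
    unknown = sorted(set(params) - schema_for(image))
    if unknown:
        raise ValueError(f"{image}: unknown parameter(s) {', '.join(unknown)}")
```

Here `validate("quartical:1.1.0", ["extra_option"])` passes, while the same call against `quartical:1.0.0` raises. Adding a new image is then a one-line change to the index, with no package release involved.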

In order to fully support the above, we would need to make versioning in stimela recipes less opaque i.e. at present there is an implication that using an image is the latest version in cult-cargo (when using cult-cargo images). This is great for user-friendliness but my personal opinion is that this does not serve the end goal of reproducibility. To address this, we could consider a stimela publish command that does the following:

There are some other things we could consider (if we decide to be very ambitious):

I am going to stop here as this is getting muddled. I could also try and parse this into separate ideas and put them on cult-cargo if necessary.

sjperkins commented 6 months ago

I'd imagine at some point that dependency resolution may become a necessity (similar to pip). There's an outdated Python package called mixology which appears to handle dependency resolution for the concept of a generic package (i.e. not specific to a Python package on pypi). It's based on the pubgrub algorithm, which seems to be the current state of the art.

o-smirnov commented 6 months ago

I'm a bit reticent about adding top-heavy structures... the current scheme is simple, and relies on standard repositories (PyPI and quay.io) where all versions persist. It's also easy to replicate for somebody who wants to maintain their own cult-cargo-like collection. I also think all information required for full reproducibility is already in there, unless I'm missing something. Let me try to address some points.

> (partially) solves the schema problem as every image could be associated with a schema.

Well it's already being done in the reverse sense -- each cult-cargo cab in the release already has a specific cab: image: version entry. And recipes don't directly deal with images -- they specify cab definitions -- so I don't think an explicit image->cab link is necessary.

> there is no need to expose the schema parameter to the average user,

Which schema parameter do you mean? The average user just works with an overall cult-cargo release version, which, in turn, implies frozen versions of all constituent packages under the hood (where the average user need not look).

> less opaque i.e. at present there is an implication that using an image is the latest version in cult-cargo (when using cult-cargo images)

I don't think this is the implication, but I also think we mean different things by "latest", are you thinking of it as a mutable, continuously updating version? There is no such thing once a given release of cult-cargo is out. There is simply a default image version (which does have a specific well-defined number). It is "latest" only in the sense of "latest at time of this specific cult-cargo release". Once a cult-cargo release is out, the associated images don't change anymore.

The only time we change images is during a cult-cargo prerelease process. I.e. 0.1.3 is the next version -- I'll keep pushing new images with that tag until 0.1.3 is released. The cult-cargo build script already has protections for this, it will refuse to push images for a known release.
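For readers who have not seen the build script, the protection amounts to something like the following sketch. This is not the actual cult-cargo code: the function name is invented, and how the script learns which versions are already released is not described in this thread, so the set is simply passed in.

```python
from typing import Iterable


def check_push_allowed(version: str, released: Iterable[str]) -> None:
    """Raise if `version` is already released, i.e. its images are frozen."""
    if version in set(released):
        raise RuntimeError(
            f"refusing to push images for {version}: this release is already out"
        )


# Assumed state: 0.1.2 has been released, 0.1.3 is still in prerelease.
released_versions = {"0.1.2"}
check_push_allowed("0.1.3", released_versions)  # allowed, returns silently
```

Calling `check_push_allowed("0.1.2", released_versions)` would raise, which is what keeps the images of a published release immutable.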

> I think that it is completely acceptable to bail out if a user attempts to use a parameter which is not part of the associated schema, and being too clever on this point may lead to future pain.

Agreed. I was merely suggesting a friendlier message when bailing out ("unsupported parameter because you have version blah" as opposed to just "unknown parameter").
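A small sketch of the difference, assuming the stub-entry approach: a removed parameter stays in the schema with a note instead of being deleted. The schema layout, the parameter names and the helper below are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical schema for one cab version. None marks a live parameter;
# a string is a stub note left behind for a parameter that no longer applies.
SCHEMA: Dict[str, Optional[str]] = {
    "kept_option": None,
    "dropped_option": "removed in version 3.1",
}


def check_parameter(
    name: str, schema: Dict[str, Optional[str]], cab_version: str
) -> Optional[str]:
    """Return an error message, or None if the parameter is accepted."""
    if name not in schema:
        return f"unknown parameter '{name}'"
    note = schema[name]
    if note is not None:
        return (
            f"unsupported parameter '{name}' ({note}); "
            f"you have version {cab_version}"
        )
    return None
```

Either way the run bails out; the stub only changes which of the two messages the user sees.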

> As an aside, this also means that new images could be added without requiring a package release of cult-cargo

This is already the case somewhat. As soon as we push 0.1.3pre1, we are free to push and push 0.1.3 images, until we hit the release button on 0.1.3 proper (see above). Also, a dev version of something doesn't even need to use cult-cargo. I could push breifast images to my own personal repo and keep shipping dev cabs pointing to them, all the way until breifast makes it into cult-cargo.

> could consider a stimela publish command that does the following:

Good idea (and touches on the certifiable workflows discussion), so let's break this out into a separate issue/discussion.

> Partial support for github-based steps - not a fully formed idea as yet

I was thinking of something similar in #115 (in a pure venv context), but indeed this could also be done with images.

JSKenyon commented 6 months ago

> I'm a bit reticent about adding top-heavy structures... the current scheme is simple, and relies on standard repositories (PyPI and quay.io) where all versions persist. It's also easy to replicate for somebody who wants to maintain their own cult-cargo-like collection. I also think all information required for full reproducibility is already in there, unless I'm missing something. Let me try to address some points.

I don't think abandoning either of PyPI or quay.io would be required. I think that having a layer on top of them that maps schema (in the sense of cab definitions i.e. the parameters the cab accepts) may just be helpful in the long run.

> Well it's already being done in the reverse sense -- each cult-cargo cab in the release already has a specific cab: image: version entry. And recipes don't directly deal with images -- they specify cab definitions -- so I don't think an explicit image->cab link is necessary.

Agreed, although I really do stand by my opinion that using cult-cargo as version control is going to bite us. It means we are vulnerable to changes in Python versions and may place limits on how long after the fact a result remains reproducible. The hypothetical scenario in my head is as follows: I run a recipe today with cult-cargo==1.0.0 on Python 3.9.18. In five years (yes, this is quite a long time but bear with me), someone else wants to reproduce that result on Python 3.14.8. There was no cult-cargo release for that version of Python and let us assume for the sake of argument that the user cannot install an older Python (which may be true). There is now no easy way to reproduce the result, despite all the images still being available.

> Which schema parameter do you mean? The average user just works with an overall cult-cargo release version, which, in turn, implies frozen versions of all constituent packages under the hood (where the average user need not look).

I think the use of schema confused this point - apologies. What I mean is that, for packages included in cult-cargo, if we have this extra layer which knows which cabs (i.e. which parameter schemas) map to which version, the user need never worry about this. This achieves the same result as the current approach, but without requiring a specific version of cult-cargo in order for the cab definition to be correct.

> I don't think this is the implication, but I also think we mean different things by "latest", are you thinking of it as a mutable, continuously updating version? There is no such thing once a given release of cult-cargo is out. There is simply a default image version (which does have a specific well-defined number). It is "latest" only in the sense of "latest at time of this specific cult-cargo release". Once a cult-cargo release is out, the associated images don't change anymore.

I understand this point but I maintain that this is opaque. It means that if a user were to read the recipe, there would be absolutely no way of knowing which versions were in use without either checking other files (requiring more expert knowledge) or installing a specific version of cult-cargo, which brings us back to my earlier point.

> The only time we change images is during a cult-cargo prerelease process. I.e. 0.1.3 is the next version -- I'll keep pushing new images with that tag until 0.1.3 is released. The cult-cargo build script already has protections for this, it will refuse to push images for a known release.

Ok, this is something I hadn't thought about. That is fair enough. I will point out that in the current model, cult-cargo may end up releasing many versions very quickly if the goal is to make upstream packages available rapidly (which I think is the case).

> This is already the case somewhat. As soon as we push 0.1.3pre1, we are free to push and push 0.1.3 images, until we hit the release button on 0.1.3 proper (see above). Also, a dev version of something doesn't even need to use cult-cargo. I could push breifast images to my own personal repo and keep shipping dev cabs pointing to them, all the way until breifast makes it into cult-cargo.

On the point about private repos, absolutely. I have done so too. What I meant by this point is that we could push a hypothetical quartical:1.0.0, quartical:1.0.1 and quartical:1.1.0 all without ever requiring a cult-cargo release if we used my proposed (completely theoretical at this point) approach. I think that this could eventually spare us pain and potentially speed up the process for getting new versions into the hands of users.

Finally, just to reiterate, if the goal is reproducibility, I sincerely believe we have to decouple the "runtime" requirements i.e. the cab definitions and images, from the Python code/packaging infrastructure.

o-smirnov commented 6 months ago

> There was no cult-cargo release for that version of Python and let us assume for the sake of argument that the user cannot install an older Python (which may be true).

Fair point. This is where the PyPI model breaks. Still, I like the simplicity of it for now, so maybe we can muddle our way forward to a more structured scheme while we retain backwards compatibility?

> It means that if a user were to read the recipe, there would be absolutely no way of knowing which versions were in use

Arguably this is a good thing. The top-level recipe should not be burdened by details, it's more readable that way. For those that want to get into the versioning weeds, there is the stimela publish idea. So if you (literally) publish a result, you provide the recipe as a top-level recipe, and the stimela publish outputs as supplementary material (which is required to reproduce).

> cult-cargo may end up releasing many versions very quickly if the goal is to make upstream packages available rapidly (which I think is the case).

The model I followed for 0.1.2 was multiple 0.1.2preX releases of cult-cargo, while the images themselves were versioned 0.1.2 and were being updated. Do you think this works going forward? Bleeding edge people can use the pre-releases and/or track cult-cargo master. At some point we make a proper release, images get frozen, and another pre-release cycle starts.

JSKenyon commented 6 months ago

> Fair point. This is where the PyPI model breaks. Still, I like the simplicity of it for now, so maybe we can muddle our way forward to a more structured scheme while we retain backwards compatibility?

Yeah - any changes weren't going to be short term regardless. Just something to keep at the back of our heads.

> Arguably this is a good thing. The top-level recipe should not be burdened by details, it's more readable that way. For those that want to get into the versioning weeds, there is the stimela publish idea. So if you (literally) publish a result, you provide the recipe as a top-level recipe, and the stimela publish outputs as supplementary material (which is required to reproduce).

Agreed. So the policy is that versions float with the cult-cargo version until such time as you freeze them in with publish.
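To illustrate that policy, and only as a sketch, freezing could boil down to resolving every cab a recipe uses against the image versions implied by the installed release and writing the result out. What the publish command would actually do is not spelled out in this thread; the function name, the data layout and the example versions below are assumptions.

```python
from typing import Dict, List


def freeze_versions(
    recipe_cabs: List[str], release_images: Dict[str, str]
) -> Dict[str, str]:
    """Pin each cab used by a recipe to the image version of the current release."""
    frozen: Dict[str, str] = {}
    for cab in recipe_cabs:
        if cab not in release_images:
            raise KeyError(f"cab {cab!r} has no image in this release")
        frozen[cab] = release_images[cab]
    return frozen


# Made-up example: image versions implied by one installed cult-cargo release.
release_images = {"wsclean": "0.1.2", "quartical": "0.1.2", "breifast": "0.1.2"}
recipe_cabs = ["wsclean", "quartical", "wsclean"]  # cabs in step order
pinned = freeze_versions(recipe_cabs, release_images)
```

The resulting `pinned` mapping is the kind of supplementary material that could accompany the top-level recipe, leaving the recipe itself free of version details.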

> The model I followed for 0.1.2 was multiple 0.1.2preX releases of cult-cargo, while the images themselves were versioned 0.1.2 and were being updated. Do you think this works going forward? Bleeding edge people can use the pre-releases and/or track cult-cargo master. At some point we make a proper release, images get frozen, and another pre-release cycle starts.

Ok, that works.