o-smirnov opened this issue 5 months ago
I'll take a closer look at this during the week but, for the moment, it's probably worth mentioning that Schema Evolution is an entire topic in its own right: there's probably a good deal of prior knowledge that can be drawn on.
One example that springs to mind is Google Protocol Buffers, which define message schemas for use by Remote Procedure Calls (gRPC). These can evolve over time and, for example, "Modifying gRPC services over time" suggests some best practices in this context.
I am obviously on board for this but I think we could be even more ambitious/user-friendly. I think I have mentioned my reservations about coupling image versions to the cult-cargo version. Specifically, this may eventually make installation and reproducibility very difficult. I think that @o-smirnov's points above are part of the solution, but I think that the real change needs to happen in cult-cargo.

Specifically, I think that we should consider turning cult-cargo into a queryable database of images and schemas, i.e. all schemas and images persist in this database. This would make it possible to completely decouple image versioning from package versioning and simultaneously (partially) solves the schema problem as every image could be associated with a schema. This also means that there is no need to expose the schema parameter to the average user, which will spare us some headaches. Obviously, this interface should support pulling in schemas at runtime, i.e. as part of validation.
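To make the runtime-schema idea a bit more concrete, here is a very rough Python sketch. Everything in it is hypothetical -- the registry URL, the endpoint layout and the query_cab_schema helper don't exist -- the point is only that the schema lookup would happen at validation time, keyed on the image and tag, rather than being baked into a cult-cargo package release:

```python
# Hypothetical sketch only: the registry URL and endpoint layout are made up.
# The idea: fetch the cab schema at validation time, keyed on image name + tag,
# so that image versioning is decoupled from any cult-cargo package release.
import json
import urllib.request

REGISTRY = "https://example.org/cargo-registry"  # placeholder, not a real service

def query_cab_schema(image: str, tag: str) -> dict:
    """Fetch the parameter schema associated with a specific image version."""
    url = f"{REGISTRY}/schemas/{image}/{tag}.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# During validation, stimela would (hypothetically) do something like:
#   schema = query_cab_schema("quartical", "0.2.1")
# and validate the step's parameters against that schema instead of a cab
# definition pinned by the installed cult-cargo version.
```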
- If we really want to be user friendly, we don't just delete a deactivated parameter from the schema, we leave a stub entry so that stimela can tell the user they've specified a parameter from a wrong version of the cab.
I think that it is completely acceptable to bail out if a user attempts to use a parameter which is not part of the associated schema, and being too clever on this point may lead to future pain. We can easily minimize schema duplication, as having multiple images share a schema would not be difficult. As an aside, this also means that new images could be added without requiring a package release of cult-cargo, which in turn alleviates developer burden and makes using dev/branch images much simpler.
In order to fully support the above, we would need to make versioning in stimela recipes less opaque, i.e. at present there is an implication that the image being used is the latest version in cult-cargo (when using cult-cargo images). This is great for user-friendliness but my personal opinion is that this does not serve the end goal of reproducibility. To address this, we could consider a stimela publish command that does the following:
- Freezes the current cult-cargo version into a .recipe.yaml and a flat cabs.yaml.
- pip freezes any virtual environments into a venv directory (see the sketch after this list).
- Adds info fields to any step requiring externally defined Python (the breifast case) to make it very clear that these are not images/where that code should live.
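For the venv-freezing step, I imagine something as simple as the following sketch (assuming we know the interpreter of each managed virtual environment; the mapping and output layout are made up, but pip freeze itself is standard):

```python
# Sketch of the venv-freezing step. The mapping of venv names to interpreters
# and the output layout are assumptions; "pip freeze" is the standard mechanism.
import subprocess
from pathlib import Path

def freeze_venvs(venv_pythons: dict[str, str], outdir: Path = Path("venv")) -> None:
    """Write a requirements-style freeze file for each virtual environment."""
    outdir.mkdir(exist_ok=True)
    for name, python in venv_pythons.items():
        frozen = subprocess.run(
            [python, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=True,
        ).stdout
        (outdir / f"{name}.requirements.txt").write_text(frozen)

# e.g. freeze_venvs({"breifast": "/path/to/breifast-venv/bin/python"})
```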
There are some other things we could consider (if we decide to be very ambitious):

- Something like step: recipe: cookbook.cc/breifast:version, and that recipe would internally have explicitly defined image versions/schemas associated with it.
- Partial support for github-based steps - not a fully formed idea as yet - e.g. cabs: name: package: https://github.com/ratt-ru/QuartiCal.git, where this implies installing said package in the python-astro container. This would be powerful for testing/development.

I am going to stop here as this is getting muddled. I could also try to parse this into separate ideas and put them on cult-cargo if necessary.
I'd imagine at some point that dependency resolution may become a necessity (similar to pip). There's an outdated Python package called mixology which appears to handle dependency resolution for generic packages (i.e. not specific to Python packages on PyPI). It's based on the PubGrub algorithm, which seems to be the current state of the art.
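To make "generic" concrete, here's a deliberately naive brute-force sketch of what such a resolver has to do: given available versions per package and per-version constraints, find an assignment that satisfies everything. The toy index below is made up; PubGrub/mixology exist precisely because brute force doesn't scale:

```python
# Naive brute-force resolver, purely to illustrate the problem shape.
# Real resolvers (PubGrub, mixology) are far smarter about search and conflicts.
from itertools import product
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Toy index: package -> {version: {dependency: specifier}} (made-up data).
index = {
    "quartical": {"1.0.0": {"cult-cargo": ">=0.1.2"}, "1.1.0": {"cult-cargo": ">=0.1.3"}},
    "cult-cargo": {"0.1.2": {}, "0.1.3": {}},
}

def resolve(index):
    """Return the first assignment of versions that satisfies all constraints."""
    names = list(index)
    for combo in product(*(index[name] for name in names)):
        picked = dict(zip(names, combo))
        if all(
            Version(picked[dep]) in SpecifierSet(spec)
            for name, version in picked.items()
            for dep, spec in index[name][version].items()
        ):
            return picked
    return None

print(resolve(index))  # e.g. {'quartical': '1.0.0', 'cult-cargo': '0.1.2'}
```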
I'm a bit reticent about adding top-heavy structures... the current scheme is simple, and relies on standard repositories (PyPI and quay.io) where all versions persist. It's also easy to replicate for somebody who wants to maintain their own cult-cargo-like collection. I also think all information required for full reproducibility is already in there, unless I'm missing something. Let me try to address some points.
(partially) solves the schema problem as every image could be associated with a schema.
Well it's already being done in the reverse sense -- each cult-cargo cab in the release already has a specific cab: image: version entry. And recipes don't directly deal with images -- they specify cab definitions -- so I don't think an explicit image->cab link is necessary.
there is no need to expose the schema parameter to the average user,
Which schema parameter do you mean? The average user just works with an overall cult-cargo release version, which, in turn, implies frozen versions of all constituent packages under the hood (where the average user need not look).
less opaque, i.e. at present there is an implication that the image being used is the latest version in cult-cargo (when using cult-cargo images)
I don't think this is the implication, but I also think we mean different things by "latest", are you thinking of it as a mutable, continuously updating version? There is no such thing once a given release of cult-cargo is out. There is simply a default image version (which does have a specific well-defined number). It is "latest" only in the sense of "latest at time of this specific cult-cargo release". Once a cult-cargo release is out, the associated images don't change anymore.
The only time we change images is during a cult-cargo prerelease process. I.e. 0.1.3 is the next version -- I'll keep pushing new images with that tag until 0.1.3 is released. The cult-cargo build script already has protections for this, it will refuse to push images for a known release.
I think that it is completely acceptable to bail out if a user attempts to use a parameter which is not part of the associated schema, and being too clever on this point may lead to future pain.
Agreed. I was merely suggesting a friendlier message when bailing out ("unsupported parameter because you have version blah" as opposed to just "unknown parameter").
As an aside, this also means that new images could be added without requiring a package release of cult-cargo
This is already the case somewhat. As soon as we push 0.1.3pre1, we are free to push and push 0.1.3 images, until we hit the release button on 0.1.3 proper (see above). Also, a dev version of something doesn't even need to use cult-cargo. I could push breifast images to my own personal repo and keep shipping dev cabs pointing to them, all the way until breifast makes it into cult-cargo.
could consider a stimela publish command that does the following:
Good idea (and touches on the certifiable workflows discussion), so let's break this out into a separate issue/discussion.
Partial support for github-based steps - not a fully formed idea as yet
I was thinking of something similar in #115 (in a pure venv context), but indeed this could also be done with images.
I'm a bit reticent about adding top-heavy structures... the current scheme is simple, and relies on standard repositories (PyPI and quay.io) where all versions persist. It's also easy to replicate for somebody who wants to maintain their own cult-cargo-like collection. I also think all information required for full reproducibility is already in there, unless I'm missing something. Let me try to address some points.
I don't think abandoning either of PyPI or quay.io would be required. I think that having a layer on top of them that maps schemas (in the sense of cab definitions, i.e. the parameters the cab accepts) may just be helpful in the long run.
Well it's already being done in the reverse sense -- each cult-cargo cab in the release already has a specific cab: image: version entry. And recipes don't directly deal with images -- they specify cab definitions -- so I don't think an explicit image->cab link is necessary.
Agreed, although I really do stand by my opinion that using cult-cargo as version control is going to bite us. It means we are vulnerable to changes in Python versions and may place limits on how long after the fact a result remains reproducible. The hypothetical scenario in my head is as follows: I run a recipe today with cult-cargo==1.0.0 on Python 3.9.18. In five years (yes, this is quite a long time but bear with me), someone else wants to reproduce that result on Python 3.14.8. There was no cult-cargo release for that version of Python and let us assume for the sake of argument that the user cannot install an older Python (which may be true). There is now no easy way to reproduce the result, despite all the images still being available.
Which schema parameter do you mean? The average user just works with an overall cult-cargo release version, which, in turn, implies frozen versions of all constituent packages under the hood (where the average user need not look).
I think the use of schema confused this point - apologies. What I mean is that, for packages included in cult-cargo, if we have this extra layer which knows which cabs (i.e. which parameter schemas) map to which version, the user need never worry about this. This achieves the same result as the current approach, but without requiring a specific version of cult-cargo in order for the cab definition to be correct.
I don't think this is the implication, but I also think we mean different things by "latest", are you thinking of it as a mutable, continuously updating version? There is no such thing once a given release of cult-cargo is out. There is simply a default image version (which does have a specific well-defined number). It is "latest" only in the sense of "latest at time of this specific cult-cargo release". Once a cult-cargo release is out, the associated images don't change anymore.
I understand this point but I maintain that this is opaque. It means that if a user were to read the recipe, there would be absolutely no way of knowing which versions were in use without either checking other files (requiring more expert knowledge) or installing a specific version of cult-cargo, which brings us back to my earlier point.
The only time we change images is during a cult-cargo prerelease process. I.e. 0.1.3 is the next version -- I'll keep pushing new images with that tag until 0.1.3 is released. The cult-cargo build script already has protections for this, it will refuse to push images for a known release.
Ok, this is something I hadn't thought about. That is fair enough. I will point out that, in the current model, cult-cargo may end up releasing many versions very quickly if the goal is to make upstream packages available rapidly (which I think is the case).
This is already the case somewhat. As soon as we push 0.1.3pre1, we are free to push and push 0.1.3 images, until we hit the release button on 0.1.3 proper (see above). Also, a dev version of something doesn't even need to use cult-cargo. I could push breifast images to my own personal repo and keep shipping dev cabs pointing to them, all the way until breifast makes it into cult-cargo.
On the point about private repos, absolutely. I have done so too. What I meant by this point is that we could push a hypothetical quartical:1.0.0, quartical:1.0.1 and quartical:1.1.0 all without ever requiring a cult-cargo release if we used my proposed (completely theoretical at this point) approach. I think that this could eventually spare us pain and potentially speed up the process for getting new versions into the hands of users.
Finally, just to reiterate, if the goal is reproducibility, I sincerely believe we have to decouple the "runtime" requirements, i.e. the cab definitions and images, from the Python code/packaging infrastructure.
There was no cult-cargo release for that version of Python and let us assume for the sake of argument that the user cannot install an older Python (which may be true).
Fair point. This is where the PyPI model breaks. Still, I like the simplicity of it for now, so maybe we can muddle our way forward to a more structured scheme while we retain backwards compatibility?
It means that if a user were to read the recipe, there would be absolutely no way of knowing which versions were in use
Arguably this is a good thing. The top-level recipe should not be burdened by details, it's more readable that way. For those that want to get into the versioning weeds, there is the stimela publish idea. So if you (literally) publish a result, you provide the recipe as a top-level recipe, and the stimela publish outputs as supplementary material (which is required to reproduce).
cult-cargo may end up releasing many versions very quickly if the goal is to make upstream packages available rapidly (which I think is the case).
The model I followed for 0.1.2 was multiple 0.1.2preX releases of cult-cargo, while the images themselves were versioned 0.1.2 and were being updated. Do you think this works going forward? Bleeding edge people can use the pre-releases and/or track cult-cargo master. At some point we make a proper release, images get frozen, and another pre-release cycle starts.
Fair point. This is where the PyPI model breaks. Still, I like the simplicity of it for now, so maybe we can muddle our way forward to a more structured scheme while we retain backwards compatibility?
Yeah - any changes weren't going to be short term regardless. Just something to keep at the back of our heads.
Arguably this is a good thing. The top-level recipe should not be burdened by details, it's more readable that way. For those that want to get into the versioning weeds, there is the stimela publish idea. So if you (literally) publish a result, you provide the recipe as a top-level recipe, and the stimela publish outputs as supplementary material (which is required to reproduce).
Agreed. So the policy is that versions float with the cult-cargo version until such time as you freeze them in with publish.
The model I followed for 0.1.2 was multiple 0.1.2preX releases of cult-cargo, while the images themselves were versioned 0.1.2 and were being updated. Do you think this works going forward? Bleeding edge people can use the pre-releases and/or track cult-cargo master. At some point we make a proper release, images get frozen, and another pre-release cycle starts.
Ok, that works.
Related to discussions with @JSKenyon... how do we deal with evolving CLIs? Multiple versions of, say, wsclean are already supported via image: version, but there's no way to tell Stimela that a particular parameter is only available with e.g. version 3.1 and up (or, conversely, has been deprecated).

Proposal:

- Cabs shalt have an optional version attribute, populated from cab: image: version if not set.
- Parameter schemas shalt have an optional versions attribute, specified PyPI style, e.g. versions: >=3.1.
- Inputs/outputs shalt be (de)activated by comparing their version string to the cab version, if both are specified. There are standard libraries for version parsing (see the sketch after this list).
- If we really want to be user friendly, we don't just delete a deactivated parameter from the schema, we leave a stub entry so that stimela can tell the user they've specified a parameter from a wrong version of the cab.
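The comparison itself is cheap with the standard packaging library; here's a minimal sketch of the (de)activation check described above (the function and argument names are illustrative, not an existing stimela API):

```python
# Minimal sketch of the proposed (de)activation check, using the standard
# "packaging" library. The function and argument names are illustrative only.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

def parameter_active(cab_version, param_versions):
    """A parameter stays active unless both versions are given and they conflict."""
    if cab_version is None or param_versions is None:
        return True  # nothing to compare against, so keep the parameter active
    return Version(cab_version) in SpecifierSet(param_versions)

print(parameter_active("3.2", ">=3.1"))   # True  -> parameter available
print(parameter_active("2.10", ">=3.1"))  # False -> deactivate, but keep a stub
```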
Possible more advanced features:

- Support version specification in the step itself, e.g. cab: wsclean>=3.1. In the first instance, this at least allows the recipe to error out in prevalidation if the wrong version of the cab is defined (see the sketch after this list).
- This opens the door to having multiply versioned cab definitions in e.g. cult-cargo, with stimela being able to resolve which one to use if the recipe specifies a particular dependency. Those would have to live under a separately structured versioned_cabs (or something like that) top-level section, lest we break existing recipes which use cabs.
- This is easily extended to supporting and checking optional recipe versions, e.g. recipe: tron>=0.1.
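For the step-level form, the same packaging library already handles the parsing; a sketch (the spec string and candidate versions are made up, and none of this is existing stimela behaviour):

```python
# Sketch of resolving a versioned cab request like "wsclean>=3.1" during
# prevalidation. The available versions are made-up illustration data.
from packaging.requirements import Requirement
from packaging.version import Version

step_cab = "wsclean>=3.1"                 # what the step asks for
available = ["3.0.1", "3.1.2", "3.3"]     # versions a versioned_cabs section might offer

req = Requirement(step_cab)
candidates = sorted((v for v in available if Version(v) in req.specifier), key=Version)
if not candidates:
    raise ValueError(f"no cab definition satisfies {step_cab}")  # fail in prevalidation
print(req.name, "->", candidates[-1])     # wsclean -> 3.3
```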
Thoughts @sjperkins @SpheMakh @landmanbester?