ahnsn closed 3 months ago
If this is about getting the latest version of a package automatically, I think you should consider pinning to exact versions (or at least using >=), since you clearly don't want just any numpy, but a specific version of the package.
So instead of
numpy
you could use
numpy>=2.0.0
if that's what you really wanted.
Racetrack jobs strive to be immutable and reproducible, so that you can always recreate the same state, no matter when the job is built. With that in mind, clearing cached images should never be necessary.
So I think docker build just checks the hash of requirements.txt when deciding which layers come from the cache. Rather than mess with how these checks are done, I think either pinning the versions (as Ireneusz suggests above) or using --no-cache would solve the above.
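To illustrate why an unchanged requirements.txt keeps hitting the cache, here is a toy model in Python. It is a simplified sketch, not Docker's actual implementation: the function name `layer_cache_key` and the key layout are made up for illustration. The only point it makes is that the key depends on the file's bytes, not on what the package index would serve today.

```python
import hashlib


def layer_cache_key(parent_key: str, instruction: str, file_contents: bytes) -> str:
    """Toy cache key (hypothetical): parent layer + instruction text + copied file bytes."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(file_contents)
    return digest.hexdigest()


# The same unpinned requirements.txt, built on two different days.
unpinned = b"numpy\n"
key_first_build = layer_cache_key("base", "COPY requirements.txt .", unpinned)
key_later_build = layer_cache_key("base", "COPY requirements.txt .", unpinned)

# Editing the file (for example, adding a lower bound) changes the key.
key_with_bound = layer_cache_key("base", "COPY requirements.txt .", b"numpy>=2.0.0\n")

print(key_first_build == key_later_build)  # same bytes, so the cached layer is reused
print(key_first_build == key_with_bound)   # different bytes, so the layer is rebuilt
```

Under this model, a new numpy release upstream changes nothing in the key, which is why either editing the file or bypassing the cache is needed to pick it up.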
I'll get to work on implementing a --no-cache flag for the racetrack CLI.
Whatever the case is, I'd disagree with letting this be --force behaviour. As Irek says, racetrack is reproducible on purpose. I don't think --force should be capable of breaking this. Adding a new flag, --delete-cache or similar (since, I assume, the cache would be lost?), would be an agreeable solution if pinning versions isn't good enough, though. That way the option communicates that you are in fact throwing reproducibility away.
I was a bit quick on the keys: --no-cache of course just ignores the cache, so you'd still have a presumably reproducible build cached. That's fine then. Still, the optimal solution would be to not update jobs in non-reproducible ways; part of me doesn't like implementing spades for users to dig graves with.
In fact, I think the --force flag has nothing to do with it, because even without this flag the docker engine may still make use of cached layers.
Should we let users pass arbitrary flags to docker build? Or is that a footgun? A shovel? Some other metaphor for a bad idea? Do we want to enable a limited set of handpicked flags?
Arbitrary flags, I think that's too much. I'm fine with the --no-cache flag though, as it can't do much harm.
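The "limited set of handpicked flags" idea could look roughly like the sketch below. This is not racetrack code: `ALLOWED_BUILD_FLAGS` and `docker_build_command` are hypothetical names, and the function only assembles the argument list rather than running anything. The only real CLI pieces assumed are `docker build`, `-t` and `--no-cache`.

```python
# Hypothetical allow-list: only flags judged harmless are passed through.
ALLOWED_BUILD_FLAGS = {"--no-cache"}


def docker_build_command(context_dir, tag, extra_flags=()):
    """Return the argv for `docker build`, rejecting flags outside the allow-list."""
    rejected = [flag for flag in extra_flags if flag not in ALLOWED_BUILD_FLAGS]
    if rejected:
        raise ValueError(f"unsupported build flags: {rejected}")
    return ["docker", "build", *extra_flags, "-t", tag, context_dir]


# Example: a rebuild that ignores cached layers.
command = docker_build_command(".", "my-job:latest", ["--no-cache"])
print(command)
```

Keeping the list explicit means each new flag is a deliberate decision instead of something users can slip in.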
I think --no-cache is the least bad idea if this is an issue that we want to fix. I think the heavy lifting is deciding whether or not we want to support unreproducible, non-idempotent builds.
The use case represented here is "I want packages to have a specific version, but I don't want to know which", which is why I'm calling --no-cache a spade. If you don't care about the versions, don't specify them; you're probably fine. If you do care, then not specifying them is probably not the best practice.
I mean, I assume you want the newest versions for some actual reason? So you should be specifying at least numpy>=2.0.0 or whatever, as previously mentioned, not just praying to the god of the machine that he gives you a version that you like.
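If you want to catch unconstrained entries before they bite, a crude check like the following might help. It is only a sketch with a hypothetical helper name, `unconstrained_requirements`: it uses a simple regex instead of a real requirements parser, and it skips option lines and ignores environment markers, so treat it as a rough lint rather than a validator.

```python
import re

# Version comparison operators used in requirement specifiers.
_SPECIFIER = re.compile(r"(===|==|~=|!=|>=|<=|>|<)")


def unconstrained_requirements(text):
    """Return requirement lines that carry no version specifier at all."""
    result = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]          # drop comments
        line = line.split(";", 1)[0].strip() # drop environment markers
        if not line or line.startswith("-"): # skip blanks and option lines like -r
            continue
        if not _SPECIFIER.search(line):
            result.append(line)
    return result


sample = "numpy\npandas>=2.0.0\n# a comment\nrequests==2.31.0\n"
print(unconstrained_requirements(sample))
```

On the sample above, only the bare numpy line is reported; the package names other than numpy are just placeholders for the example.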
When redeploying a model with the --force flag, racetrack seems to use cached images from the job if the job's code doesn't seem to have changed. This happens, for example, when underlying code that the job imports has changed.