use k8s cluster default imagePullPolicy

arno01 commented 2 years ago

Reproducer

build docker image "test:123" which would just "echo 123" and "docker push" it to the registry;
create the deployment (send-manifest);
get the logs from the provider (provider lease-logs);
close the deployment
repeat first 4 steps except that do some modification to the image in step 1, say "echo 555", but keep the same image tag "test:123".

Update: this is mostly about the image pull policy for the images with the :latest and untagged images as seen from below discussion. Other tags, such as test:123 should stay test:123 1:1 (immutable).

Expected behavior:

Provider should pull the new image if it has :latest tag or untagged.

Actual behavior:

The provider will not pull the new image, it will start the old image.

Workaround:

There is no workaround to a changed default K8s behavior.

Provider in question:

I haven't tested different providers.

"equinix-metal-ams1","akash","mn2-ng","https://provider.provider-0.prod.ams1.akash.pub:8443","akash14c4ng96vdle6tae8r4hc2w4ujwrsh3x9tuudk0"

arno01 commented 2 years ago

cc @dmikey

boz commented 2 years ago

@arno01, this is expected behavior. You need to use a different tag to deploy updates. Images are heavily cached, pushing changes to a single tag will cause both versions to be running very easily, even if we forced a pull every time.

arno01 commented 2 years ago

@arno01, this is expected behavior. You need to use a different tag to deploy updates. Images are heavily cached, pushing changes to a single tag will cause both versions to be running very easily, even if we forced a pull every time.

@boz this should be mentioned in the Akash Docs then, since I've already seen two cases where people would not figure why their image is not working at one provider but does at the other one. 1 2

And then this breaks the default Kubernetes behavior for the :latest image tag, since it should always be re-pulling images of this tag:

omit the imagePullPolicy and use :latest as the tag for the image to use; Kubernetes will set the policy to Always.

Source: https://kubernetes.io/docs/concepts/containers/images/#updating-images

I would actually be more inclined to let people specify in their deployment manifests the imagePullPolicy in order to get more flexible about this.

And then, most of people are going to be using same tags, e.g. nginx:stable, nginx:mainline, nginx:latest, nginx:1.21 (can replace nginx practically with any image). The issue I see with that (the providers are heavily caching the images) is that the security issues in the libraries of the images (nor the apps themselves in some cases, i.e. nginx:stable, nginx:1, nginx:latest) will not get pushed as people re-deploy their deployments should they get alerted by CVE reports.

boz commented 2 years ago

it is best practice to use a version tag. we can add that to the docs.

boz commented 2 years ago

It says as much right in your source

Caution: You should avoid using the latest tag when deploying containers in production, as it is harder to track which version of the image is running and more difficult to roll back to a working version.

Instead, specify a meaningful tag such as v1.42.0.

arno01 commented 2 years ago

@boz

It says as much right in your source

Caution: You should avoid using the latest tag when deploying containers in production, as it is harder to track which version of the image is running and more difficult to roll back to a working version.

Instead, specify a meaningful tag such as v1.42.0.

I agree with that recommendation, however the concern I've raised in this issue has nothing to do with that recommendation.

This is a recommendation for running the production containers which many of people do not. Many (probably even the majority) are going to be using no tag which defaults to :latest. (People are usually modifying their images then pushing them again and over again, with :latest tag.) (Edit: And this is not only about the :latest tag. See Update 1 below)

As I'm typing this, there is a 3rd occurrence I'm seeing people are struggling because of this and I'm having to explain them the Akash providers are heavily caching the images, they should use a new tag for every new image update.

My point is not to lobby for using the :latest tag but rather to fix the imagePullPolicy which is not currently set to Always for the untagged images (or tagged with :latest tag) on the Akash providers. I suggest to stick with the default imagePullPolicy otherwise there are more and more people will be hitting this issue, there is no doubt about it.

Update 1: And then, the same applies to the other than :latest image tags. There are plenty of projects leverage CI/CD for constantly building and re-pushing the same tags, when that happens, the libraries are getting updated (i.e. openssl library in the nginx container). In this case, the image tag stays always same, just as I've already described above https://github.com/ovrclk/akash/issues/1354#issuecomment-907866119 Which leads to security issues when Akash provider caches these images and is not re-pulling them.

Basically this deviation from the defaults is not just a nuisance but rather a security issue in not having to re-pull the images. (simplest example: nginx:stable won't get re-pulled, therefore, users won't get security updates on the redeployment nor other people who make their first nginx:stable deployment will likely to receive a stale cached copy of nginx:stable [which may turn out to not be as :stable as it sounds ;)])

zjor commented 2 years ago

100% agree with @arno01 I wish :latest worked out of the box

tidrolpolelsef commented 2 years ago

My recommendation back when mainnet launched was to block :latest, :stable etc. The labels don't actually mean anything unfortunately. For example, there is no guaranteed that :latest is in fact the latest.

andy108369 commented 2 years ago

For example, there is no guaranteed that :latest is in fact the latest.

It should guarantee that for the moment someone deploys the app, which isn't working that way as of now because of imagePullPolicy: IfNotPresent. It won't mean it's the :latest some prolonged time from the moment someone deployed the app.

This is how K8s works and have always been working by default everywhere for the :latest tag, I've mentioned that above

I'd prefer sticking with the widely accepted defaults (for imagePullPolicy) rather than inventing something new.

And should we want :latest to actually mean they will always be :latest, we should probably look into implementing something like https://keel.sh which would automatically take care updating the images whenever there is a new behind the :latest tag. But there should be a toggle for that, i.e. services.<name>.image_auto_update: true for example.

As a compromise, the imagePullPolicy could be set via the SDL i.e. services.<name>.imagePullPolicy: always for example.

andy108369 commented 2 years ago

To get the default image pull policy, we should only remove this line https://github.com/ovrclk/akash/blob/v0.16.4/provider/cluster/kube/builder/workload.go#L56

From the doc

When you first create a Deployment, StatefulSet, Pod, or other object that includes a Pod template, then by default the pull policy of all containers in that pod will be set to IfNotPresent if it is not explicitly specified. This policy causes the kubelet to skip pulling an image if it already exists.

So K8s sets PullIfNotPresent by default, just like we do, except that when we do it, we ignore these special conditions:

Default image pull policy When you (or a controller) submit a new Pod to the API server, your cluster sets the imagePullPolicy field when specific conditions are met:

if you omit the imagePullPolicy field, and the tag for the container image is :latest, imagePullPolicy is automatically set to Always;

if you omit the imagePullPolicy field, and you don't specify the tag for the container image, imagePullPolicy is automatically set to Always;

if you omit the imagePullPolicy field, and you specify the tag for the container image that isn't :latest, the imagePullPolicy is automatically set to IfNotPresent.

https://kubernetes.io/docs/concepts/containers/images/#imagepullpolicy-defaulting

As @tidrolpolelsef said, it's one of those weird config options where 'unset' has its own unique meaning :-)

andy108369 commented 1 year ago

PR https://github.com/ovrclk/provider-services/pull/54

andy108369 commented 1 year ago

FWIW, I've tested this further, apparently the HEAD requests are used for obtaining the image reference and they aren’t triggering the Docker Hub rate limiter upon Pod restart even with the imagePullPolicy: Always, as long as image reference haven't been updated on the remote. I.e., this triggers the rate limiter only when it actually pulls the new image, not just queries (HEAD) for the new one.

https://asciinema.org/a/541059

The Docker Hub API isn't too restrictive (100 pulls per 6 hours per IP address). Source

anilmurty commented 1 year ago

Wanted to update this thread after my discussion with @boz today: Pros of supporting/ allowing users to use the "latest" tag in SDLs (and pulling a fresh uncached image when used or no tag specified):

Allows autoupdating of container images in cases like presearch
Allows us to potentially build a CI/CD style pipeline where new artifacts get deployed automatically to akash
Allows easier integration with solutions like Fleek.co
Is consistent with Kubernetes default behavior

For these reasons we are going to work to allow this. The only concern/ downside is making sure we document this behavior clearly and note that a running pod will not be updated unless the pod is terminated (forcing a new pod to be spun up, which will result in the "latest" latest image to be pulled down) - which I believe is how solutions like the presearch "autopdater" handle this anyway (their autoupdater is a fork of https://github.com/containrrr/watchtower)

andy108369 commented 1 year ago

forcing a new pod to be spun up, which will result in the "latest" latest image to be pulled down

And this (the pod restart) can be easily triggerred from within the pod itself or the outside (through lease-shell, then kill <child process of PID 1> inside the pod). Either way, this can also be scheduled, etc

akash-network / support