fluxcd / image-automation-controller

GitOps Toolkit controller that patches container image tags in Git
https://fluxcd.io
Apache License 2.0

Git Caching #271

Open FLavalliere opened 2 years ago

FLavalliere commented 2 years ago

Add persistent storage / caching mechanism

We have quite a few repositories, and it seems that the ImageUpdateController does a lot of data transfer (maybe due to some configuration on our end that is too aggressive).

Our understanding is that the Image Automation Controller does a git clone into a temp directory and deletes it afterwards, as seen in the code below:

    tmp, err := os.MkdirTemp("", fmt.Sprintf("%s-%s", originName.Namespace, originName.Name))
    if err != nil {
        return failWithError(err)
    }
    defer os.RemoveAll(tmp)

If there could be some kind of persistent volume that allows caching, that could help reduce the bytes transferred.

nonylene commented 2 years ago

Pod-level caching, i.e. cloning content onto an emptyDir volume or some static tmp directory, may also be acceptable.

pjbgf commented 2 years ago

Thank you @FLavalliere for creating your first issue on Flux! :tada:

I can think of two different ways that the controller could optimise its network usage, but the approach could differ depending on the use case.

1) High number of changes

If a deployment has a high number of changes (either new tags being created or new commits on the target repositories), keeping a local copy of the repositories could save a full clone on each reconciliation. As of v0.21 we are not shallow cloning, because libgit2 does not support it.

With long-lived clones, the network cost of a no-op (when no changes took place) would be a simple fetch to refresh the git index. A downside to this approach is that if the repositories are very large, or there is a large number of repositories, the disk requirements to run IAC could be substantially higher.

A challenge to consider when implementing this is tenant isolation in multi-tenancy deployments (e.g. securely sharing the cache across tenants, etc).

2) Low number of changes

In use cases in which changes are infrequent (again, in either the container registry or target repositories), the same network optimisation can be attained for no-op by simply holding the last successfully reconciled hash. If the LastImage hasn't changed since the previous successful reconciliation, and the target repositories still have the same HEAD hash, no changes (and therefore no clones) are required.

The upside of this is maintaining low network traffic and also low disk usage. The downside is that the HEAD hash verification would be an additional call, on top of the clone, when changes took place. That should not be a problem, given the gains from not needing to clone every time.
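The check described for approach 2 could look roughly like the sketch below. The type `reconcileState`, its fields, and the function `needsClone` are hypothetical names for illustration; how the remote HEAD hash and the latest images are obtained is left to the caller.

```go
package main

// reconcileState records what the last successful reconciliation observed.
// The type and its field names are hypothetical.
type reconcileState struct {
	HeadHash     string            // HEAD hash of the target repository
	LatestImages map[string]string // image policy name -> latest image
}

// needsClone reports whether a clone is required, given the state saved by
// the last successful reconciliation and the currently observed values.
func needsClone(last *reconcileState, headHash string, latestImages map[string]string) bool {
	// No previous successful reconciliation: always clone.
	if last == nil {
		return true
	}
	// New commits on the target repository.
	if last.HeadHash != headHash {
		return true
	}
	// A policy was added or removed.
	if len(last.LatestImages) != len(latestImages) {
		return true
	}
	// Any policy now resolves to a different image.
	for policy, image := range latestImages {
		if prev, ok := last.LatestImages[policy]; !ok || prev != image {
			return true
		}
	}
	// Nothing changed: the reconciliation is a no-op and no clone is needed.
	return false
}
```

When `needsClone` returns false, the only network cost of the reconciliation is the HEAD hash lookup.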


I think approach 2 could potentially be the default, as that would save network usage across the board. I see approach 1 more as an opt-in feature.

Do you mind sharing your thoughts and maybe a bit more of your use case/setup (stats like number of Repositories, Automation, reconciling interval, git repository avg size)?

nonylene commented 2 years ago

I think both approaches can be good remedies for the cloning issues :+1:.

In my setup, image-automation-controller checks 10~20 dev image registries against a medium-sized (~20MB) repository across multiple namespaces. The container registries and the target repository are not updated very frequently (about 1 new tag per week in the registries and a few new commits per day in the repository). Currently this automation runs at a short interval (1 or 2 min) to deploy new images as quickly as possible for faster development.

This falls under case 2.

tibz-enex commented 2 years ago

We have been experiencing a similar problem, which led to a rise in our AWS costs (because of a rise in traffic through our NAT gateway). We have around 20 image repositories being scanned by ImageUpdateController (we have few image updates per day, perhaps 5 to 10 in total, so approach 2 could do).

We noticed the rise in cost after uploading a big file to our GitHub repository (which was around 80 MB after the file was added). As a temporary solution we removed this file (from our history as well), lowering the size of our repo to 3.2 MB.

Would anyone have time to work on the solutions mentioned by @pjbgf?

pjbgf commented 1 year ago

Just as a follow-up on the subject, we have now implemented go-git as the default Git implementation for the controller. This opens up a new route: enabling shallow clones, which are currently not supported but should be fairly straightforward to implement.

That would bring network bandwidth savings for clone operations. I looked at it recently and it worked fine, although there were some edge cases in which the commit ended up being empty (saw this with GitHub only). I believe that was only when pushing to stale branches, which should now be fixed by the new opt-out feature gate GitForcePushBranch.

If someone is keen on having a go at implementing this, I would recommend trying to set ShallowClone to true here: https://github.com/fluxcd/image-automation-controller/blob/43b99c65b6716dcb760ba8e0baede8b2bdc492f2/controllers/imageupdateautomation_controller.go#L293 and then testing it to ensure there are no side effects.

pjbgf commented 1 year ago

I have created a draft PR with the Shallow Clone feature. Please refer to the PR for more information on how to access the RC image and test it.