FLavalliere opened this issue 2 years ago
Pod-level caching, cloning content on an emptyDir volume or some static tmp directory, may be also acceptable.
Thank you @FLavalliere for creating your first issue on Flux! :tada:
I can think of two different ways the controller could optimise its network usage, but the right approach may differ depending on the use case.
If a deployment has a high number of changes (either new tags being created or new commits in the target repositories), keeping a local copy of the repositories could save a full clone at each reconciliation - as of v0.21 we are not shallow cloning, because libgit2 does not support it. With long-lived clones, the network cost of a no-op reconciliation (when no changes took place) would be a simple fetch to refresh the git index. A downside to this approach is that if the repositories are very large, or there is a large number of repositories, the disk requirements to run IAC could be substantially higher.
A challenge to consider implementing this is tenant isolation in multi-tenancy deployments (e.g. securely sharing cache cross-tenants, etc).
In use cases in which changes are infrequent (again, in either the container registry or the target repositories), the same network optimisation can be attained for no-ops by simply holding the hash of the last successful reconciliation. If the LastImage hasn't changed since the previous successful reconciliation, and the target repositories still have the same HEAD hash, no changes (and therefore no clones) are required. The upside of this is low network traffic and also low disk usage. The downside is that the HEAD hash verification is an additional call on top of the clone when changes did take place - but that should not be a problem, given the gains from not needing to clone every time.
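The check behind approach 2 can be sketched as a pure function. The struct and field names below (`reconcileState`, `LastHead`) are hypothetical, chosen to mirror the LastImage idea above; the current HEAD would come from a cheap `git ls-remote`-style call rather than a clone:

```go
package main

// reconcileState records what the last successful reconciliation saw.
type reconcileState struct {
	LastImage string // image ref written by the last successful run
	LastHead  string // HEAD hash of the target repository at that run
}

// needsClone returns true only when either the scanned image or the
// remote HEAD moved since the last successful reconciliation;
// otherwise the controller can skip the clone entirely.
func needsClone(s reconcileState, currentImage, currentHead string) bool {
	return currentImage != s.LastImage || currentHead != s.LastHead
}
```

When both values are unchanged, the reconcile is a no-op with zero clone traffic; when either moved, the only overhead is the one extra HEAD lookup mentioned above.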
I think approach 2 could potentially be the default, as it would save network usage across the board. Approach 1 I see more as an opt-in feature.
Do you mind sharing your thoughts and maybe a bit more about your use case/setup (stats like number of Repositories and Automations, reconciliation interval, average git repository size)?
I see both approaches as good remedies for the cloning issues :+1:.
In my setup, image-automation-controller checks 10~20 dev image registries against a medium-sized (~20 MB) repository across multiple namespaces. The container registries and the target repository are not updated very frequently (one new tag per week in the registries and a few new commits per day in the repository). Currently this automation runs at a short interval (1 or 2 minutes) to deploy new images as fast as possible for faster development.
This falls under approach 2.
We have been experiencing a similar problem: a rise in our AWS costs (because of increased traffic through our NAT gateway). We have around 20 image Repositories being scanned by the ImageUpdateController (we have few image updates per day, perhaps 5 to 10 in total, so approach 2 could do).
We noticed the cost increase after uploading a big file to our GitHub repository (which was around 80 MB after the file was added). As a temporary solution we removed this file (from our history as well), lowering the size of our repo to 3.2 MB.
Would anyone have time to work on the solutions mentioned by @pjbgf?
Just as a follow-up on the subject, we have now implemented go-git as the default Git implementation for the controller.
This opens up a new route: enabling shallow clones, which is currently not supported but should be fairly straightforward to implement.
That would bring network bandwidth savings for clone operations. I looked at it recently and it worked fine, although there were some edge cases in which the commit ended up being empty (I saw this with GitHub only). I believe that only happened when pushing to stale branches, which should now be fixed by the new opt-out feature gate GitForcePushBranch.
If someone is keen on giving implementing this a try, I would recommend setting ShallowClone to true here: https://github.com/fluxcd/image-automation-controller/blob/43b99c65b6716dcb760ba8e0baede8b2bdc492f2/controllers/imageupdateautomation_controller.go#L293
And then testing it to ensure there are no side effects.
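For intuition on what a shallow clone saves, here is the CLI equivalent of that option: go-git's `CloneOptions` has a `Depth` field, and `Depth: 1` behaves like `git clone --depth 1`. The helper below is our own sketch (the function name is made up), shown as a command builder rather than a network call:

```go
package main

import "os/exec"

// shallowCloneCmd builds the git CLI equivalent of a depth-1 clone:
// only the tip commit of one branch is transferred, instead of the
// full history, which is where the bandwidth saving comes from.
func shallowCloneCmd(url, dir, branch string) *exec.Cmd {
	return exec.Command("git", "clone",
		"--depth", "1", // history truncated to the latest commit
		"--single-branch", "--branch", branch, // skip unrelated branches
		url, dir)
}
```

For a repository with years of history, this typically cuts the transferred bytes per clone dramatically, though the edge cases with empty commits mentioned above are exactly what the testing should look for.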
Add persistent storage / caching mechanism
We have quite a few repositories, and it seems that the ImageUpdateController does a lot of data transfer (maybe due to some configuration on our end that is too aggressive).
Our understanding is that the Image Automation Controller does a git clone into a temp directory and deletes it afterwards, as seen in the code below:
If there could be some kind of persistent volume that allows caching, that could help reduce the bytes transferred.