mowings opened this issue 2 years ago
Thanks for reporting this, @mowings!
Making things stateful can make them harder to administer and maintain. We’ll check the implementation impacts with the team and will get back to you soon.
@vvagaytsev -- Note that the source code sync is already stateful; the state is just randomly lost when the pod is rescheduled. Allowing end-users to provide their own manifests as an advanced feature would let you shift the issue back onto the end-user. I have even experimented with doing my own deployment of the image to "fool" the garden client into thinking that the pod is already deployed.
@mowings I understand the need, but there are a couple of implications:

- `emptyDir` clean-up happens automatically from time to time, but with a PVC there would need to be a cleanup routine that is aware of the Garden internals, so it doesn't interfere with ongoing builds.

Thinking about your problem, I have an idea for a workaround you can already try today. If you provide the option `clusterBuildkit.nodeSelector`, you can make sure that the buildkit pods are scheduled onto non-spot instances and are thus rescheduled less often. I acknowledge this might not fit your requirements (maybe you need to scale down to zero), but I wanted to mention it in case it can be an interim solution for your team.
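As a sketch, assuming your non-spot node pool carries a label like `node-type: on-demand` (the label key/value is an assumption; use whatever labels your cluster actually sets), the provider config could look like:

```yaml
# project.garden.yml -- kubernetes provider section (sketch)
providers:
  - name: kubernetes
    buildMode: cluster-buildkit
    clusterBuildkit:
      nodeSelector:
        node-type: on-demand  # assumed label identifying non-spot nodes
```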
Yeah -- that was my other thought. Not ideal for a couple of reasons -- we'll see though
Well, to be clear, I am talking about garden-util, not buildkit. Is there a nodeSelector spec for garden-util somewhere (we are using kaniko for builds at the moment)? Also, we are using Karpenter for provisioning, so things will still get shuffled around occasionally, though in theory less often than when running on spot.
@mowings sorry for missing that you're using kaniko. There is `providers[].kaniko.nodeSelector` for kaniko. Quickly zooming through the kaniko code (which I am not super familiar with yet), the `garden-util` deployment also appears to use the kaniko nodeSelector, if it is set in the kubernetes provider config in your `project.garden.yml`.

Please let us know whether this helps reduce the problem for now.
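For reference, a sketch of the equivalent kaniko configuration (again assuming a `node-type: on-demand` node label; substitute your own):

```yaml
# project.garden.yml -- kubernetes provider section (sketch)
providers:
  - name: kubernetes
    buildMode: kaniko
    kaniko:
      nodeSelector:
        node-type: on-demand  # assumed label identifying non-spot nodes
```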
This issue has been automatically marked as stale because it hasn't had any activity in 90 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!
@stefreak I think that, even given the extra hassle required to manage persistent volumes, this feature request makes sense. It would solve the emptied-build-cache problem both for users who rely on cluster autoscaling and for users of Garden Cloud's automatic environment cleanup feature. What do you think?
+1 on the scope of this issue being broader than just spot instances. AEC causes the same "need to start over" issue after scaling down the pods during cleanup.

The size/complexity of the codebase is also an important factor in how often this issue causes usability problems.
Should the PVCs be cleaned up after they haven't been used for some time, e.g. after 2 weeks? Maybe we could control this through a garden annotation.
Background / motivation
We run our garden projects in AWS EKS on spot instances, and we also periodically scale down all projects. This means that the garden-util pod is frequently rescheduled, and because the buildSyncVolume is declared as type emptyDir, its contents are often lost. As a result, rsyncs of source code up to garden-util during deployments often have to start from scratch, rather than just sending up a few diffs.
For users with constrained upload bandwidth (i.e., cable modem users), this can lead to needlessly long deployment times every time the garden-util pod gets rescheduled.
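To illustrate, the relevant volume in the garden-util pod spec is roughly equivalent to this (a sketch; the volume name is assumed and the exact manifest Garden generates may differ):

```yaml
# Sketch of the current behavior in the generated garden-util pod spec
volumes:
  - name: garden-sync  # assumed volume name
    emptyDir: {}       # contents are lost whenever the pod is rescheduled
```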
What should the user be able to do?
When the garden client rsyncs source code up to garden-util, the sync should always be fast -- even if the garden-util pod has been rescheduled onto another node.
Why do they want to do this? What problem does it solve?
Long deployments on EKS spot instances and after scale-downs.
Suggested Implementation(s)
Allow more control over garden-util deployments. At the least, allow a PVC to be declared in the project yaml and used in place of an emptyDir volume for the buildSyncVolume. Even better, it would be great if we could OPTIONALLY provide our own deployment manifests for garden-util (and even kaniko, for that matter).
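A hypothetical sketch of what the PVC option could look like (none of these fields exist in Garden today; all names below are made up for illustration):

```yaml
# project.garden.yml -- hypothetical, NOT a real Garden option
providers:
  - name: kubernetes
    buildSync:
      persistentVolumeClaim:   # hypothetical: replaces the emptyDir buildSyncVolume
        size: 10Gi
        storageClassName: gp3  # assumed EBS storage class
```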
How important is this feature for you/your team?
:cactus: Not having this feature makes using Garden painful (for us, at least)