garden-io / garden

Automation for Kubernetes development and testing. Spin up production-like environments for development, testing, and CI on demand. Use the same configuration and workflows at every step of the process. Speed up your builds and test runs via shared result caching.
https://garden.io
Mozilla Public License 2.0

[FEATURE]: PVCs for the garden-util /data mount #3235

Open mowings opened 2 years ago

mowings commented 2 years ago

Background / motivation

We run our Garden projects in AWS EKS on spot instances, and we also periodically scale down all projects. As a result, the garden-util pod is frequently rescheduled, and because the buildSyncVolume is declared as type emptyDir, its contents are lost each time. Rsyncs of source code up to garden-util during deployments therefore often have to start from scratch, rather than just sending up a few diffs.

For users with constrained upload bandwidth (e.g. cable modem users), this can lead to needlessly long deployment times every time the garden-util pod gets rescheduled.

What should the user be able to do?

When the garden client rsyncs source code up to garden-util, the sync should always be fast -- even if the garden-util pod has been rescheduled onto another node.

Why do they want to do this? What problem does it solve?

It would avoid long deployments on EKS spot instances and after scale-downs.

Suggested Implementation(s)

Allow more control over garden-util deployments. At the least, allow a PVC to be declared in the project YAML and used in place of an emptyDir volume for the buildSyncVolume. Even better, it would be great if we could OPTIONALLY provide our own deployment manifests for garden-util (and even kaniko, for that matter).
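A purely illustrative sketch of what such an option might look like in the project config: the `utilStorage` block below is hypothetical and does not exist in Garden today; it only shows the shape of backing the garden-util /data mount with a PVC instead of an emptyDir volume, inside the usual kubernetes provider layout.

```yaml
# project.garden.yml -- illustrative sketch only; `utilStorage` is a
# hypothetical option, NOT an existing Garden provider field.
kind: Project
name: my-project
environments:
  - name: dev
providers:
  - name: kubernetes
    environments: [dev]
    buildMode: kaniko
    utilStorage:            # hypothetical: back the build sync volume with a PVC
      size: 20Gi
      storageClassName: gp3
```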

How important is this feature for you/your team?

:cactus: Not having this feature makes using Garden painful (for us, at least)

vvagaytsev commented 2 years ago

Thanks for reporting this, @mowings!

Making things stateful can make them harder to administer and maintain. We’ll check the implementation impacts with the team and will get back to you soon.

mowings commented 2 years ago

@vvagaytsev -- Note that the source code sync is already stateful; the state is just randomly lost when the pod is rescheduled. Allowing end users to provide their own manifests as an advanced feature would let you left-shift the issue back onto the end user. I have even played around with doing my own deployment of the image to "fool" the garden client into thinking that the pod is already deployed.

stefreak commented 2 years ago

@mowings I understand the need, but there are a couple of implications to consider.

Thinking about your problem, I have an idea for a workaround you can already try today. If you set the clusterBuildkit.nodeSelector option, you can make sure that the buildkit pods are scheduled onto non-spot instances and are therefore cleaned up less often. I acknowledge this might not fit your requirements (maybe you need to scale down to zero), but I wanted to mention it in case it can serve as an interim solution for your team.
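For reference, a minimal sketch of that workaround in the kubernetes provider config. `clusterBuildkit.nodeSelector` is the option mentioned above; the `eks.amazonaws.com/capacityType` label is just one example of a label that selects on-demand (non-spot) capacity on EKS managed node groups, so substitute whatever label fits your cluster.

```yaml
# project.garden.yml (sketch) -- pin the in-cluster buildkit pods to
# non-spot nodes so the build sync data is not lost to spot interruptions.
providers:
  - name: kubernetes
    buildMode: cluster-buildkit
    clusterBuildkit:
      nodeSelector:
        # Example label; use whatever selects your non-spot / on-demand nodes.
        eks.amazonaws.com/capacityType: ON_DEMAND
```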

mowings commented 1 year ago

Yeah -- that was my other thought. It's not ideal for a couple of reasons -- we'll see, though.

mowings commented 1 year ago

Well, to be clear, I am talking about garden-util, not buildkit. Is there a nodeSelector spec for garden-util somewhere (we are using kaniko for builds at the moment)? Also, we are using Karpenter for provisioning, so things will still get shuffled around occasionally, though in theory less often than running on spot.

stefreak commented 1 year ago

@mowings sorry for missing that you're using kaniko. There is providers[].kaniko.nodeSelector for kaniko. Quickly zooming through the kaniko code (which I am not super familiar with yet), it looks like the garden-util deployment is also configured to use the kaniko nodeSelector, if it is set in the kubernetes provider config in your project.garden.yml.
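A sketch of that configuration, assuming kaniko build mode and Karpenter-provisioned nodes as described above; the `karpenter.sh/capacity-type` label is only an example of how to select on-demand capacity.

```yaml
# project.garden.yml (sketch) -- with buildMode: kaniko, this nodeSelector
# should also be applied to the garden-util deployment, per the comment above.
providers:
  - name: kubernetes
    buildMode: kaniko
    kaniko:
      nodeSelector:
        # Example label for Karpenter-provisioned on-demand nodes; adjust as needed.
        karpenter.sh/capacity-type: on-demand
```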

Please let us know whether this helps or reduces the problem for now.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any activity in 90 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!

twelvemo commented 1 year ago

@stefreak I think this feature request makes sense even given the extra hassle required to manage persistent volumes. It would solve the problem of emptied build caches for users who rely on cluster autoscaling, as well as for users of Garden Cloud's automatic environment cleanup feature. What do you think?

porterchris commented 1 year ago

+1 on the scope of this issue being broader than just spot instances. AEC (automatic environment cleanup) causes the same "need to start over" issue after scaling down the pods during cleanup.

The size and complexity of the codebase is also an important factor in how often this issue causes usability problems.

stefreak commented 1 year ago

Should the PVCs be cleaned up after they haven't been used for some time, e.g. after 2 weeks? Maybe we could control this through a Garden annotation.
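One way that could look, as a purely hypothetical sketch: the annotation key below is invented for illustration and is not an existing Garden annotation.

```yaml
# Hypothetical sketch -- no such annotation exists in Garden today.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: garden-util-data
  annotations:
    # Hypothetical: ask Garden to clean up the PVC after 14 days without use.
    garden.io/ttl-after-last-use: "336h"
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
```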