aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Allow ephemeral-storage capacity overrides for instance types (per node template or provisioner) #2723

Closed wkaczynski closed 7 months ago

wkaczynski commented 1 year ago

Tell us about your request

Currently there is no way to let karpenter know that, during the bootstrap of a node with NVMe instance volumes, the kubelet root is re-mounted onto an array created out of those volumes, effectively changing the node's ephemeral-storage capacity. Possible solutions would be:

https://github.com/aws/karpenter/pull/2390 seems to offer some interesting options as well

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Without ephemeral-storage overrides, karpenter is unable to select instances whose ephemeral storage is provided by an instance-volume-backed array for pods with ephemeral-storage requirements, unless an EBS volume matching the array size is added to blockDeviceMappings (a volume that will effectively go unused).
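
For context, the scheduling requirement in question is an ordinary pod-level ephemeral-storage request; a minimal illustration (name, image, and size are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-workload        # placeholder name
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:latest
          resources:
            requests:
              # Karpenter compares this request against the instance type's
              # advertised ephemeral-storage capacity, which today does not
              # reflect an NVMe instance-store RAID remount.
              ephemeral-storage: 500Gi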

Are you currently working around this issue?

There is currently no good workaround. In our case we need to add an additional requirement on "karpenter.k8s.aws/instance-local-nvme" to the provisioner (or the pods) so that karpenter does not provision instances whose NVMe instance storage is smaller than our EBS configuration (otherwise the pods would never be scheduled on the bootstrapped nodes). The other issue is that if karpenter chooses to add bigger instances, it is very likely to overprovision nodes (the pods will eventually schedule on a smaller number of nodes because ephemeral-storage >> EBS size) and then remove the empty nodes.
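
A rough sketch of that first workaround on a v1alpha5 Provisioner; the label key comes from the description above, while the resource names and the size threshold are illustrative:

    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: nvme-backed                       # illustrative name
    spec:
      requirements:
        # Only launch instance types whose local NVMe storage is at least as
        # large as the EBS size configured in blockDeviceMappings
        # (threshold illustrative).
        - key: karpenter.k8s.aws/instance-local-nvme
          operator: Gt
          values: ["899"]
      providerRef:
        name: default                         # illustrative AWSNodeTemplate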

Another workaround could be to match the EBS size in blockDeviceMappings to the total NVMe instance-store size, which would generate additional cost (and the EBS volume would be effectively unused).
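
And a sketch of that second workaround on the AWSNodeTemplate, with the EBS volume sized to roughly match the NVMe array (device name and sizes are illustrative):

    apiVersion: karpenter.k8s.aws/v1alpha1
    kind: AWSNodeTemplate
    metadata:
      name: default                           # illustrative name
    spec:
      blockDeviceMappings:
        - deviceName: /dev/xvda
          ebs:
            # Sized to match the total NVMe instance-store capacity so the
            # advertised ephemeral-storage lines up with what the RAID-0
            # provides; the EBS volume itself ends up effectively unused.
            volumeSize: 900Gi
            volumeType: gp3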

Additional Context

No response

Attachments

No response

jonathan-innis commented 1 year ago

We have an open PR #2554 that's working on surfacing instance store volumes through the AWSNodeTemplate. We are discussing taking that PR a step further where, if you specify this virtualName in the NodeTemplate, we would make some assumption about your intention to use the instance store volume for your ephemeral storage and then use that value as the ephemeral-storage size.
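
As a rough sketch of what that could look like on the AWSNodeTemplate (whether and how #2554 ends up exposing virtualName is still under discussion, so treat the field and values below as an assumption rather than the final API):

    apiVersion: karpenter.k8s.aws/v1alpha1
    kind: AWSNodeTemplate
    metadata:
      name: default                           # illustrative name
    spec:
      blockDeviceMappings:
        # Hypothetical mapping of the first instance-store (NVMe) volume; the
        # idea under discussion is that specifying virtualName would signal
        # that this volume backs the node's ephemeral-storage.
        - deviceName: /dev/xvdb
          virtualName: ephemeral0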

I think that eventually this work extends into Instance Type Settings and #2390. Ideally, we could discover the instance store for a given instance type and always assume that it is being used for ephemeral-storage, so that no user-side configuration is needed.

wkaczynski commented 1 year ago

Would this "step further" for https://github.com/aws/karpenter/pull/2554 also address the case of multiple NVMe instance volumes? Or would we need to explicitly map each instance volume separately in the blockDeviceMappings section (potentially needing a separate AWS node template for each instance type in the same family)?

https://github.com/aws/karpenter/pull/2390 could be a good option too, but it wouldn't support different NVMe array setups for the same instance types (ideally we'd allow this per pair of AWS node template (or provisioner) and instance type). Not that it wouldn't solve the issue we're facing, as we always create a raid0 composed of all available NVMe instance volumes.

jonathan-innis commented 1 year ago

also address the case of multiple nvme instance volumes

Yes, this should address the case of multiple nvme instance volumes; however, you are correct that without #2390, you would have to create a separate Provisioner for each different array setup.

could be a good option too but it wouldn't support different nvme array setups for the same instance types

There's some extensions of #2390 that we have thought about where you could proxy instance type setups to create your own "custom" instance type, but that seems a bit further down the line.

wkaczynski commented 1 year ago

That sounds great, proxy instance type setups could add a lot of flexibility.

I guess we can live with either the extension to https://github.com/aws/karpenter/pull/2554 (assuming it does not break the bottlerocket bootstrap image NVMe setup from https://github.com/bottlerocket-os/bottlerocket/discussions/1991#discussioncomment-3265188) or with the basic functionality of https://github.com/aws/karpenter/pull/2390.

Please give us an update once the approximate timeline for availability of any of these options is known.

jonathan-innis commented 1 year ago

Sure @wkaczynski, I think @bwagner5, as the assignee on #2554, should be able to give you a good timeline on that PR to allow the initial NVMe functionality.

For instance types and #2390, this was put on the back burner in favor of some other work, but it should be picked up again soon. Once the RFC goes in, that should be a good indicator of when the work is about to start.

wkaczynski commented 1 year ago

also address the case of multiple nvme instance volumes

Yes, this should address the case of multiple nvme instance volumes; however, you are correct that without https://github.com/aws/karpenter/pull/2390, you would have to create a separate Provisioner for each different array setup.

Just a thought: with separate provisioners (since the provisioners for different array setups, e.g. just different NVMe disk counts, would be selected at random), we wouldn't necessarily see the most cost-optimal instances selected, right? So we could end up getting bigger and more expensive instances than needed.

bwagner5 commented 1 year ago

I don't think #2554 will take care of this use case. Even if instance stores can be mapped as block devices, the mapping doesn't indicate the configuration the volumes will be used in (i.e. if 2 volumes are mapped, does that mean they'll be in a RAID-0, a RAID-1, etc.?).

I'm wondering if it would make more sense to configure the instance-store volumes within the Karpenter AMI Family itself. We could, by default within the AL2 amiFamily, RAID the volumes and remount where kubernetes components point to storage.

cep21 commented 1 year ago

you would have to create a separate Provisioner for each different array setup

Are there examples for this? Would this happen inside spec.userData or spec.blockDeviceMappings?

bwagner5 commented 1 year ago

There is ongoing work in the EKS optimized AL2 AMI to set up a RAID-0 out of the instance storage disks and remount containerd and kubelet onto it. Once that PR is merged into the EKS Optimized AMI, we can set the bootstrap flag within Karpenter to enable the new functionality and adjust the node ephemeral-storage capacity to assume a RAID-0 setup for instance types with NVMe instance storage. https://github.com/awslabs/amazon-eks-ami/pull/1171
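
For reference, once that AMI change lands, the RAID-0 setup is requested through the EKS bootstrap script; a minimal sketch of the invocation, assuming the option is named --local-disks as in the linked amazon-eks-ami PR (the cluster name is a placeholder, and Karpenter would normally generate this call itself for the AL2 amiFamily):

    #!/bin/bash
    # Assemble the NVMe instance-store disks into a RAID-0 and remount the
    # kubelet/containerd directories onto it before the node joins the cluster
    # (flag name taken from awslabs/amazon-eks-ami#1171; verify against the AMI docs).
    /etc/eks/bootstrap.sh my-cluster --local-disks raid0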

dschaaff commented 1 year ago

We'll want that option for Bottlerocket too. We already use a startup container to format and mount a RAID array of the disks; we just need a way to properly account for the node's ephemeral storage in the kubelet.

cep21 commented 1 year ago

Our current karpenter instances are using EBS volumes since that's what karpenter currently supports. We don't need EBS volumes and would rather use the instance storage for ephemeral storage. This would save thousands of dollars a month on our AWS bill. Very excited to see this ticket get progress.

ryanschneider commented 1 year ago

In the meantime, now that the new EKS AMI has setup-local-disks, I think we can add this to our AWSNodeTemplate to get the SSDs set up:

userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
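    # setup-local-disks (shipped in the EKS optimized AMI) assembles the NVMe
    # instance-store disks into a RAID-0 and remounts the kubelet/containerd
    # directories onto it.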
    /bin/setup-local-disks raid0

    --BOUNDARY--

taylorturner commented 1 year ago

@ryanschneider any update on whether or not your suggestion is working?

ryanschneider commented 1 year ago

@taylorturner we ended up using a custom script since we wanted more control than setup-local-disks provided but I did test it once and it seemed to work.

purnasanyal commented 11 months ago

Yes, this feature is important.

alec-rabold commented 11 months ago

This would be a very useful feature for us too; took a stab at a possible solution: https://github.com/aws/karpenter/pull/4735

armenr commented 10 months ago

This is becoming a cost-prohibitive issue for our company as well.

armenr commented 9 months ago

Following up -- is there any intention to listen to the customers here, and let us save on thousands of dollars of wasted spend, monthly?

jonathan-innis commented 9 months ago

is there any intention to listen to the customers here, and let us save on thousands of dollars of wasted spend, monthly

Apologies for the missed response here. There is a small number of maintainers trying to keep up across a large number of requests on the project. We were working hard to push out the beta and some other high-priority features, and now that we are unblocked on those, we will start to burn down the list of open PRs that are out there.

Looking at #4735 at a high level, it sounds like a fairly reasonable approach to me. It allows the user to specify how they want their NVMe storage to be used and then configures the nodes accordingly šŸŽ‰

armenr commented 9 months ago

@jonathan-innis - Thank you for the prompt and informative follow-up. We all appreciate the enormous amount of work you're all putting in.

And thank you for putting eyes on this specific issue, and the accompanying PR. It's really exciting to know it's getting attention, and seems to be coming down the pipeline soon šŸ˜Ž

Thanks again! šŸ™ŒšŸ¼

cep21 commented 9 months ago

For some numbers, we've noticed a consistent 18% of our "EC2-other + EC2 Instances" bill is spent on these non-ephemeral disks, due to the large container images we deploy.

This ticket will have a real, material, and noticeable impact on the cost to run services in AWS.

jonathan-innis commented 7 months ago

I think since #4735 got merged, we can consider this one closed. https://github.com/aws/karpenter-provider-aws/issues/2394 should cover the other case where we want to scale EBS volumes dynamically based on pod resource requests.
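
For anyone landing here later, a rough sketch of how the merged #4735 behaviour surfaces on the node class, assuming the field is named instanceStorePolicy as in that PR (check the released docs for the exact API):

    apiVersion: karpenter.k8s.aws/v1beta1
    kind: EC2NodeClass
    metadata:
      name: default                           # illustrative name
    spec:
      amiFamily: AL2
      # With RAID0, the instance-store NVMe disks are combined into a single
      # array and the node's ephemeral-storage capacity is derived from it, so
      # no oversized EBS volume is needed (field name per #4735; verify against
      # the current documentation).
      instanceStorePolicy: RAID0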