Azure / karpenter-provider-azure

AKS Karpenter Provider
Apache License 2.0

Custom node image for utilizing pre-cached image layer #252

Open HakjunMIN opened 6 months ago

HakjunMIN commented 6 months ago

Tell us about your request

Currently it looks like only predefined images are supported on the NodeClaim CRD. Could a custom image that has pre-cached container image layers be supported there?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

With Karpenter, I have to run AI/ML container images built on NVIDIA images. When Karpenter scales out, image pulls need to be faster, which calls for nodes that already have the images cached.

Are you currently working around this issue?

Artifact streaming in AKS, but it is much slower than a local cache. A local in-cluster registry could also be used, but it is a burden to operate.

Additional Context

No response

Attachments

No response


tallaxes commented 6 months ago

Could you describe the process you use to build the custom node image? (I'm especially interested in whether it is based on an AKS node image, since that affects bootstrap.)

HakjunMIN commented 6 months ago

As far as I know, AKS node images cannot be customized, but I would like to do this the AWS way, such as

I don't yet have an idea of how to build a custom node image, but if there is Packer or another tool for building AKS node images, I would want to use it to bake in pre-cached ML images.

My goal is to reduce the time it takes to pull large AI/ML images by using a local cache in a Karpenter environment.

Bryce-Soghigian commented 6 months ago

The templates and scripts used to build AKS node images with Packer are all open source: https://github.com/Azure/AgentBaker.

That being said, step 1 is to enable artifact streaming on Karpenter nodes. I have a POC for this; I just didn't have the time to set up the e2e test, as it's a bit more involved. https://github.com/Azure/karpenter-provider-azure/pull/121 was the POC.

Custom node images aren't in our immediate plans, as we are first making things reliable and stable, but artifact streaming may be a start.

One older project that may be worth mentioning is kamino: https://github.com/jackfrancis/kamino?tab=readme-ov-file The idea here, IIRC, is to follow a prototype pattern. The prototype is a conceptual "golden node" that has your images cached. We snapshot that node and use the resulting node image for all of your nodes, so every new node starts with the things you need already cached.

When we do tackle something like this, I imagine we will go in a direction like that, so that the node image you are using has everything we need on the AKS side and isn't doing too much, but you still get that cache performance improvement.
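To make the benefit of a "golden node" concrete, here is a minimal Python sketch of the underlying idea: container runtimes skip layers that are already present locally by digest, so a node started from a snapshot with cached layers only pulls what is missing. The `Layer` type, the digests, and the sizes are all hypothetical and chosen for illustration; this is not code from kamino or Karpenter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    digest: str      # content-addressed layer ID (hypothetical values below)
    size_bytes: int  # compressed size of the layer


def bytes_to_pull(image_layers, cached_digests):
    """Sum the sizes of layers that are not already in the node's local cache."""
    return sum(
        layer.size_bytes
        for layer in image_layers
        if layer.digest not in cached_digests
    )


# Hypothetical AI/ML image: large shared base layers plus a small app layer.
image = [
    Layer("sha256:cuda-base", 6_000_000_000),
    Layer("sha256:ml-framework", 3_000_000_000),
    Layer("sha256:app", 50_000_000),
]

stock_node_cache = set()  # fresh node, nothing cached
golden_node_cache = {"sha256:cuda-base", "sha256:ml-framework"}  # baked into the snapshot

cold = bytes_to_pull(image, stock_node_cache)
warm = bytes_to_pull(image, golden_node_cache)
print(f"stock node pulls {cold / 1e9:.2f} GB, golden node pulls {warm / 1e9:.2f} GB")
```

With the large base layers baked into the node image, only the small application layer remains to be pulled at scale-out time, which is the saving this request is after.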

HakjunMIN commented 6 months ago

@Bryce-Soghigian Thank you very much. As you suggested, I will try artifact streaming first and then move to kamino. I believe kamino can work well with Karpenter too. I'll certainly test it.

Bryce-Soghigian commented 6 months ago

Kamino will not work with Karpenter in the project's current state, for a couple of reasons.

  1. Karpenter currently pulls from Community Image Galleries and doesn't use SIG (Shared Image Gallery) images.
  2. Kamino operates on the VMSS data model. Karpenter provisions single-instance VMs rather than leveraging a scale set.
  3. Karpenter has no mechanism to query custom images in its current state.
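The first point can be illustrated by how the two kinds of image references differ. The Python sketch below classifies an image version ID as community gallery or SIG based on the publicly documented Azure resource ID shapes, as I understand them. The function name, the regular expressions, and the example IDs are assumptions for illustration and do not come from the Karpenter provider code.

```python
import re

# Azure Compute Gallery (formerly Shared Image Gallery, "SIG") image version ID.
# It is scoped to a subscription and resource group.
SIG_RE = re.compile(
    r"^/subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute"
    r"/galleries/[^/]+/images/[^/]+/versions/[^/]+$",
    re.IGNORECASE,
)

# Community gallery image version ID. It has no subscription scope.
COMMUNITY_RE = re.compile(
    r"^/CommunityGalleries/[^/]+/images/[^/]+/versions/[^/]+$",
    re.IGNORECASE,
)


def classify_image_id(image_id: str) -> str:
    """Return 'community', 'sig', or 'unknown' for an image version ID."""
    if COMMUNITY_RE.match(image_id):
        return "community"
    if SIG_RE.match(image_id):
        return "sig"
    return "unknown"


# Placeholder IDs; the gallery, image, and resource group names are made up.
community_id = "/CommunityGalleries/example-gallery/images/example-image/versions/1.0.0"
sig_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-rg"
    "/providers/Microsoft.Compute/galleries/myGallery/images/golden-node/versions/1.0.0"
)

print(classify_image_id(community_id))
print(classify_image_id(sig_id))
```

A snapshot-based golden image would live in the user's own subscription, so it would be referenced by the second shape, which is the kind of image the comment above says Karpenter does not use today.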

I will get started on adding artifact streaming support. There is a fair bit of work to do before we can support a kamino style node image cache layer in karpenter.

HakjunMIN commented 6 months ago

@Bryce-Soghigian Oh, understood. But does that mean Artifact Streaming isn't supported with Karpenter yet? What is the approximate ETA for adding Artifact Streaming to Karpenter?

Bryce-Soghigian commented 6 months ago

@HakjunMIN I created a separate issue to track the artifact streaming work https://github.com/Azure/karpenter-provider-azure/issues/266.

Long term, it would be great to do something similar to what you are describing here in Karpenter. We still need to work through many other things first, however. Please subscribe to the artifact streaming issue for further updates there.

HakjunMIN commented 6 months ago

@Bryce-Soghigian

The AWS link below shows a perfect way to implement this. Besides artifact streaming, it would be great if a custom snapshot could be used as the node class image. Could you add it to your backlog?

https://github.com/aws-samples/bottlerocket-images-cache?tab=readme-ov-file#with-karpenter