Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Feature request: custom VHD or a way to prepull docker images offline #1532

Open Timvissers opened 4 years ago

Timvissers commented 4 years ago

Request: a way to prepull custom Docker images, so that they are already available on the worker node and do not have to be pulled after a node starts in response to a scale-up event.

Context: We have huge Docker images (>10GB) which have already been optimized in size. I tested pulling one of them (from a premium ACR from a geolocation that was in place), triggered by a Kubernetes job: it takes about 6 minutes. We run Kubernetes jobs for which it's crucial to start as soon as possible. We can live with a scale-up time of roughly 2 minutes for a new node, but we cannot afford to lose 6 extra minutes on pulling the image.

I saw the current VHD packer scripts, which are already prepulling docker images. This request is to bring this to the customer.

zhiweiv commented 4 years ago

We'd like this feature too; we have a similar use case.

0x53A commented 4 years ago

This is especially important for Windows nodes

jluk commented 4 years ago

@timvissers thanks for opening this request. Do I read it correctly that the root problem is that image pull time is too long for your scenario? If so, could I retitle your request as "Reduce image pull-time on AKS scale"?

We have options to address that, such as integrating Project Teleport: https://azure.microsoft.com/en-us/resources/videos/azure-friday-how-to-expedite-container-startup-with-project-teleport-and-azure-container-registry/

zhiweiv commented 4 years ago

I think that for now, especially for Windows containers, pre-caching base images is the easiest and most stable way.

Timvissers commented 4 years ago

@jluk Thanks for your comment. I will investigate Teleport; I was unaware of it.

I would suggest not renaming the request to 'reduce pull-time'. Maybe it should be called 'customisable worker nodes' or similar? Besides offline prepulling of Docker images, custom worker nodes could also be used to install extra exporters (Prometheus, for example), Filebeat collectors for logging, or other software that teams find useful on worker nodes.

I think AKS is lagging a bit behind on worker node customization compared to other big cloud providers' managed Kubernetes solutions. I do see that Packer is already in use in AKS Engine, so the effort would be to just bring this to the customer.

jluk commented 4 years ago

@timvissers the Teleport integration requires a dependency chain to be unblocked, but it is a path we're investigating to reduce image pull time issues.

To clarify my previous ask to rename: I would like to understand the specific customization needs behind the ask for a BYO image scenario. Often the items needing customization at the OS level have alternative solutions on the existing OS, or we already plan to address the root problem (like slow image pull time via Teleport).

As for custom OS images for worker nodes that are managed by cloud providers, there are none to my knowledge that will give you actual support/management of customized nodes. AKS is quite clear on this by only offering a managed node, which qualifies for true Azure support / on-call. Any full customization needed can be done with AKS-Engine, which does not provide support but does offer the full suite of customization you could hope for.

The support experience you will face varies widely if you try to get help on an unmanaged "BYO image" node from any provider. That being said, if you are comfortable with no support on a BYO image and acknowledge you only get support on the control plane, are your requirements still met?

zhiweiv commented 4 years ago

Our requirement is relatively simple: pre-cache the .NET/ASP.NET base images to improve the startup time of pods on scaled-up Windows nodes. Our workloads are all based on .NET Framework, and it takes a long time to pull and extract the base image.

The best approach: AKS provides additional Windows image SKUs with these base images out of the box, and we can choose the SKU while creating a Windows pool.

The second approach: AKS provides the ability to bring our own images, which we build based on the official AKS images. Only the control plane is supported by Azure; we take care of the worker nodes ourselves.

Timvissers commented 4 years ago

Thank you for the extra insights, also about AKS-Engine. Currently, though, I don't think AKS-Engine is the best option for me, for the following reasons:

So, yes, I'm OK with no support on the data plane, but I'm not yet at the point where I'm OK with no support on the control plane. In this case, I would be taking a supported base image and just adding some docker pull statements in a Packer file, so the changes are minor. We have been doing this for 1.5 years on another cloud provider. We are planning to migrate to Azure, hence this feature request.
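A minimal sketch of the "docker pull statements in a Packer file" idea, written in Python so that it emits the JSON fragment. It assumes Packer's JSON template format with a `shell` provisioner and an `inline` command list. The image name and the `prepull_provisioner` helper are hypothetical, and the builder section (which base image to start from) is left out because the comment does not specify it. It illustrates the approach only; it is not an AKS-supported workflow.

```python
import json
import shlex


def prepull_provisioner(images):
    """Build a Packer 'shell' provisioner dict that runs one docker pull per image.

    Assumes Docker is installed on the base image and usable by the
    provisioning user (some setups would need 'sudo' in front).
    """
    return {
        "type": "shell",
        # shlex.quote guards against shell metacharacters in image references.
        "inline": [f"docker pull {shlex.quote(image)}" for image in images],
    }


# Hypothetical image reference, for illustration only.
images = ["myregistry.azurecr.io/team/big-job:1.0"]

# Only the provisioners section is produced; builders are intentionally omitted.
template_fragment = {"provisioners": [prepull_provisioner(images)]}
print(json.dumps(template_fragment, indent=2))
```

Images pulled this way end up in the node image's local Docker cache, so a pod scheduled on a freshly started node would skip the registry pull for those images.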

I am open to alternative solutions, but it's just that for me there seems to be no easy one:

Other options:

jluk commented 4 years ago

Thanks for all the feedback - @MikkelHegn as FYI on the Windows caching requests from @zhiweiv. @zhiweiv if you were provided a BYO image scenario, would zero support of the data plane also be acceptable?

@Timvissers I'm assuming you're running quite a large Linux image, or is it Windows? A 13GB image taking ~5 minutes is about what I would expect. You are correct that the wait is incurred by both pull time and decompression.
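A back-of-the-envelope sketch of the figures quoted in this thread (13GB in about 5 minutes; more than 10GB in about 6 minutes; roughly 2 minutes of node scale-up). The function names are made up for illustration, decimal units (1 GB = 1000 MB) are an assumption, and the rate covers pull and decompression together.

```python
def effective_throughput_mb_s(image_gb, minutes):
    """Average end-to-end rate in MB/s, assuming 1 GB = 1000 MB."""
    return image_gb * 1000 / (minutes * 60)


def time_until_job_starts_min(scale_up_min, pull_min):
    """Node scale-up and image pull happen one after the other, so they add up."""
    return scale_up_min + pull_min


# 13 GB in ~5 minutes
print(round(effective_throughput_mb_s(13, 5), 1))  # 43.3

# ">10GB" in ~6 minutes: a lower bound, since the image is larger than 10 GB
print(round(effective_throughput_mb_s(10, 6), 1))  # 27.8

# ~2 minutes scale-up plus ~6 minutes pull
print(time_until_job_starts_min(2, 6))  # 8
```

With these numbers the image pull accounts for about three quarters of the wait before a job can start on a new node.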

Thanks for confirming that no support of the data plane is acceptable to you if you bring your own nodes; this is something we're open to discussing. Would you mind sharing other generic requirements you may have for customizing OS nodes? I read that you mentioned additional OS logging/binaries.

zhiweiv commented 4 years ago

We are OK with zero support of the data plane in a BYO image scenario.

Timvissers commented 4 years ago

@jluk We use Linux on Standard_F8s_v2 with a 100 GB disk.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed if no further activity occurs. Thank you!

zhiweiv commented 4 years ago

Any update?

palma21 commented 4 years ago

It seems this thread is leaning a bit towards BYO image support, which is not something we're planning for the foreseeable future.

I've created a separate issue specifically for Teleport support, which is being worked on: https://github.com/Azure/AKS/issues/1785

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

EPinci commented 3 years ago

Hey, as an update to this, there are two features coming up. The first is "Scale down mode" #2061, which will allow you to "turn off" nodes without decommissioning the image (and thus not lose the pulled images). The second is "Teleport" #1785, which caches image layers, as already mentioned in the thread. You can look at the mentioned issues for details.

Thank you.