Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] GPUs for Windows AKS #2809

Open laserprec opened 2 years ago

laserprec commented 2 years ago

Feature ETA: This feature is currently in Public Preview. GA is planned for early 2025.

We have a dedicated GPU workload that can only be run on Windows. It would be great if we could leverage AKS for this workload. We are aware of the GPU-enabled Linux nodes, so we are curious whether support for GPU-enabled Windows nodes is on the feature roadmap and, if so, whether you have an estimated timeline for its availability.

Thank you!

ghost commented 2 years ago

Hi laserprec, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 2 years ago

Triage required from @Azure/aks-pm

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 2 years ago

@immuzz, @justindavies would you be able to assist?

Issue Details
Hi AKS team, We have a dedicated GPU workload that can only be run on Windows. It would be great if we could leverage AKS for this workload. We are aware of the [GPU-enabled Linux nodes](https://docs.microsoft.com/en-us/azure/aks/gpu-cluster), so we are curious whether support for **GPU-enabled Windows nodes** is on the feature roadmap and, if so, whether you have an estimated timeline for its availability. Thank you!
Author: laserprec
Assignees: -
Labels: `feature-request`, `triage`, `windows`, `action-required`, `Needs Attention :wave:`
Milestone: -

EliiseS commented 1 year ago

I have a customer who is also interested in this feature. Are there any updates on this? They are doing cloud 3D rendering.

marosset commented 1 year ago

I have a guide on how you can manually configure GPU acceleration for Windows AKS nodes at https://github.com/marosset/aks-windows-gpu-acceleration

This does require installing the NVIDIA driver extension on the VMSS that backs the AKS Windows node pool, so it is not an ideal solution. It would be great if AKS could configure Windows nodes with the appropriate drivers!

adamrehn commented 1 year ago

@allyford I'd like to highlight a blocker that will need to be resolved in order to provide proper bin packing and scaling functionality for GPU accelerated Windows workloads on AKS. As discussed in https://github.com/microsoft/Windows-Containers/issues/333, the DirectX Graphics Kernel is not currently Silo-aware, and will expose all GPUs that are present on the host system to any container that requests a GPU. This results in incorrect behaviour when attempting to allocate individual GPUs to containers on Kubernetes worker nodes that have more than one GPU, and currently limits practical use to single-GPU VM types for worker nodes.

@fady-azmy-msft was previously handling the issue for tracking this blocker, and has directed me to continue the discussion here with the AKS team in this thread instead. All of the relevant technical details are available in both the Windows Containers issue thread and the blog post "Bringing full GPU support to Windows containers in Kubernetes", the latter of which also discusses the broader implications for deploying and scaling Windows GPU workloads on Kubernetes. Please let me know if there's any additional information that I can provide, or anything that I can do to help.
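
To make the blocker concrete, here is a toy Python model of the mismatch between what a scheduler accounts for and what a container can actually see. It is only a sketch: it is not AKS, Kubernetes, or Windows code, and `Node`, `silo_aware`, `allocate`, `visible_gpus`, and `overexposed` are hypothetical names made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy worker node. `silo_aware` is a hypothetical flag modelling whether
    the graphics kernel limits a container to the GPU it was assigned."""
    gpu_ids: list[str]
    silo_aware: bool
    assigned: dict[str, str] = field(default_factory=dict)  # container -> GPU id

    def allocate(self, container: str) -> str:
        """Scheduler-side accounting: hand out one unassigned GPU."""
        free = [g for g in self.gpu_ids if g not in self.assigned.values()]
        if not free:
            raise RuntimeError("no free GPU on this node")
        self.assigned[container] = free[0]
        return free[0]

    def visible_gpus(self, container: str) -> list[str]:
        """What the container can enumerate at runtime in this toy model."""
        if container not in self.assigned:
            return []  # containers that never requested a GPU see none
        if self.silo_aware:
            return [self.assigned[container]]
        # Behaviour described in this thread: every host GPU is exposed.
        return list(self.gpu_ids)


def overexposed(node: Node) -> dict[str, list[str]]:
    """GPUs each container can see beyond the one it was allocated."""
    return {
        c: [g for g in node.visible_gpus(c) if g != gpu]
        for c, gpu in node.assigned.items()
    }


if __name__ == "__main__":
    multi = Node(gpu_ids=["gpu0", "gpu1"], silo_aware=False)
    multi.allocate("render-a")
    multi.allocate("render-b")
    print(overexposed(multi))   # each container also sees the other's GPU

    single = Node(gpu_ids=["gpu0"], silo_aware=False)
    single.allocate("render-a")
    print(overexposed(single))  # nothing extra to see on a single-GPU node
```

In this model the two-GPU node reports extra visible GPUs for both containers, while the single-GPU node reports none, which is why single-GPU VM types sidestep the problem.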

EliiseS commented 1 year ago

@adamrehn Will this also be a problem with a single GPU machine in a scenario where that GPU is used by multiple pods?

TBBle commented 1 year ago

It won't really affect that scenario. In that case you're either already exposing the one GPU to multiple pods, or you're using an external system to assign the GPU to only one pod. In the latter case the other pods don't get any GPU, so they shouldn't be activating GPU acceleration and won't get access to the GPU.

adamrehn commented 1 year ago

+1 to what Paul said above. Everything works as expected for worker nodes with a single GPU, including exposing that GPU to multiple containers (e.g. when enabling the multitenancy option of the Kubernetes Device Plugins for DirectX).

allyford commented 8 months ago

AKS has just released Windows GPU on AKS in public preview. Please take a look at our documentation to test it out! If you have any feedback, please let us know.

allyford commented 3 months ago

We will be adding driver type selection for Windows GPU usage. This means that you'll be able to specify GRID or CUDA. Preview release expected in Sept 2024.

allyford commented 2 months ago

GPU driver type selection is on track for the Sept preview. See #4505 for updates.