aws / aws-nitro-enclaves-cli

Tooling for Nitro Enclave Management
Apache License 2.0
127 stars 81 forks

Can I assign a GPU resource to an enclave? #543

Open BaiChienKao opened 1 year ago

BaiChienKao commented 1 year ago

I'm currently engaged in research involving enclaves and I'm interested in optimizing certain applications by utilizing GPU resources. Unfortunately, I cannot find a way to assign a GPU resource to an enclave. My research from 2021 indicated that this feature was not supported. I'm curious if there have been any developments since then, and whether GPU assignment for enclaves is now possible.

meerd commented 7 months ago

Hello @BaiChienKao,

Enabling GPU attachment for Enclaves is on our radar, but there are no immediate plans to implement this feature.
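For context, the only resources that can currently be assigned to an enclave are vCPUs and memory; there is no device pass-through. A minimal sketch of the allocator configuration, assuming the default `/etc/nitro_enclaves/allocator.yaml` path used by this repo (the values are illustrative):

```yaml
---
# /etc/nitro_enclaves/allocator.yaml (sketch; adjust values to your instance)
# Memory (in MiB) reserved for enclave use
memory_mib: 4096
# Number of vCPUs reserved for enclave use
cpu_count: 2
```

After editing, restart the allocator service (`systemctl restart nitro-enclaves-allocator.service`) and launch with `nitro-cli run-enclave --cpu-count 2 --memory 4096 --eif-path <your-eif>`. GPUs do not appear anywhere in this model today.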

andrcmdr commented 2 months ago

This should now be a top priority for AWS, given how AI technologies are evolving and the arrival of the first discrete GPU adapters with TEE support (NVIDIA's Hopper H100/H200 and Blackwell architectures) for confidential computing (CC) mode on the GPU, especially since P5 and P5e EC2 instances with H100s are already available in AWS.

But it looks like Nitro still does not support GPU TEE in AWS, nor attaching discrete adapters on the PCI bus to an enclave, even though the NSM module itself is a virtual (virtio-based) PCI device for interacting with the Nitro hypervisor (hopefully its code will be published as well, since it is based on KVM - this would improve the chain of trust and give stronger attestation for all components of the Nitro platform).

There are other options available - KVM/QEMU VMs with support for AMD SEV-SNP or Intel TDX (VM-based CPU TEEs), and NVIDIA's Hopper/Blackwell MIG-based GPU TEE enabled with nvtrust. Still, AWS and Nitro remain very usable for running confidential computing workloads.

You should definitely give this closer consideration and implement it as soon as possible.

Cc @meerd @andraprs @eugkoira @axlprv @agraf @jdbean

Our ML research and cloud infrastructure teams at @sentient-xyz (https://sentient.foundation) really do need the GPU TEE feature on P5 and P5e instances with H100/H200 GPUs, with support for on-chip confidential computing (MIG-based TEE in the Hopper architecture) in isolated GPU memory. This is essential for training and fine-tuning large models on sensitive non-public data.

I found only this article, which mentions P5, P5e, and Nitro but doesn't give any meaningful information about GPU TEE support and only raises false expectations. In fact, the article mentions Nitro solely in the context of networking - the 3,200 Gbps of Elastic Fabric Adapter (EFA) v2 networking enabled by the AWS Nitro System - whereas for end users Nitro is mostly the NSM module interacting with the hypervisor through an IOCTL interface for VM-based TEE.

https://aws.amazon.com/blogs/machine-learning/introducing-three-new-nvidia-gpu-based-amazon-ec2-instances/

> We have combined NVIDIA’s powerful GPUs with differentiated AWS technologies such as AWS Nitro System, 3,200 Gbps of Elastic Fabric Adapter (EFA) v2 networking, hundreds of GB/s of data throughput with Amazon FSx for Lustre, and exascale computing with Amazon EC2 UltraClusters to deliver the most performant infrastructure for AI/ML, graphics, and HPC.
>
> To power the development, training, and inference of the largest large language models (LLMs), EC2 P5e instances will feature NVIDIA’s latest H200 GPUs, which offer 141 GBs of HBM3e GPU memory, which is 1.7 times larger and 1.4 times faster than H100 GPUs. This boost in GPU memory along with up to 3200 Gbps of EFA networking enabled by AWS Nitro System will enable you to continue to build, train, and deploy your cutting-edge models on AWS.

TonyGiorgio commented 1 month ago

I would also like this. For now, I'm connecting my enclave to another provider that runs on Azure's confidential computing offering in order to get the H100 TEE feature.