mayank64ce opened this issue 3 weeks ago
Our implementation is based on Intel's Trust Domain Extensions (TDX) and operates much like a standard virtual machine. Maybe you can try it through Azure; you will find it performs like a normal guest VM. But if you want to achieve it on your own computer, it is hard work ^ ^. We got a lot of support from Intel. Below is the CPU info:

Following the setup steps (configuring virtualization components such as QEMU, OVMF, and Kata), you can observe from within the guest environment that TDX memory encryption is active:

Regarding GPU communication, our paper details the approach we have developed. You also need to create a bridge network for the guest VM.
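For reference, here is a minimal sketch (not part of our repository) of the kind of in-guest check meant above; the exact CPU flag (`tdx_guest`) and the kernel log wording may vary with your kernel version:

```python
# Minimal sketch: quick checks inside the guest VM to confirm it is really
# running as a TDX guest with memory encryption active. Flag and log strings
# depend on the kernel version, so treat the exact text as an assumption.
import subprocess

def is_tdx_guest() -> bool:
    with open("/proc/cpuinfo") as f:
        if "tdx_guest" in f.read():      # CPU flag exposed to TD guests
            return True
    # Fallback: look for the memory-encryption banner in the kernel log.
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return "Memory Encryption Features active: Intel TDX" in dmesg

if __name__ == "__main__":
    print("TDX guest:", is_tdx_guest())
```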
Here is the implementation description for our rebuttal. I hope you find it helpful.
Regarding the implementation: we also recognize the gap between simulation experiments and real TEE-GPU environments. We have three considerations to bridge this gap:
The design of the simulation experiments is scalable, ensuring that the conclusions of the comparative experiments between GroupCover and other schemes remain consistent across different environments. Note that the compared schemes are built on the same architecture, namely Slalom [8]. In that scheme, SGX was the TEE of choice: by invoking SGX-related dynamic link libraries and implementing the IO methods in Python, Slalom achieved joint inference with SGX+GPU. Changing only the model obfuscation algorithm within the same architecture allows a like-for-like comparison of the efficiency of GroupCover and similar schemes, whether on SGX+RTX2080, a regular VM+RTX3070, or the real TEE environment tested later. Thus, our simulation results hold across different environments.
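To make the "same architecture, different obfuscation" point concrete, here is a minimal Python sketch of the Slalom-style offload pipeline with the obfuscation scheme as a pluggable component. The `AdditiveMask` class below is a toy Slalom-like additive mask used purely for illustration, not GroupCover's actual algorithm:

```python
# Illustrative sketch only: the Slalom-style offload pipeline with the
# obfuscation scheme swappable, so different schemes can be compared under
# the exact same architecture. AdditiveMask is a toy stand-in.
import torch

class AdditiveMask:
    """Toy Slalom-like scheme: mask the input with random noise inside the TEE."""
    def mask(self, x):
        self.r = torch.randn_like(x)          # secret mask, kept inside the TEE
        return x + self.r
    def unmask(self, y_masked, weight):
        return y_masked - self.r @ weight     # (x + r)W - rW = xW

def secure_linear(x_tee, weight, scheme, device="cpu"):
    x_masked = scheme.mask(x_tee)                                # inside the TEE
    y_masked = (x_masked.to(device) @ weight.to(device)).cpu()   # on the untrusted GPU
    return scheme.unmask(y_masked, weight)                       # back inside the TEE

x, w = torch.randn(8, 512), torch.randn(512, 512)
assert torch.allclose(secure_linear(x, w, AdditiveMask()), x @ w, atol=1e-3)
```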
Concerning the impact of the TEE on inference performance, we conducted tests in an actual TEE environment. Considering possible security issues with SGX and its intrusiveness on user code, we strongly prefer to implement our scheme in a real TEE environment and test inference efficiency there. Limited by machine availability and time, we conducted partial experiments in a real TEE environment, with the GPU nodes below simulated by 64-core CPUs, hoping that our supplementary experimental data and the code added to the anonymous open-source repository can address your concerns regarding the implementation.
The trusted VM was created with `debug=False`. Although trusted VMs can be given much larger memory capacities, in order to observe the delay caused by processor context switching and memory copying, we aimed to allocate as little runtime memory to the trusted VM as possible. We therefore configured the trusted VM with `CPU=1, Memory=2G` as the TEE environment and loaded it with a transformer model of 200M parameters. Upon entering the trusted VM, we observed that only 700M of usable runtime memory remained, which can trigger the TEE's memory performance wall during inference.

```
+-----------------------------------+           +-----------------------------------+
| Trusted VM master          RANK=0 |           | GPU worker                 RANK=1 |
| CPU=1, MEM=2GB                    |           | a 64-core CPU terminal            |
|                                   |           |                                   |
| manage the secure inference:      |  virbr0   | rpc.init()                        |
|   rpc.init()                      | <-------> | wait for input:                   |
|   while model.forward:            |           |   run linear_layer(input)         |
|     if linear layer:              |           |   rpc.sync()                      |
|       rpc.sync(input, layer)      |           |                                   |
+-----------------------------------+           +-----------------------------------+
```
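For readers who want to reproduce the split above, here is a hedged sketch written against `torch.distributed.rpc`; the worker name, the address/port (the default libvirt `virbr0` host address is used as an example), and the plain `nn.Linear` layer are placeholders rather than the repository's actual code:

```python
# Sketch of the master/worker split in the diagram using torch.distributed.rpc.
# Names, address/port, and the nn.Linear layer are illustrative assumptions.
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "192.168.122.1")   # host side of virbr0 (example)
os.environ.setdefault("MASTER_PORT", "29500")

def linear_on_worker(x, weight, bias):
    # Runs on the GPU worker (RANK=1); falls back to CPU if no GPU is present.
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.nn.functional.linear(x.to(dev), weight.to(dev), bias.to(dev)).cpu()

def run_master():
    rpc.init_rpc("master", rank=0, world_size=2)
    layer = nn.Linear(512, 512)
    x = torch.randn(4, 512)
    # Offload only the linear layer; everything else stays inside the trusted VM.
    y = rpc.rpc_sync("gpu_worker", linear_on_worker,
                     args=(x, layer.weight.detach(), layer.bias.detach()))
    print(y.shape)
    rpc.shutdown()

def run_worker():
    rpc.init_rpc("gpu_worker", rank=1, world_size=2)
    rpc.shutdown()   # blocks and serves incoming RPCs until the master shuts down

if __name__ == "__main__":
    run_master() if os.environ.get("RANK", "0") == "0" else run_worker()
```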
Experimental results show that performing inference in a TEE environment does not significantly impact performance. The added offload layer does increase latency and decrease throughput, but it does not affect the comparison with the baseline (pure GPU inference). Taking the latency of pure GPU inference as a 1x baseline: for the RPC implementations of AlexNet-CIFAR10 and a 200M-parameter transformer translation task, simulated without a TEE on a physical machine with an RTX 3070, the latency of secure inference is 5x and 2x, respectively. Switching to a real TEE environment, these ratios become 6x and 3x. Increasing the number of layers in the transformer model leaves the ratios essentially unchanged.
The inference throughput in a real TEE environment can approach the baseline solution by adjusting the distributed inference architecture. During the experiments, we found that we were far from saturating the bandwidth of the network interface. Using the `nload` tool, we observed that the data rate through the `virbr0` interface during RPC calls was 110 MB/s, far below the 3 GB/s limit of the bridged interface. Using `htop`, we found that the CPU utilization of the trusted VM was maxed out while GPU utilization was lower than in the baseline. We therefore launched multiple trusted VMs and found that, within a suitable range, inference throughput increased roughly in proportion to the number of master nodes. Specifically, pure GPU inference throughput was about 4k; with 1 trusted VM as the master node, throughput was 1.2k, with 2 master nodes 2.1k, and with 3 nodes 2.7k. Thus, with reasonable computational resource allocation, the inference throughput of GroupCover can approach the baseline solution.
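Restating the reported numbers as a quick arithmetic check of the scaling claim:

```python
# Quick restatement of the reported throughput numbers: scaling vs. the number
# of trusted-VM master nodes, and the fraction of the pure-GPU baseline reached.
baseline = 4000                          # pure GPU inference throughput (~4k)
measured = {1: 1200, 2: 2100, 3: 2700}   # throughput with n trusted-VM masters

for n, t in measured.items():
    print(f"{n} master(s): {t / measured[1]:.2f}x of single-VM throughput, "
          f"{100 * t / baseline:.0f}% of the GPU baseline")
# -> 1.00x / 30%, 1.75x / 52%, 2.25x / 68%: roughly proportional scaling that
#    narrows the gap to the baseline as more masters share the GPU.
```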
We will further improve the efficiency of the GroupCover secure inference framework and update the experimental data in the future. During the tests in a real TEE environment, we found that the primary bottleneck for inference performance was RPC communication: neither the `runv` nor the `runc` implementation can use read/write engines such as RDMA, and in the frequent interactions between master and worker nodes, the latency introduced by IO is the main factor affecting inference performance. Moving forward, we hope to start from the GPU's UVM (Unified Virtual Memory) management mechanism to explore a more reasonable IO path between trusted VMs and GPUs. For instance, following the approach of the NVIDIA H100, masked data could be written to memory shared by the trusted VM and the host, and the GPU could then read and write this memory directly via DMA. IO would then go through memory rather than the network, significantly speeding up inference. If you are interested in our work, we welcome you to continue discussing future directions with us.
I don't know if this is the right place to ask, but I want to know how you simulated the TEE in this project.
Thanks