ZzzzMe / GroupCover

icml24
MIT License

How to simulate TEE? #1

Open mayank64ce opened 3 weeks ago

mayank64ce commented 3 weeks ago

I don't know if this is the right place to ask, but how did you simulate the TEE in this project?

Thanks

ZzzzMe commented 3 weeks ago

Our implementation is based on Intel's Trust Domain Extensions (TDX), which operates similarly to a standard virtual machine. Maybe you can try it through Azure; there it behaves like a normal guest VM. Getting it running on your own computer is hard work ^ ^; we got a lot of support from Intel. Below is the CPU info: [screenshot]. After following the setup steps (configuring virtualization components such as QEMU, OVMF, and Kata), you can observe within the guest environment that TDX memory encryption is active: [screenshot]. Regarding GPU communication, our paper details the approach we developed. You also need to create a bridge network for the guest VM.
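If it helps, here is a minimal sketch (our illustration, not code from this repo) of the kind of check we mean; it assumes you are inside the guest on a reasonably recent kernel, with permission to read the kernel log, where the tdx_guest CPU flag and the memory-encryption boot message are exposed:

    # Minimal sketch: check from inside the guest whether it looks like a TDX guest.
    # The flag name and dmesg wording can vary with the kernel version.
    import subprocess

    def guest_reports_tdx() -> bool:
        with open("/proc/cpuinfo") as f:
            if "tdx_guest" in f.read():      # CPU flag exposed to TDX guests
                return True
        # Fall back to the kernel log, which mentions memory encryption being active
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout.lower()
        return "tdx" in log and "memory encryption" in log

    if __name__ == "__main__":
        print("TDX guest detected:", guest_reports_tdx())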

ZzzzMe commented 2 weeks ago

Here is the implementation description for our rebuttal. I hope you find it helpful.

  1. Implementation: We also recognize the gap between simulation experiments and real TEE+GPU environments. We have three considerations to bridge this gap:

    1. The design of the simulation experiments is scalable, ensuring that the conclusions of the comparative experiments between GroupCover and other schemes remain consistent across different environments. Note that the schemes we compare against are built on the same architecture, namely Slalom [8]. In that scheme, SGX was used as the TEE of choice; by invoking SGX-related dynamic link libraries and implementing the IO methods in Python, Slalom achieved joint inference with SGX+GPU. Changing only the model obfuscation algorithm within the same architecture therefore allows a like-for-like comparison of the efficiency of GroupCover and similar schemes, whether on SGX + RTX 2080, a regular VM + RTX 3070, or the real TEE environment tested later. Thus, our simulation results hold across different environments.

    2. Concerning the impact of the TEE on inference performance, we conducted tests in an actual TEE environment. Considering possible security issues with SGX and how intrusive it is on user code, we strongly prefer to implement our scheme in a real TEE environment and test inference efficiency there. Limited by machine availability and time, we conducted partial experiments in a real TEE environment, with the GPU nodes below simulated by 64-core CPUs; we hope the supplementary experimental data and the code added to the anonymous open-source repository can address your concerns regarding the implementation.

      • We selected Intel TDX and AMD SEV as the TEE devices for implementing the overall inference framework. Intel TDX and AMD SEV represent the latest confidential computing technologies, providing hardware-isolated virtual machines (trusted VMs). Since both kinds of confidential VMs are actively supported by their manufacturers and regularly patched for security vulnerabilities, we do not consider the security issues and internal implementation of trusted VMs here. The machines are equipped with 256 CPU cores and 1 TB of memory. We then launched trusted VMs using qemu and set debug=False. Although trusted VMs could be given much larger memory, to observe the delay caused by processor context switching and memory copying we wanted to allocate as little runtime memory to the trusted VM as possible. We therefore configured the trusted VM with CPU=1, Memory=2G as the TEE environment and loaded a transformer model with 200M parameters. Inside the trusted VM, only about 700M of usable runtime memory remained, potentially triggering the TEE's memory performance wall during inference (a rough calculation is sketched just below).
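      As a quick sanity check on that memory budget (our own back-of-the-envelope numbers, not measurements from the repo), the weights of a 200M-parameter model alone are already close to the ~700M of free guest memory:

      # Rough sketch: model-weight footprint vs. free runtime memory in the guest.
      params = 200e6                          # 200M-parameter transformer
      free_mb = 700                           # approximate free memory inside the 2G guest
      for dtype, bytes_per_param in (("fp32", 4), ("fp16", 2)):
          weights_mb = params * bytes_per_param / 1e6
          print(f"{dtype}: ~{weights_mb:.0f} MB of weights vs ~{free_mb} MB free")
      # fp32: ~800 MB of weights vs ~700 MB free
      # fp16: ~400 MB of weights vs ~700 MB free

      If the weights are held in fp32, they alone already exceed the free memory, which is exactly the kind of pressure referred to above.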
      • We implemented the interaction logic between trusted VMs and GPUs using torch.rpc. Specifically, for trusted VMs and GPU nodes that can reach each other over the network, we designated the trusted VM as the rpc master node and the GPU as the worker node. We pre-initialized linear layers on the GPU, accepted inputs from the trusted VM, and returned the results to the trusted VM after the linear forward pass. This part of the interaction is provided in the open-source repository (a minimal sketch of the master/worker RPC loop is also shown after the diagram below).
      • We implemented secure inference on real TDX and SEV machines. Based on the rpc logic above, secure inference can be achieved as long as the trusted VM (master node) and a GPU (worker node) can establish an rpc connection. We explored two approaches to establishing the connection: running in a trusted VM (runv) and running in a confidential container (runc), with implementation details described in the code repository's README. Since network communication between containers passes through multiple proxy layers (including ingress, nginx, etc.), we deployed rpc-based secure inference in runv mode to achieve the best performance. The experimental setup is illustrated below, with the specific configuration available here.
      +-----------------------------------+         +-----------------------------------+
      |Trusted VM master                  |         |GPU worker                         |
      |                             RANK=0|         |                             RANK=1|
      |CPU=1, MEM=2GB                     |         |Here is a 64C terminal             |
      |                                   |         |                                   |
      |  manage the secure inference:     |  virbr0 |  rpc.init()                       |
      |  rpc.init()                       |<------->|  wait for input:                  |
      |  while model.forward:             |         |     run linear_layer(input)       |
      |      if linear layer:             |         |     rpc.sync()                    |
      |           rpc.sync(input, layer)  |         |                                   |
      +-----------------------------------+         +-----------------------------------+
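      A minimal sketch of this master/worker loop (our illustration rather than the exact repo code; the layer size, the ROLE environment variable, and the MASTER_ADDR/MASTER_PORT setup are assumptions):

      # Minimal torch.distributed.rpc sketch: the trusted VM (master) offloads a
      # linear layer to the GPU node (worker). Requires MASTER_ADDR/MASTER_PORT to
      # point at the master node; launch with ROLE=master on the trusted VM and
      # ROLE=worker on the GPU node.
      import os
      import torch
      import torch.distributed.rpc as rpc

      _gpu_linear = None                     # pre-initialized layer, lives on the worker

      def init_linear(in_features=512, out_features=512):
          """Runs on the worker: create the linear layer once, on GPU if available."""
          global _gpu_linear
          device = "cuda" if torch.cuda.is_available() else "cpu"
          _gpu_linear = torch.nn.Linear(in_features, out_features).to(device)

      def linear_forward(x):
          """Runs on the worker: forward the (masked) input and return the result."""
          device = next(_gpu_linear.parameters()).device
          return _gpu_linear(x.to(device)).cpu()

      if __name__ == "__main__":
          if os.environ.get("ROLE", "master") == "master":   # trusted VM, RANK=0
              rpc.init_rpc("master", rank=0, world_size=2)
              rpc.rpc_sync("worker", init_linear, args=(512, 512))
              x = torch.randn(8, 512)                        # masked activations go here
              y = rpc.rpc_sync("worker", linear_forward, args=(x,))
              print(y.shape)
              rpc.shutdown()
          else:                                              # GPU node, RANK=1
              rpc.init_rpc("worker", rank=1, world_size=2)
              rpc.shutdown()                                 # blocks until the master finishes

      In the full framework the master walks the model and only offloads the masked linear layers this way, as in the pseudocode in the diagram.
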
      • Experimental results show that performing inference in a TEE environment does not significantly impact performance. Adding an offload layer does increase latency and decrease throughput, but it does not change the comparison against the baseline (pure GPU inference). Taking the latency of pure GPU inference as a 1x baseline, the rpc implementations of AlexNet on CIFAR-10 and of a 200M-parameter transformer translation task, simulated without TEE on a physical machine with an RTX 3070, have secure-inference latencies of 5x and 2x respectively. Switching to a real TEE environment, these ratios become 6x and 3x. Increasing the number of layers in the transformer model leaves the ratios essentially unchanged.

      • The inference throughput in a real TEE environment can approach the baseline by adjusting the distributed inference architecture. During the experiments we found that we were far from saturating the bandwidth of the network interface: using the nload tool, we observed that the data rate through the virbr0 interface during rpc calls was about 110MB/s, far below the 3GB/s limit of the bridged interface. Using htop, we found that the CPU of the trusted VM was fully utilized while the GPU utilization was lower than in the baseline. We therefore launched multiple trusted VMs and found that, within a suitable range, the inference throughput increased roughly proportionally with the number of master nodes: pure GPU inference throughput was about 4k, with 1 trusted VM as master the throughput was 1.2k, with 2 master nodes it was 2.1k, and with 3 nodes it was 2.7k (a quick breakdown is sketched below). Thus, with reasonable computational resource allocation, the inference throughput of GroupCover can approach the baseline solution.
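      Reading those numbers another way (a quick computation over the figures quoted above, nothing new measured):

      # Throughput relative to the pure-GPU baseline and per trusted-VM master.
      baseline = 4000                           # pure GPU inference, ~4k
      measured = {1: 1200, 2: 2100, 3: 2700}    # throughput vs. number of master nodes
      for vms, tput in measured.items():
          print(f"{vms} master(s): {tput / baseline:.1%} of baseline, ~{tput / vms:.0f} per VM")
      # 1 master(s): 30.0% of baseline, ~1200 per VM
      # 2 master(s): 52.5% of baseline, ~1050 per VM
      # 3 master(s): 67.5% of baseline, ~900 per VM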

    3. We will further improve the efficiency of the GroupCover secure inference framework and update the experimental data in the future. During tests in a real TEE environment, we found that the primary bottleneck for inference performance is rpc communication. Neither the runv nor the runc implementation can use read/write engines like RDMA, so in the frequent interactions between master and worker nodes the latency introduced by IO is the main factor affecting inference performance. Going forward, we hope to start from the GPU's UVM (Unified Virtual Memory) management mechanism to explore a more reasonable IO path between trusted VMs and GPUs. For instance, following the approach of the Nvidia H100, masked data could be written to memory shared by the trusted VM and the host, and the GPU could then read from and write to this memory directly via DMA (a conceptual sketch is appended at the end of this comment). This way IO would go through memory rather than the network, significantly speeding up inference. If you are interested in our work, we welcome further discussion.
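As a postscript, here is a conceptual sketch of that memory-based IO path (purely illustrative: real trusted-VM/host sharing would go through a QEMU shared-memory device rather than multiprocessing.shared_memory, and the buffer name and shape here are made up):

    # Conceptual sketch only: exchange masked activations through a shared memory
    # region instead of the virtual network; multiprocessing.shared_memory stands
    # in for a real trusted-VM/host shared region.
    import numpy as np
    from multiprocessing import shared_memory

    SHAPE, DTYPE = (8, 512), np.float32        # hypothetical activation shape

    def write_masked(masked: np.ndarray, name: str = "masked_acts"):
        """Trusted-VM side: publish masked activations into the shared buffer."""
        shm = shared_memory.SharedMemory(name=name, create=True, size=masked.nbytes)
        np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)[:] = masked
        return shm                             # keep alive; close()/unlink() when done

    def read_masked(name: str = "masked_acts") -> np.ndarray:
        """Host/GPU side: map the same region; a GPU could DMA from such a buffer."""
        shm = shared_memory.SharedMemory(name=name)
        out = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf).copy()
        shm.close()
        return out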