NVIDIA / nvtrust

Ancillary open source software to support confidential computing on NVIDIA GPUs
Apache License 2.0

Intel TDX + H100 D2D cudaMemCpy Overhead #42

Open apoorvemohan opened 4 months ago

apoorvemohan commented 4 months ago

Hello, when performing D2D memory copy operations (e.g., cudaMemcpy(d_odata, d_idata, memSize, cudaMemcpyDeviceToDevice)), we observe up to ~80% more time in CC mode compared to non-CC mode for small message sizes on an Intel TDX + H100 system. Is this expected? Is there a way to profile the code to figure out what might be causing the slowdown in CC mode (we tried nsys profile, but it did not capture GPU-side calls in CC mode)? Thanks!
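
For reference, here is a minimal sketch of the kind of timing loop we use (the buffer names match the call above, but the CUDA-event timing shown here is illustrative, not our exact benchmark code):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t memSize = 1 << 10;               // small message size, e.g. 1 KiB
    void *d_idata, *d_odata;
    cudaMalloc(&d_idata, memSize);
    cudaMalloc(&d_odata, memSize);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_odata, d_idata, memSize, cudaMemcpyDeviceToDevice);  // D2D copy on the same GPU
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average D2D copy time: %.3f us\n", 1000.0f * ms / iters);

    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}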

Tan-YiFan commented 4 months ago

nsys profile, but it did not capture GPU-side calls in CC mode

Setting the CC mode to devtools might help.

rnertney commented 2 weeks ago

Please note the bounce-buffer requirements for non-TDISP-enabled systems (such as Sapphire Rapids and Emerald Rapids).

The TDX Module locks down the IOMMU so that nothing can access CVM memory pages. The CVM "opens" a shared page which is accessible by the host/infra/GPUs, etc.

The CVM performs the encryption/signing of the payload, copies that payload into this "bounce buffer", and then triggers the GPU to perform the DMA copy, after which the GPU decrypts and processes it. The process is reversed in the device-to-host direction.

Performance is generally limited on the host side by how fast the CPU can encrypt/decrypt. Our testing shows many of the SPR 8480s maxing out at 4-5 GB/s.

This overhead is most pronounced with smaller payloads, or with operations that are I/O bound rather than compute bound. When you have more compute relative to I/O (like an LLM), performance is near the theoretical non-CC speeds.
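
To see where that ceiling shows up from the application side, here is a minimal sketch of a pinned-memory H2D bandwidth check (the buffer names and the 256 MiB size are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 256UL << 20;   // 256 MiB transfer
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);      // pinned host memory; staged through the bounce buffer in CC mode
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

In CC mode on a non-TDISP system, this number tends to track the CPU encrypt/decrypt rate rather than the PCIe link rate.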

Tan-YiFan commented 2 weeks ago

@rnertney This issue seems to be about a D2D copy. The data would not travel over PCIe.

@apoorvemohan Could you give more detailed information? When memSize is small, this API takes about 2 us in a non-CC environment. Do you mean that under CC it takes about 3.6 us?

rnertney commented 2 weeks ago

@rnertney This issue seems to be about a D2D copy. The data would not travel over PCIe.

@apoorvemohan Could you give more detailed information? When memSize is small, this API takes about 2 us in a non-CC environment. Do you mean that under CC it takes about 3.6 us?

Currently, PeerAccess is not supported in our stack.

Tan-YiFan commented 2 weeks ago

@rnertney This issue seems to be about a D2D copy. The data would not travel over PCIe. @apoorvemohan Could you give more detailed information? When memSize is small, this API takes about 2 us in a non-CC environment. Do you mean that under CC it takes about 3.6 us?

Currently, PeerAccess is not supported in our stack.

It's a D2D copy on the same GPU, not P2P.

rnertney commented 2 weeks ago

That's what I get for trying to read too quickly :) I'll reopen and see what we see internally.

rnertney commented 2 weeks ago

We do not expect these on-GPU copies to have any performance impact; once the data/workload is in the GPU, we expect to run at full fmax.

Tan-YiFan commented 2 weeks ago

@rnertney It is possibly the overhead of the CUDA library on the CPU. The HCC whitepaper says:

There is also an overhead for encrypting GPU command buffers, synchronization primitives, exception metadata, and other internal driver data.

The following example evaluates a 1-byte H2D memcpy. Running it in CC and non-CC environments should show the difference. The encryption overhead in the CC environment is expected to be small because the data is only 1 byte.

#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

int main()
{
    void *x, *y;
    const size_t size = 0x1000;
    cudaMallocHost(&x, size);   // pinned host buffer
    cudaMalloc(&y, size);       // device buffer

    const int iters = 100000;
    long long sum_ns = 0;       // accumulated launch time, in nanoseconds
    for (int i = 0; i < iters; ++i) {
        // Time only the CPU-side launch of the async copy, not the transfer itself.
        auto start = std::chrono::steady_clock::now();
        cudaMemcpyAsync(y, x, 1, cudaMemcpyHostToDevice);   // 1-byte H2D copy
        auto end = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();   // wait for the copy before the next iteration
        sum_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }
    std::cout << (sum_ns / 1000.0) / iters << " us per call" << std::endl;

    cudaFree(y);
    cudaFreeHost(x);
    return 0;
}
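
To compare, compile the example with nvcc (for example, nvcc -O2 memcpy_bench.cu -o memcpy_bench; the file name here is arbitrary) and run the binary in both the CC and non-CC environments; the difference in the reported per-call time reflects the CPU-side overhead of the CUDA launch path under CC.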