Open apoorvemohan opened 4 months ago
nsys profile but it did not capture GPU-side calls in CC mode
Setting the CC mode to devtools might help.
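For reference, a sketch of how the CC mode switch is typically done with NVIDIA's gpu-admin-tools (the BDF `45:00.0` is a placeholder for your GPU's PCI address; verify the exact flags against the tool's `--help` and the Confidential Computing deployment guide for your driver version):

```shell
# Put the GPU into devtools CC mode so profilers like nsys can see GPU-side activity.
# Requires NVIDIA's gpu-admin-tools; run from the repo checkout with root privileges.
sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=45:00.0 \
    --set-cc-mode=devtools --reset-after-cc-mode-switch
```

Switch back with `--set-cc-mode=on` once profiling is done.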
Please note the bounce-buffer requirements for non-TDISP-enabled systems (such as Sapphire Rapids and Emerald Rapids).
The TDX module locks the IOMMU out of any access to CVM memory pages. The CVM "opens" a shared page which is accessible to the host/infra/GPUs, etc.
The CVM encrypts/signs the payload, copies it into this "bounce buffer", and triggers the GPU to perform the DMA copy; the GPU then decrypts and processes the data. The process is reversed in the device-to-host direction.
Performance is generally limited on the host side by how fast the CPU can encrypt/decrypt. Our testing shows many of the SPR 8480 parts maxing out at 4-5 GByte/second.
This overhead is most pronounced with smaller payloads, or with operations that are I/O-bound rather than compute-bound. When the compute-to-I/O ratio is high (as with an LLM), performance is near the theoretical non-CC speeds.
@rnertney This issue seems to concern a D2D copy; the data would not travel over PCIe.
@apoorvemohan Could you give more detailed information? When memSize is small, this API takes about 2 us in a non-CC environment. Do you mean that on CC it takes about 3.6 us?
Currently, PeerAccess is not supported in our stack.
It's a D2D copy on the same GPU, not P2P.
That's what I get for trying to read too quickly :) I'll reopen and see what we find internally.
We do not expect any inter-GPU copies to have performance impact; once the data/workload are in the GPU, we expect to run at full fmax.
@rnertney It is possibly overhead from the CUDA library on the CPU side. The HCC whitepaper says:
There is also an overhead for encrypting GPU command buffers, synchronization primitives, exception metadata, and other internal driver data.
The following example measures a 1-byte H2D memcpy. Running it in CC and non-CC environments should show the difference. The encryption overhead in the CC environment is expected to be small because the payload is only 1 byte.
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>
int main()
{
    void *x, *y;
    const size_t size = 0x1000;
    cudaMallocHost(&x, size); // pinned host buffer
    cudaMalloc(&y, size);     // device buffer
    long long sum = 0;
    int i = 0;
    int rounds = 0;
    while (true) {
        // Time only the CPU-side cost of issuing the async copy;
        // the synchronize afterwards keeps the queue from backing up.
        auto start = std::chrono::steady_clock::now();
        cudaMemcpyAsync(y, x, 1, cudaMemcpyHostToDevice);
        auto end = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();
        sum += std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        if (++i == 100000) {
            std::cout << ' ' << sum / i << "us" << std::endl;
            sum = i = 0;
            if (++rounds == 20) break;
        }
    }
    cudaFree(y);
    cudaFreeHost(x);
    return 0;
}
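One way to build and run the snippet above, assuming it is saved as memcpy_overhead.cu and nvcc is on PATH (the file name is illustrative):

```shell
# Build with the CUDA toolkit and run on the target (CC or non-CC) system.
nvcc -O2 memcpy_overhead.cu -o memcpy_overhead
./memcpy_overhead
```

Comparing the printed per-call averages between CC and non-CC boots isolates the CPU-side launch overhead from GPU execution time.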
Hello, when performing D2D memory copy operations (e.g., cudaMemcpy(d_odata, d_idata, memSize, cudaMemcpyDeviceToDevice)), we observe that they take up to ~80% more time in CC mode compared to non-CC mode for small message sizes on an Intel TDX + H100 system. Is this expected? Is there a way to profile the code to figure out what might be causing the slowdown in CC mode (we tried nsys profile but it did not capture GPU-side calls in CC mode)? Thanks!