Open masaruito110 opened 5 months ago
I ran a simple app that performs heavy memory copies while the frame builder was running.
The result is similar to that of multiple sessions.
env | process | session/process | Gbps/session |
---|---|---|---|
1 | 1 | 1 | 18 |
DOCA seems to be affected by other copy-heavy kernels running on the same GPU.
The simple app is below.
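#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

// Generates continuous device-to-device copy traffic, chunk by chunk, forever.
// Calling cudaMemcpyAsync from device code requires CUDA dynamic parallelism
// (compile with nvcc -rdc=true).
__global__ void heavy_memcpy(uint8_t* dst, uint8_t* src, size_t chunk, size_t frame_size)
{
    size_t cnt = 0;
    while (true) {
        cnt++;
        // Report progress once every 1000 passes, from a single thread.
        if (cnt % 1000 == 0 && threadIdx.x == 0) {
            printf("copying %llu\n", (unsigned long long)cnt);
        }
        // Each thread copies every blockDim.x-th chunk of the frame.
        for (size_t i = threadIdx.x; i < frame_size / chunk - 1; i += blockDim.x) {
            cudaMemcpyAsync(dst + i * chunk, src + i * chunk, chunk, cudaMemcpyDeviceToDevice);
        }
    }
}

void heavy_memcpy_cpu()
{
    uint8_t* dst = nullptr;
    uint8_t* src = nullptr;
    size_t frame_size = (size_t)4 * 1024 * 1024 * 1024;  // 4 GiB per buffer
    size_t chunk = 8000;                                  // bytes per copy
    // Two 4 GiB buffers are needed; stop early if the GPU cannot provide them.
    if (cudaMalloc((void**)&dst, frame_size) != cudaSuccess ||
        cudaMalloc((void**)&src, frame_size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return;
    }
    heavy_memcpy<<<1, 1024>>>(dst, src, chunk, frame_size);
    cudaDeviceSynchronize();  // blocks indefinitely because the kernel never exits
}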
Purpose
I measured the performance of docagpunetio at commit 7de28d5663556a789b1366660e2bd53b250f69b1.
Current server structure
Environment
Environment is the same as https://github.com/fixstars/lightning-kit/issues/10
Result
Unlike in https://github.com/fixstars/lightning-kit/issues/10, we cannot set the chunk size here because UDP does not use ACKs. Hence, we only report the maximum performance. The trend looks the same as in https://github.com/fixstars/lightning-kit/issues/10.
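For readers comparing numbers, here is a minimal sketch of how a Gbps/session figure like the one in the table can be derived from a byte count and an elapsed time. It assumes decimal units (1 Gbit = 10^9 bits) and that only received payload bytes are counted. The helper name `to_gbps` is hypothetical, and this is not the measurement code used in lightning-kit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert bytes transferred over a time window
// into gigabits per second, assuming 1 Gbit = 1e9 bits.
double to_gbps(std::uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);  // an empty window has no defined rate
    return static_cast<double>(bytes) * 8.0 / seconds / 1e9;
}
```

Under these assumptions, a session that receives 2.25 GB in one second works out to 18 Gbps, which matches the order of magnitude reported above.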