fixstars / lightning-kit

Reactive data transfer framework

DOCA UDP frame builder performance improvement #12

Open masaruito110 opened 5 months ago

masaruito110 commented 5 months ago

Purpose

Based on commit 7de28d5663556a789b1366660e2bd53b250f69b1, I measured the performance of DOCA GPUNetIO.

Current server structure

```
|doca flow| ---------> |frame builder|
```

frame builder structure

```
|receive_tcp| <--semaphore--> |makeframe| <--semaphore--> |notify frame built|
```
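
The stages above hand work to each other through semaphores. As a rough illustration of that hand-off pattern (generic CUDA with libcu++ atomics, not the actual DOCA GPUNetIO semaphore API; all names here are hypothetical), each stage spins on a per-slot flag set by the previous stage and then releases the next one:

```cpp
#include <cuda/atomic>

// Hypothetical states for one pipeline slot.
enum SlotState { SLOT_FREE = 0, SLOT_RECEIVED = 1, SLOT_BUILT = 2 };

// Device-scoped atomic flag standing in for one semaphore entry.
using Sem = cuda::atomic<int, cuda::thread_scope_device>;

// Sketch of the middle stage: wait for the receive stage, build the
// frame, then signal the "notify frame built" stage.
__device__ void makeframe_stage(Sem* sem, int slot)
{
    // Spin until the receive stage marks the slot ready (acquire pairs
    // with the producer's release store).
    while (sem[slot].load(cuda::std::memory_order_acquire) != SLOT_RECEIVED) { }

    // ... assemble the frame from the received packets here ...

    // Release the slot to the next stage.
    sem[slot].store(SLOT_BUILT, cuda::std::memory_order_release);
}
```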

Environment

The environment is the same as in https://github.com/fixstars/lightning-kit/issues/10.

Result

The difference from https://github.com/fixstars/lightning-kit/issues/10 is that we cannot set the chunk size, because UDP does not acknowledge received data. Hence, we only show the maximum performance. The trend looks the same as in https://github.com/fixstars/lightning-kit/issues/10.

| env | process | sessions/process | Gbps/session |
|-----|---------|------------------|--------------|
| 1   | 1       | 1                | 43           |
| 2   | 2       | 1                | 18           |
masaruito110 commented 5 months ago

I ran a simple app that performs heavy copies while the frame builder runs.

The result is similar to that of the multiple-session case.

| env | process | sessions/process | Gbps/session |
|-----|---------|------------------|--------------|
| 1   | 1       | 1                | 18           |

DOCA GPUNetIO throughput seems to be affected by other heavy-copy kernels.

The simple app is below.

```cpp
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

// Persistent kernel that endlessly copies chunk-sized pieces of a frame,
// keeping the GPU's copy path busy while the frame builder runs.
// Device-side cudaMemcpyAsync comes from the dynamic-parallelism device
// runtime, so this file must be compiled with relocatable device code.
__global__ void heavy_memcpy(uint8_t* dst, uint8_t* src, size_t chunk, size_t frame_size)
{
    size_t cnt = 0;
    while (true) {
        cnt++;
        // Progress heartbeat: print once every 1000 iterations.
        if (cnt % 1000 == 0 && threadIdx.x == 0) {
            printf("copying %llu\n", (unsigned long long)cnt);
        }
        // Each thread copies a strided subset of the chunks.
        for (size_t i = threadIdx.x; i < frame_size / chunk - 1; i += blockDim.x) {
            cudaMemcpyAsync(dst + i * chunk, src + i * chunk, chunk, cudaMemcpyDeviceToDevice);
        }
    }
}

void heavy_memcpy_cpu()
{
    uint8_t* dst;
    uint8_t* src;

    size_t frame_size = (size_t)4 * 1024 * 1024 * 1024; // 4 GiB
    size_t chunk = 8000;                                // bytes per copy

    cudaMalloc((void**)&dst, frame_size);
    cudaMalloc((void**)&src, frame_size);

    // One block of 1024 threads; the kernel loops forever, so this
    // synchronize never returns and the copy load stays on the GPU.
    heavy_memcpy<<<1, 1024>>>(dst, src, chunk, frame_size);
    cudaDeviceSynchronize();
}
```
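
For reference, here is a minimal sketch of how this snippet might be driven. The `main` entry point and the build line are assumptions, not part of the original comment; since the kernel calls `cudaMemcpyAsync` from device code, it relies on the dynamic-parallelism device runtime, whose availability varies across CUDA toolkit versions.

```cpp
// Hypothetical driver, assuming the snippet above is saved as heavy_memcpy.cu.
// Build sketch (assumed flags): nvcc -rdc=true -arch=sm_80 heavy_memcpy.cu -o heavy_memcpy
// -rdc=true enables relocatable device code, required for device-side
// calls into the CUDA device runtime such as cudaMemcpyAsync.
int main()
{
    heavy_memcpy_cpu(); // never returns: the kernel loops forever
    return 0;
}
```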