fixstars / lightning-kit

Reactive data transfer framework

doca udp frame builder performance improvement #12

Open masaruito110 opened 5 months ago

masaruito110 commented 5 months ago


Based on commit 7de28d5663556a789b1366660e2bd53b250f69b1, I measured the performance of docagpunetio.

Current server structure

|doca flow| ---------> |frame builder|

frame builder structure

|receive_tcp|<--semaphore-->|makeframe|<--semaphore-->|notify frame built|


Environment is the same as


The difference is that the chunk size cannot be set, because UDP does not check ACKs. Hence, only the maximum performance is shown. The trend looks the same as

env  process  session/process  Gbps/session
1    1        1                43
2    2        1                18

masaruito110 commented 5 months ago

I ran a simple app that performs heavy copies while the frame builder runs.

The result is similar to that of multiple sessions.

env  process  session/process  Gbps/session
1    1        1                18

DOCA seems to be affected by other copy-heavy kernels running on the same GPU.

The simple app is below.

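// Build note: the device-side cudaMemcpyAsync call needs relocatable device code,
// e.g. nvcc -rdc=true heavy_memcpy.cu -lcudadevrt (the file name is an assumption).
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

// Copies src to dst chunk by chunk, forever, to keep the GPU busy with copies.
__global__ void heavy_memcpy(uint8_t* dst, uint8_t* src, size_t chunk, size_t frame_size)
{
    size_t cnt = 0;
    while (true) {
        // Report progress from a single thread every 1000 passes.
        if (cnt % 1000 == 0 && threadIdx.x == 0) {
            printf("copying %llu\n", (unsigned long long)cnt);
        }
        // Each thread issues device-to-device copies for its share of the chunks.
        for (size_t i = threadIdx.x; i < frame_size / chunk - 1; i += blockDim.x) {
            cudaMemcpyAsync(dst + i * chunk, src + i * chunk, chunk, cudaMemcpyDeviceToDevice);
        }
        ++cnt;
    }
}

void heavy_memcpy_cpu()
{
    uint8_t* dst;
    uint8_t* src;

    size_t frame_size = (size_t)4 * 1024 * 1024 * 1024;  // 4 GiB per buffer
    size_t chunk = 8000;                                  // bytes per copy

    cudaMalloc((void**)&dst, frame_size);
    cudaMalloc((void**)&src, frame_size);

    // One block of 1024 threads; the kernel never returns.
    heavy_memcpy<<<1, 1024>>>(dst, src, chunk, frame_size);
}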