PaddleJitLab / CUDATutorial

A self-learning tutorial for CUDA High Performance Programming.
Apache License 2.0

Shared memory in the matrix-multiplication optimization leads to different results #53

Closed: muyuuuu closed this issue 2 weeks ago

muyuuuu commented 2 weeks ago

Question 1

There are two ways to create shared memory. The example uses the static one, so I figured I would try the dynamic one as an exercise, but the result turns out different. Could you help me understand why?

#include <stdio.h>
#include <cuda_runtime.h>

#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))

void sgemm_naive_cpu(float *A, float *B, float *C, int M, int N, int K)
{
    for (int x = 0; x < M; x++)
    {
        for (int y = 0; y < N; y++)
        {
            float sum = 0.0f;
            for (int i = 0; i < K; i++)
            {
                sum += A[x * K + i] * B[i * N + y];
            }
            C[x * N + y] = sum;
        }
    }
}

template <typename T>
__device__ T *SharedMemoryProxy()
{
    extern __shared__ unsigned char memory[];
    return reinterpret_cast<T *>(memory);
}

template <const int BLOCKSIZE>
__global__ void sgemm_shared_mem_kernel(float *A, float *B, float *C, int M, int N, int K)
{
    // the output block that we want to compute in this threadblock
    const uint c_row = blockIdx.x;
    const uint c_col = blockIdx.y;

    // allocate shared memory for the input and output submatrices
    // __shared__ float A_shared[BLOCKSIZE * BLOCKSIZE];
    // __shared__ float B_shared[BLOCKSIZE * BLOCKSIZE];

    auto A_shared = SharedMemoryProxy<float>();
    auto B_shared = SharedMemoryProxy<float>();

    // the inner row & col that we're accessing in this thread
    const uint thread_row = threadIdx.x / BLOCKSIZE;
    const uint thread_col = threadIdx.x % BLOCKSIZE;

    // advance pointers to the starting positions
    A += c_row * BLOCKSIZE * K;
    B += c_col * BLOCKSIZE;
    C += c_row * BLOCKSIZE * N + c_col * BLOCKSIZE;

    float tmp = 0.0f;
    for (int i = 0; i < K; i += BLOCKSIZE)
    {
        // load the next block of the input matrices into shared memory
        A_shared[thread_row * BLOCKSIZE + thread_col] = A[thread_row * K + thread_col];
        B_shared[thread_row * BLOCKSIZE + thread_col] = B[thread_row * N + thread_col];

        // wait for all threads to finish loading
        __syncthreads();

        // compute the partial sum
        for (int j = 0; j < BLOCKSIZE; j++)
        {
            tmp += A_shared[thread_row * BLOCKSIZE + j] * B_shared[j * BLOCKSIZE + thread_col];
        }

        // wait for all threads to finish computing
        __syncthreads();

        // advance the pointers
        A += BLOCKSIZE;
        B += BLOCKSIZE * N;
    }

    C[thread_row * N + thread_col] = tmp;
}

void run_sgemm_shared_memory(float *A, float *B, float *C, int m, int n, int k)
{
    const int BLOCKSIZE = 32;
    dim3 block_size(BLOCKSIZE * BLOCKSIZE);
    dim3 grid_size(CEIL_DIV(m, BLOCKSIZE), CEIL_DIV(n, BLOCKSIZE));
    sgemm_shared_mem_kernel<BLOCKSIZE><<<grid_size, block_size, sizeof(float) * BLOCKSIZE * BLOCKSIZE>>>(A, B, C, m, n, k);
}

void randomize_matrix(float *mat, int N)
{
    for (int i = 0; i < N; i++)
    {
        mat[i] = rand() % 100;
    }
}

int main()
{
    int m = 256;
    int n = 256;
    int k = 512;

    // Allocate memory for matrices
    float *A, *B, *C, *C_ref;
    float *d_A, *d_B, *d_C;

    A = new float[m * k];
    B = new float[k * n];
    C = new float[m * n];
    // save reference result
    C_ref = new float[m * n];

    // Initialize matrices
    randomize_matrix(A, m * k);
    randomize_matrix(B, k * n);

    // Allocate device memory
    cudaMalloc((void **)&d_A, m * k * sizeof(float));
    cudaMalloc((void **)&d_B, k * n * sizeof(float));
    cudaMalloc((void **)&d_C, m * n * sizeof(float));


    // Copy matrices to device
    cudaMemcpy(d_A, A, m * k * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, k * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, C, m * n * sizeof(float), cudaMemcpyHostToDevice);

    run_sgemm_shared_memory(d_A, d_B, d_C, m, n, k);

    // Copy result to host
    cudaMemcpy(C, d_C, m * n * sizeof(float), cudaMemcpyDeviceToHost);

    // Run reference sgemm
    sgemm_naive_cpu(A, B, C_ref, m, n, k);

    // Verify result
    for (int i = 0; i < m * n; i++)
    {
        if (C[i] != C_ref[i])
        {
            printf("Error: mismatch at index %d, expected %f, got %f\n", i, C_ref[i], C[i]);
            return 1;
        }
    }

    printf("Success!\n");
    return 0;
}

Question 2

Even with the dynamic shared-memory optimization, the number of global-memory accesses seems the same as in the naive implementation. So on top of the same number of global accesses, it adds synchronization and shared-memory accesses. Why is it faster? I really don't understand this part.

My own dynamic shared-memory implementation is actually about twice as slow as the naive one, which puzzles me. Code link: https://github.com/muyuuuu/CUFX/blob/develop/CUFX/src/gemm/src/gemm.cu#L33-L67

If you want to run it, go back to the repo root and run:

bash scripts/build.sh
bash scripts/run.sh

Is it because the block is two-dimensional, leading to a low cache hit rate? Even after switching to a one-dimensional block, it is barely on par with the naive version.

AndSonder commented 2 weeks ago

SharedMemoryProxy() returns the same shared-memory base address on every call, which means A_shared and B_shared actually point to the same shared-memory region. Writing to one of them overwrites the other's data.

You need to lay out the shared memory manually so that A_shared and B_shared point to different regions:

template <const int BLOCKSIZE>
__global__ void sgemm_shared_mem_kernel(float *A, float *B, float *C, int M, int N, int K)
{
    // the output tile of C computed by this thread block
    const uint c_row = blockIdx.x;
    const uint c_col = blockIdx.y;

    // declare the dynamic shared memory and partition it between the A and B tiles
    extern __shared__ float shared_mem[];
    float* A_shared = shared_mem;
    float* B_shared = &shared_mem[BLOCKSIZE * BLOCKSIZE];

    // the row & col this thread handles inside the tile
    const uint thread_row = threadIdx.x / BLOCKSIZE;
    const uint thread_col = threadIdx.x % BLOCKSIZE;

    // the rest of the kernel stays the same
    // ...
}

In addition, enough shared memory has to be allocated for both A_shared and B_shared: the third launch parameter (the dynamic shared-memory size) must be the sum of the two.

void run_sgemm_shared_memory(float *A, float *B, float *C, int m, int n, int k)
{
    const int BLOCKSIZE = 32;
    dim3 block_size(BLOCKSIZE * BLOCKSIZE);
    dim3 grid_size(CEIL_DIV(m, BLOCKSIZE), CEIL_DIV(n, BLOCKSIZE));
    size_t shared_mem_size = sizeof(float) * BLOCKSIZE * BLOCKSIZE * 2; // space for both A_shared and B_shared
    sgemm_shared_mem_kernel<BLOCKSIZE><<<grid_size, block_size, shared_mem_size>>>(A, B, C, m, n, k);
}
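
If you'd still like to keep the templated helper, one possible variant is to hand it an element offset so that both tiles are carved out of the same dynamic allocation (just a sketch of mine; the offset_elems parameter is not in the repo, and the launch must still request space for both tiles as above):

template <typename T>
__device__ T *SharedMemoryProxy(size_t offset_elems = 0)
{
    // all dynamic shared memory of a block is a single buffer; carve it up by offset
    extern __shared__ unsigned char memory[];
    return reinterpret_cast<T *>(memory) + offset_elems;
}

// inside the kernel:
auto A_shared = SharedMemoryProxy<float>();                      // first tile
auto B_shared = SharedMemoryProxy<float>(BLOCKSIZE * BLOCKSIZE); // second tile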

Question 2: are you asking why shared memory didn't give you a speedup? I'd suggest comparing your implementation against the one in this repo.

muyuuuu commented 2 weeks ago

Question 1: understood, thanks!

Question 2: setting the code aside, why, in theory, does shared memory improve speed? The naive implementation and the shared-memory implementation access global memory the same number of times. On top of that, the shared-memory version adds shared-memory accesses and thread synchronization, so in theory its overhead should be larger than the naive version's.

AndSonder commented 2 weeks ago

> Question 1: understood, thanks!
>
> Question 2: setting the code aside, why, in theory, does shared memory improve speed? The naive implementation and the shared-memory implementation access global memory the same number of times. On top of that, the shared-memory version adds shared-memory accesses and thread synchronization, so in theory its overhead should be larger than the naive version's.

In general, the total number of memory accesses goes down once shared memory is used, and the number of global-memory accesses in particular drops significantly. Shared memory caches a portion of the data so that many threads can reuse it without reloading it from global memory every time. I'm not sure how you concluded that the global-memory access counts are the same.

In matrix multiplication (GEMM), each data tile (e.g., a row of A or a column of B) is used many times during the computation. Reading directly from global memory means reloading that data on every use. With shared memory, the tile is loaded once into shared memory and then reused by many threads, which reduces the traffic to global memory.
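
A rough back-of-the-envelope count makes this concrete (my own sketch, assuming M, N and K are multiples of BLOCKSIZE):

naive kernel: each of the M * N threads loads K elements of A and K elements of B
              -> roughly 2 * M * N * K global-memory loads

tiled kernel: each of the (M / BLOCKSIZE) * (N / BLOCKSIZE) blocks runs K / BLOCKSIZE iterations,
              loading one BLOCKSIZE x BLOCKSIZE tile of A and one of B per iteration
              -> roughly 2 * M * N * K / BLOCKSIZE global-memory loads

So the tiled version touches global memory about BLOCKSIZE (here 32) times less often, and the extra shared-memory reads and __syncthreads() calls are much cheaper than the global loads they replace.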

muyuuuu commented 2 weeks ago

> I'm not sure how you concluded that the global-memory access counts are the same.

That was my misunderstanding. Assume src1 is [M, K] and src2 is [K, N]

Thanks!

muyuuuu commented 2 weeks ago

As for the timing question: once I made the matrices larger, the shared-memory speedup became obvious. The earlier matrices were probably just too small. All done, hooray~
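
For what it's worth, here is a minimal timing sketch using CUDA events (my own addition, not from the linked repo); very small matrices are dominated by launch overhead, so averaging several launches on larger sizes gives a cleaner comparison:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int i = 0; i < 10; i++)  // average over several launches
{
    run_sgemm_shared_memory(d_A, d_B, d_C, m, n, k);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // total milliseconds for the 10 launches
printf("avg kernel time: %.3f ms\n", ms / 10.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);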