Bruce-Lee-LY / cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
MIT License

Cooperative Async Copies #5

FabianSchuetze closed this issue 10 months ago

FabianSchuetze commented 10 months ago

Thanks for this wonderful repo.

I have a question about the async copies:

uint32_t A_smem_lane_addr =
   __cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;

CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);

Does this mean that every lane (thread) has a different pointer to the shared memory and a different pointer to the global memory?

As I understand async copies, the src and dst pointers must be the same for every thread in the thread block. See the docs.

Bruce-Lee-LY commented 10 months ago

You should refer to the docs.
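To make the question concrete: as far as I read the PTX documentation, cp.async is issued by each thread on its own, so every lane passes its own source and destination address. The "same arguments for all threads" rule belongs, I believe, to the collective cooperative_groups::memcpy_async, not to the per-thread instruction. Below is a rough host-side sketch of the lane arithmetic from the snippet. The constant values, ROW_BYTES, and the lane_row mapping are assumptions made up for illustration (the thread does not show how A_smem_idx is computed); only the modulo expression comes from the quoted code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// All constants are assumed values for illustration; the repository defines its own.
constexpr uint32_t WARP_SIZE = 32;
constexpr uint32_t THREAD_COPY_BYTES = 16;     // cp.async.cg copies 16 bytes per thread
constexpr uint32_t CHUNK_COPY_LINE_LANES = 4;  // hypothetical: lanes that share one shared-memory row
constexpr uint32_t ROW_BYTES = CHUNK_COPY_LINE_LANES * THREAD_COPY_BYTES;  // hypothetical row size

// Byte offset within a row, mirroring
// (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES from the snippet.
constexpr uint32_t lane_byte_offset(uint32_t lane_id) {
    return (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
}

// Hypothetical stand-in for A_smem_idx: each group of
// CHUNK_COPY_LINE_LANES consecutive lanes takes the next row.
constexpr uint32_t lane_row(uint32_t lane_id) {
    return lane_id / CHUNK_COPY_LINE_LANES;
}

// Destination byte address relative to the start of the tile.
constexpr uint32_t lane_dst(uint32_t lane_id) {
    return lane_row(lane_id) * ROW_BYTES + lane_byte_offset(lane_id);
}

// Counts how many distinct destinations the lanes of one warp produce.
inline std::size_t distinct_destinations() {
    std::set<uint32_t> seen;
    for (uint32_t lane = 0; lane < WARP_SIZE; ++lane) {
        seen.insert(lane_dst(lane));
    }
    return seen.size();
}

// Lanes 0 and 4 start a row; lane 5 is the second lane of its row.
static_assert(lane_byte_offset(0) == 0, "first lane of a row starts at offset 0");
static_assert(lane_byte_offset(4) == 0, "offset wraps at CHUNK_COPY_LINE_LANES");
static_assert(lane_byte_offset(5) == THREAD_COPY_BYTES, "second lane is one chunk in");
```

Under these assumptions distinct_destinations() equals WARP_SIZE, i.e. no two lanes write the same 16-byte chunk, which is what you want when each thread issues its own copy.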

FabianSchuetze commented 10 months ago

Thanks.