
CUDA programming #76

Open YunchaoYang opened 2 days ago

YunchaoYang commented 2 days ago

CUDA programming, which is essential for ML/AI optimization, is highly sought after in the ML industry, especially as we enter the LLM era. To make neural network training faster and more efficient, we need to pursue various ways to exploit the potential of GPU computing power and memory bandwidth. The last several years have witnessed brilliant research ideas for improving the training and inference speed of transformers by optimizing compute kernels and programming paradigms.

I have been trained as an NVIDIA DLI instructor on Accelerated Computing with Python/C++. The teaching kit and materials are wonderful. The CUDA Python course covers the fundamentals of the CUDA programming architecture, how GPU hardware works with CUDA, and how to use Numba, the just-in-time, type-specializing Python function compiler, to create and launch CUDA kernels that accelerate Python programs on massively parallel NVIDIA GPUs. The CUDA C++ course covers how to accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques.

Here I reviewed the key concepts:

General principles of efficient GPU programming

  1. Maximize parallelism (a minimal kernel sketch follows this list):
    • utilize many threads
    • balance the workload across threads and blocks
  2. Optimize memory access:
    • coalesce global memory accesses
    • use shared memory effectively
    • minimize global memory usage
    • avoid bank conflicts
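
As a minimal sketch of the parallelism principles above (the kernel name and launch sizes are illustrative assumptions, not from the course material), a grid-stride loop launches enough threads to keep the GPU busy while consecutive threads touch consecutive elements, so global memory accesses stay coalesced:

```cuda
// Minimal sketch: a grid-stride loop SAXPY kernel.
// Consecutive threads read consecutive elements (coalesced access), and the
// stride loop lets a fixed grid cover any problem size with a balanced workload.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Example launch (sizes are arbitrary illustrations):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
```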

Resources:

  1. NVIDIA DLI: Fundamentals of Accelerated Computing with CUDA Python/C++
  2. CUDA Mode Lecture series notes
  3. CUDA Mode Lecture 1: How to profile CUDA kernels in PyTorch, and lecture notes
  4. Intro-level Udacity CS344 and its GitHub repository
YunchaoYang commented 2 days ago

Parallel communication

is essentially done by memory mapping. Traditional MPI uses message passing through the network.

Parallel communication Patterns

YunchaoYang commented 2 days ago

CUDA guarantees that

  1. All threads in a block run on the same SM at the same time
  2. All blocks in a kernel finish before any block from the next kernel runs
YunchaoYang commented 2 days ago

Synchronization

  1. Barrier: a point in the program where all threads stop and wait until every thread has arrived
  2. __syncthreads(): places a barrier between the threads of a thread block; used to coordinate access to shared memory (a minimal sketch follows this list)
  3. Implicit barrier between kernels: a kernel launched after another (in the same stream) sees that kernel's results in global memory
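
A minimal sketch of the __syncthreads() barrier (the block-local array reversal task is just an illustration): the barrier guarantees that every thread has finished writing to shared memory before any thread reads another thread's element.

```cuda
// Minimal sketch: reverse a 64-element array within one block using shared memory.
// __syncthreads() is the barrier between the write phase and the read phase.
__global__ void reverse64(float *d)
{
    __shared__ float s[64];
    int t = threadIdx.x;

    s[t] = d[t];        // each thread writes one element into shared memory
    __syncthreads();    // barrier: all writes are visible before any thread reads
    d[t] = s[63 - t];   // safely read an element written by a different thread
}

// Launch with exactly one block of 64 threads: reverse64<<<1, 64>>>(d_data);
```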
YunchaoYang commented 1 day ago

CUDA stream

default stream

From the perspective of the host, the implicit data transfers are blocking or synchronous transfers, while the kernel launch is asynchronous.

non-default stream

Here "blocking" means blocking the host thread. CUDA kernel launches are always asynchronous with respect to the host.

Synchronization with streams

Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize(), which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt performance due to stalling the entire device and host thread.
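
A minimal sketch of a lighter-weight alternative (myKernel, the sizes, and the pointers are placeholders, not from the course material): issue the work in a non-default stream and block the host only on that stream with cudaStreamSynchronize().

```cuda
// Minimal sketch: per-stream synchronization instead of cudaDeviceSynchronize().
void copy_compute_copy(float *d_a, float *h_a, size_t bytes, int blocks, int threads)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copies and the kernel are issued into the same non-default stream; they run
    // in order within the stream but do not block the host. For the copies to be
    // truly asynchronous, h_a must be pinned memory (allocated with cudaMallocHost).
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<<blocks, threads, 0, stream>>>(d_a);   // myKernel is a placeholder
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host only until this stream's work finishes, rather than
    // stalling the entire device and host thread.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```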

read more:

  1. Article: How to Overlap Data Transfers in CUDA C/C++
  2. CUDA Developer Tools | Performance Analysis with NVIDIA Nsight Systems Timeline
YunchaoYang commented 1 day ago

CUDA Debug

  1. printf works inside kernel functions (a minimal sketch follows this list)
  2. use cuda-gdb for interactive debugging
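
A minimal sketch of both items (the kernel and file names are placeholders): device-side printf, with the usual cuda-gdb invocation shown as a comment.

```cuda
#include <cstdio>

// Minimal sketch: device-side printf for quick kernel debugging.
__global__ void debugKernel(const float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0)   // print from a single thread to avoid flooding the output
        printf("block %d, thread %d: x[0] = %f\n", blockIdx.x, threadIdx.x, x[0]);
}

// For interactive debugging, compile with device debug info and run under cuda-gdb:
//   nvcc -g -G app.cu -o app
//   cuda-gdb ./app
```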

CUDA Optimization

watch more:

YunchaoYang commented 1 day ago

What is Maximum Number of Threads Available on NVIDIA A100?

The NVIDIA A100 GPU, based on the Ampere architecture, allows a substantial number of threads to be executed in parallel. Here are the key thread-related limits:

Each SM on the A100 can have up to 2048 active threads (64 warps of 32 threads).

The A100 has 108 Streaming Multiprocessors (SMs). The total maximum number of threads that can be active across all SMs on the A100 is: 2048 threads/SM × 108 SMs = 221,184 threads.

This means that up to 221,184 threads can be active on the A100 at the same time.
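
These limits do not need to be hard-coded; a minimal sketch that queries them at runtime with cudaGetDeviceProperties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query the per-SM thread limit and SM count at runtime.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    int maxActive = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
    printf("%s: %d SMs x %d threads/SM = %d max active threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, maxActive);
    return 0;
}
// On an A100 this reports 108 SMs x 2048 threads/SM = 221184 threads.
```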

A warp is like a loading/unloading forklift truck. (diagram: CUDA-warp drawio)

YunchaoYang commented 1 day ago

Avoid memory access issues

Coalesced (Global) memory access

Two parties are involved: the threads in a warp + global memory.

Underlying reason: a warp fetches a contiguous chunk of memory in a single transaction, so the goal is to use all of the data in that one fetch.

When accessing global memory, the threads in a warp should access contiguous memory addresses.

Matrix transpose example:

Use shared memory to make sure both the read and the write are coalesced memory accesses (a minimal kernel sketch follows).
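
A minimal sketch of the shared-memory transpose (a square matrix with width divisible by a 32×32 tile is assumed; error handling is omitted): the tile is loaded row-wise and stored row-wise at the transposed block position, so both the global read and the global write are coalesced.

```cuda
#define TILE_DIM 32

// Minimal sketch: matrix transpose through a shared-memory tile so that both
// the global-memory read and the global-memory write are coalesced.
__global__ void transposeCoalesced(float *out, const float *in, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    // Swap the block indices so the output write also walks contiguous addresses.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

// Launch with dim3 block(TILE_DIM, TILE_DIM) and dim3 grid(width/TILE_DIM, width/TILE_DIM).
```

The column-wise read from the tile in the final line causes shared-memory bank conflicts, which the next section addresses.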

Memory Bank Conflicts:

Two parties are involved: the threads in a warp + shared memory.

Underlying reason: shared memory is organized into banks, and each bank can serve only one access at a time.

Shared memory is physically organized into 32 banks of 4 bytes each. When multiple threads of a warp access different addresses that fall in the same bank, the accesses are serialized; this is called a bank conflict.


Resolving bank conflicts

Add an offset so that consecutive threads map to different banks, for example by padding the shared-memory array by one element (a minimal sketch follows).
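
In the transpose sketch above, the column-wise read tile[threadIdx.x][threadIdx.y] makes all 32 threads of a warp hit the same bank. A minimal sketch of the common padding fix: declare the tile one element wider, which shifts each row into a different bank.

```cuda
// Padding the tile by one column removes the bank conflicts: row i now starts
// at bank (i * 33) % 32, so a column-wise access spreads across all 32 banks.
__shared__ float tile[TILE_DIM][TILE_DIM + 1];
```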