is essentially done by memory mapping; traditional MPI, by contrast, uses message passing through the network.
Parallel Communication Patterns
CUDA guarantees that all threads in a block run on the same SM at the same time, and that all blocks of a kernel finish before any block from the next kernel launch starts.
Synchronization
From the perspective of the host, the implicit data transfers are blocking or synchronous transfers, while the kernel launch is asynchronous.
“Blocking” here means blocking the host thread; CUDA kernel launches are asynchronous with respect to the host.
Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize(), which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt performance due to stalling the entire device and host thread.
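For reference, a minimal sketch (not from the original notes) contrasting the “heavy hammer” cudaDeviceSynchronize() with the lighter cudaStreamSynchronize(); the kernel, buffer, and sizes are placeholders:

```cuda
// Sketch: synchronizing the host with work issued in a non-default stream.
#include <cuda_runtime.h>

__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Kernel launch is asynchronous: control returns to the host immediately.
    kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    // Option 1 ("heavy hammer"): stall the host until ALL device work is done.
    // cudaDeviceSynchronize();

    // Option 2 (lighter): block the host only until work in this stream is done.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```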
read more:
cuda-gdb
watch more:
The NVIDIA A100 GPU, based on the Ampere architecture, allows a substantial number of threads to be executed in parallel. Here are the key thread-related limits:
Each SM on the A100 can have up to 2048 active threads (64 warps of 32 threads).
The A100 has 108 Streaming Multiprocessors (SMs). The total maximum number of threads that can be active across all SMs on the A100 is: 2048 threads/SM × 108 SMs = 221,184 threads
This means that up to 221,184 threads can be active on the A100 at the same time.
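These limits can also be queried at runtime instead of being hard-coded; a minimal sketch using cudaGetDeviceProperties (the printout format is my own):

```cuda
// Sketch: query per-SM and total resident-thread limits for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int maxThreads = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
    printf("%s: %d SMs x %d threads/SM = %d max resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, maxThreads);
    return 0;
}
```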
A warp is like a loading/unloading forklift truck: it moves memory in bulk rather than one item at a time.
Two parties are involved: threads in a warp + global memory.
Real reason: a warp fetches global memory in large chunks (transactions) and tries to serve all of its threads from one fetch.
When accessing global memory, threads in a warp need to access contiguous addresses so that the accesses can be coalesced.
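A hypothetical pair of kernels illustrating the difference (launch code omitted): coalesced_copy lets a warp be served by a few wide transactions, while strided_copy scatters the warp's accesses.

```cuda
// Coalesced: thread i touches element i, so a warp's 32 accesses fall in one
// contiguous segment and can be served by a single wide transaction.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads touch addresses `stride` elements apart, so a
// warp's accesses spread over many segments and cannot be coalesced.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```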
Use shared memory to make sure both global-memory read and write operations are coalesced, as in the matrix-transpose sketch below.
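A sketch of the classic shared-memory transpose (modeled on NVIDIA's transpose example, simplified to a 32×32 thread block): the tile is read row-wise from the input and written row-wise to the output, so both global-memory accesses are coalesced.

```cuda
// Sketch of a tiled transpose; launch with dim3 block(TILE_DIM, TILE_DIM)
// and a grid of (width/TILE_DIM, height/TILE_DIM) blocks.
#define TILE_DIM 32

__global__ void transpose_shared(const float *in, float *out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;                     // transposed block origin
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        // Coalesced write; note the column read of `tile` causes bank
        // conflicts, which the padding trick below removes.
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```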
Two parties are involved: threads in a warp + shared memory.
Real reason: shared memory is organized into banks, and each bank can serve only one access at a time.
Shared memory is physically organized into banks: 32 banks, each 4 bytes wide. When multiple threads in a warp access different addresses that fall into the same bank, the accesses are serialized; this is called a bank conflict.
Fix: add an offset to the thread-to-bank mapping, e.g., by padding the shared-memory array, as in the sketch below.
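A small self-contained demo of the padding trick (names and launch shape are mine): padding each shared-memory row by one float shifts the bank mapping so that a column access no longer serializes.

```cuda
// Launch with a single dim3(32, 32) block and a 1024-float output buffer.
#define TILE_DIM 32

__global__ void column_access_demo(float *out) {
    // Without padding, the column read tile[threadIdx.x][c] has a stride of
    // 32 floats, so all 32 threads of a warp hit the same bank (32-way conflict).
    // The +1 padding makes the stride 33 floats, spreading a column access
    // across all 32 banks, so it is conflict-free.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;  // row access: conflict-free
    __syncthreads();
    out[threadIdx.y * TILE_DIM + threadIdx.x] = tile[threadIdx.x][threadIdx.y];  // column access
}
```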
CUDA programming, which is essential for ML/AI optimization, is an incredibly sought-after skill in the ML industry, especially as we enter the LLM era. To make neural network training faster and more efficient, we need to pursue various ways of efficiently exploiting the potential of GPU computing power and memory bandwidth. The last several years have witnessed brilliant research ideas for improving the training and inference speed of transformers by effectively optimizing compute kernels and paradigms.
I have been trained as an NVIDIA DLI instructor on Accelerated Computing with Python/C++. The teaching kit and materials are wonderful. The CUDA Python course covers the fundamentals of the CUDA programming architecture, how GPU hardware works with CUDA, and how to use Numba (the just-in-time, type-specializing Python function compiler) to create and launch CUDA kernels that accelerate Python programs on massively parallel NVIDIA GPUs. The CUDA C++ course covers how to accelerate and optimize existing CPU-only C/C++ applications using the most essential CUDA tools and techniques.
Here I reviewed the key concepts:
General principles of efficient GPU programming
Resources: