
CUDA programming #76

Open YunchaoYang opened 2 days ago

YunchaoYang commented 2 days ago

CUDA programming, which is essential for ML/AI optimization, is highly sought after in the ML industry, especially as we enter the LLM era. To make neural network training faster and more efficient, we need to pursue various ways to exploit the potential of GPU computing power and memory bandwidth. The last several years have witnessed brilliant research ideas for improving the training and inference speed of transformers by optimizing compute kernels and programming paradigms.

I have been trained as an NVIDIA DLI instructor on Accelerated Computing with Python/C++. The teaching kit and materials are wonderful. The CUDA Python course covers the fundamentals of the CUDA programming architecture, how GPU hardware works with CUDA, and how to use Numba, the just-in-time, type-specializing Python function compiler, to create and launch CUDA kernels that accelerate Python programs on massively parallel NVIDIA GPUs. The CUDA C++ course covers how to accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques.

Here I reviewed the key concepts:

General principles of efficient GPU programming

  1. Maximize parallelism (a minimal kernel sketch follows this list):
    • utilize many threads
    • balance the workload across threads and blocks
  2. Optimize memory access:
    • coalesce global memory accesses
    • use shared memory effectively
    • minimize global memory usage
    • avoid bank conflicts
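
As a minimal sketch of the parallelism principles above (the kernel name and launch sizes are illustrative assumptions, not from the course material), a grid-stride loop launches enough threads to keep the GPU busy while consecutive threads touch consecutive elements, so global memory accesses stay coalesced:

```cuda
// Minimal sketch: a grid-stride loop SAXPY kernel.
// Consecutive threads read consecutive elements (coalesced access), and the
// stride loop lets a fixed grid cover any problem size with a balanced workload.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Example launch (sizes are arbitrary illustrations):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
```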

Resources:

  1. NVIDIA DLI: Fundamentals of Accelerated Computing with CUDA Python/C++
  2. CUDA Mode Lecture series notes
  3. CUDA Mode Lecture 1: How to profile CUDA kernels in PyTorch, and lecture notes
  4. Intro-level Udacity CS344 and its GitHub repository
YunchaoYang commented 2 days ago

Parallel communication

is essentially done by memory mapping. Traditional MPI uses message passing through the network.

Parallel communication Patterns

YunchaoYang commented 2 days ago

CUDA guarantees that

  1. All threads in a block run on the same SM at the same time
  2. All blocks in a kernel finish before any block from the next kernel runs
YunchaoYang commented 2 days ago

Synchronization

  1. Barrier: a point in the program where all threads stop and wait until every thread has arrived
  2. __syncthreads(): places a barrier between the threads of a thread block; used to coordinate access to shared memory (a minimal sketch follows this list)
  3. Implicit barrier between kernels: a kernel launched after another (in the same stream) sees that kernel's results in global memory
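
A minimal sketch of the __syncthreads() barrier (the block-local array reversal task is just an illustration): the barrier guarantees that every thread has finished writing to shared memory before any thread reads another thread's element.

```cuda
// Minimal sketch: reverse a 64-element array within one block using shared memory.
// __syncthreads() is the barrier between the write phase and the read phase.
__global__ void reverse64(float *d)
{
    __shared__ float s[64];
    int t = threadIdx.x;

    s[t] = d[t];        // each thread writes one element into shared memory
    __syncthreads();    // barrier: all writes are visible before any thread reads
    d[t] = s[63 - t];   // safely read an element written by a different thread
}

// Launch with exactly one block of 64 threads: reverse64<<<1, 64>>>(d_data);
```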
YunchaoYang commented 1 day ago

CUDA stream

default stream

From the perspective of the host, the implicit data transfers are blocking or synchronous transfers, while the kernel launch is asynchronous.

non-default stream

Here "blocking" means blocking the host thread. CUDA kernel launches are always asynchronous with respect to the host.

Synchronization with streams

Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize(), which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt performance due to stalling the entire device and host thread.
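
A minimal sketch of a lighter-weight alternative (myKernel, the sizes, and the pointers are placeholders, not from the course material): issue the work in a non-default stream and block the host only on that stream with cudaStreamSynchronize().

```cuda
// Minimal sketch: per-stream synchronization instead of cudaDeviceSynchronize().
void copy_compute_copy(float *d_a, float *h_a, size_t bytes, int blocks, int threads)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copies and the kernel are issued into the same non-default stream; they run
    // in order within the stream but do not block the host. For the copies to be
    // truly asynchronous, h_a must be pinned memory (allocated with cudaMallocHost).
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<<blocks, threads, 0, stream>>>(d_a);   // myKernel is a placeholder
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host only until this stream's work finishes, rather than
    // stalling the entire device and host thread.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```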

read more:

  1. Article: How to Overlap Data Transfers in CUDA C/C++
  2. CUDA Developer Tools | Performance Analysis with NVIDIA Nsight Systems Timeline
YunchaoYang commented 1 day ago

CUDA Debug

  1. printf works inside kernel functions (a minimal sketch follows this list)
  2. use cuda-gdb for interactive debugging
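
A minimal sketch of both items (the kernel and file names are placeholders): device-side printf, with the usual cuda-gdb invocation shown as a comment.

```cuda
#include <cstdio>

// Minimal sketch: device-side printf for quick kernel debugging.
__global__ void debugKernel(const float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0)   // print from a single thread to avoid flooding the output
        printf("block %d, thread %d: x[0] = %f\n", blockIdx.x, threadIdx.x, x[0]);
}

// For interactive debugging, compile with device debug info and run under cuda-gdb:
//   nvcc -g -G app.cu -o app
//   cuda-gdb ./app
```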

CUDA Optimization

watch more:

YunchaoYang commented 1 day ago

What is Maximum Number of Threads Available on NVIDIA A100?

The NVIDIA A100 GPU, based on the Ampere architecture, allows a substantial number of threads to be executed in parallel. Here are the key thread-related limits:

Each SM on the A100 can have up to 2048 active threads (64 warps of 32 threads).

The A100 has 108 Streaming Multiprocessors (SMs). The total maximum number of threads that can be active across all SMs on the A100 is: 2048 threads/SM × 108 SMs = 221,184 threads.

This means that up to 221,184 threads can be active on the A100 at the same time.
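
These limits do not need to be hard-coded; a minimal sketch that queries them at runtime with cudaGetDeviceProperties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query the per-SM thread limit and SM count at runtime.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    int maxActive = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
    printf("%s: %d SMs x %d threads/SM = %d max active threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, maxActive);
    return 0;
}
// On an A100 this reports 108 SMs x 2048 threads/SM = 221184 threads.
```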

A warp is like a loading/unloading forklift truck. (diagram: CUDA-warp drawio)

YunchaoYang commented 1 day ago

Avoid memory access issues

Coalesced (Global) memory access

Two parties are involved: the threads in a warp + global memory.

Underlying reason: a warp fetches a contiguous chunk of memory in a single transaction, so the goal is to use all of the data in that one fetch.

When accessing global memory, the threads in a warp should access contiguous memory addresses.

Matrix transpose example:

Use shared memory to make sure both the read and the write are coalesced memory accesses (a minimal kernel sketch follows).
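
A minimal sketch of the shared-memory transpose (a square matrix with width divisible by a 32×32 tile is assumed; error handling is omitted): the tile is loaded row-wise and stored row-wise at the transposed block position, so both the global read and the global write are coalesced.

```cuda
#define TILE_DIM 32

// Minimal sketch: matrix transpose through a shared-memory tile so that both
// the global-memory read and the global-memory write are coalesced.
__global__ void transposeCoalesced(float *out, const float *in, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    // Swap the block indices so the output write also walks contiguous addresses.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

// Launch with dim3 block(TILE_DIM, TILE_DIM) and dim3 grid(width/TILE_DIM, width/TILE_DIM).
```

The column-wise read from the tile in the final line causes shared-memory bank conflicts, which the next section addresses.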

Memory Bank Conflicts:

Two parties are involved: the threads in a warp + shared memory.

Underlying reason: shared memory is organized into banks, and each bank can serve only one access at a time.

Shared memory is physically organized into 32 banks of 4 bytes each. When multiple threads of a warp access different addresses that fall in the same bank, the accesses are serialized; this is called a bank conflict.


Resolving bank conflicts

Add an offset so that consecutive threads map to different banks, for example by padding the shared-memory array by one element (a minimal sketch follows).
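
In the transpose sketch above, the column-wise read tile[threadIdx.x][threadIdx.y] makes all 32 threads of a warp hit the same bank. A minimal sketch of the common padding fix: declare the tile one element wider, which shifts each row into a different bank.

```cuda
// Padding the tile by one column removes the bank conflicts: row i now starts
// at bank (i * 33) % 32, so a column-wise access spreads across all 32 banks.
__shared__ float tile[TILE_DIM][TILE_DIM + 1];
```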