YunchaoYang commented 1 month ago

Key Takeaway:

GPUs care about parallelism
GPUs care about throughput (bandwidth) rather than the latency of a single thread (task)
Memory hierarchy

AX+Y demo

With a memory bandwidth of 131 GB/s

Latency: 11659 bytes are moved in 11659 / 131 = 89 ns, i.e. one 89 ns memory latency period covers 131 GB/s × 89 ns = 11659 bytes.

Concurrency

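Need to have 729 iterations in flight at once, i.e. a computation load of 729 concurrent iterations, to cover the bytes moved during one latency period.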
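The figures above fit Little's law: bytes in flight = bandwidth × latency. Below is a minimal Python sketch of that arithmetic. The 16 bytes per iteration (two 8-byte operands loaded per AX+Y step) is an assumption that these notes do not state; it is used here because it reproduces the 729 figure.

```python
import math

def bytes_in_flight(bandwidth_gb_per_s: int, latency_ns: int) -> int:
    # GB/s * ns = (1e9 bytes/s) * (1e-9 s) = bytes, so the unit factors cancel.
    return bandwidth_gb_per_s * latency_ns

def iterations_needed(bytes_needed: int, bytes_per_iteration: int) -> int:
    # Round up: a partial iteration still has to be issued.
    return math.ceil(bytes_needed / bytes_per_iteration)

in_flight = bytes_in_flight(131, 89)       # figures from the notes above
iters = iterations_needed(in_flight, 16)   # 16 bytes/iteration is an assumption
print(in_flight, iters)  # 11659 729
```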

Parallel

Use a parallel for to start the threads.

There are 5.6x more threads available than required.

GPU cache

GPU threads

SM memory

64 warps/SM
64K 4-byte registers per SM
192 KB L1/shared memory.
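A short Python sketch of what these numbers imply, assuming the register count is one pool shared by all resident threads of an SM (an assumption made here for illustration):

```python
WARPS_PER_SM = 64            # from the notes above
THREADS_PER_WARP = 32        # NVIDIA warp size
REGISTERS = 64 * 1024        # 64K registers, assumed to be one per-SM pool
BYTES_PER_REGISTER = 4

max_threads = WARPS_PER_SM * THREADS_PER_WARP
register_file_kb = REGISTERS * BYTES_PER_REGISTER // 1024
# Registers left for each thread when every warp slot is occupied.
registers_per_thread = REGISTERS // max_threads

print(max_threads, register_file_kb, registers_per_thread)  # 2048 256 32
```

Under that assumption, a kernel that needs more than 32 registers per thread cannot keep all 64 warp slots filled.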

YunchaoYang commented 1 month ago

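Relationship between GPU architecture and CUDA

GPU thread hierarchy levels

Arithmetic intensity is the ratio of the number of arithmetic operations a computation performs to the amount of data it transfers. It is a key measure of how compute-intensive a task is and helps predict how an algorithm will perform on different hardware. Knowing a task's arithmetic intensity makes its characteristics clearer and helps in choosing suitable optimization strategies and hardware configurations. The formula is:

Arithmetic intensity = arithmetic operations / data transferred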
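As a worked example of this ratio, here is a minimal Python sketch for an AX+Y update, y[i] = a * x[i] + y[i]. The per-element counts are assumptions for illustration: one multiply plus one add, double-precision (8-byte) values, and three memory accesses (read x, read y, write y) with no cache reuse.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    # Ratio of arithmetic work to data traffic, in FLOP per byte.
    return flops / bytes_moved

FLOPS_PER_ELEMENT = 2        # one multiply and one add (assumed)
BYTES_PER_ELEMENT = 3 * 8    # read x, read y, write y as 8-byte doubles (assumed)

axpy_intensity = arithmetic_intensity(FLOPS_PER_ELEMENT, BYTES_PER_ELEMENT)
print(round(axpy_intensity, 4))  # 0.0833
```

A value this far below 1 FLOP per byte suggests the operation is limited by memory bandwidth rather than by arithmetic throughput.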

Warp Scheduler

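A warp is a bundle of threads. Logically, all threads execute in parallel, but from the hardware's point of view not all threads can execute at the same instant, which is why the warp is introduced. A warp is the basic execution unit of an SM. One warp contains 32 parallel threads (warp_size=32), and these 32 threads execute in SIMT (Single Instruction Multiple Thread) mode.

In other words, all threads execute the same instruction in lockstep, but each thread applies it to its own data and may follow its own branch.

If fewer than 32 threads in a warp have work to do, the warp still runs as a whole, but the idle threads are inactive. Note that the thread is the smallest logical unit, while the warp is the hardware execution unit.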
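The effect of a partially filled warp can be sketched in Python. The helper name warp_layout is hypothetical and the block sizes are example values:

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def warp_layout(threads_in_block: int) -> tuple[int, int]:
    """Return (warps used, inactive lanes in the last warp) for one block."""
    warps = math.ceil(threads_in_block / WARP_SIZE)
    inactive_lanes = warps * WARP_SIZE - threads_in_block
    return warps, inactive_lanes

print(warp_layout(100))  # (4, 28)
print(warp_layout(256))  # (8, 0)
```

A block of 100 threads still occupies 4 full warps, so 28 lanes of the last warp sit idle. This is one reason block sizes are usually chosen as multiples of 32.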

In CUDA programming, the threads of a block are synchronized only at explicit points, such as calls to __syncthreads(). This synchronization is not automatic at every step of the computation. Physically, the threads within a block are divided into smaller units called warps, and these warps are executed independently.

A warp typically consists of 32 threads that are executed in lockstep. This means that all 32 threads in a warp execute the same instruction at the same time, but on different data.

Number of threads per block: a multiple of the warp size (32 threads), typically 128, 256, 512, or 1024 threads per block.
Number of blocks per grid: depends on several factors, chiefly the problem size (a large dataset with millions of elements needs more blocks), but also on the hardware and on the use of dynamic parallelism.
How many warps can be launched per SM? The number of warps that can be resident and executing concurrently on a Streaming Multiprocessor (SM) depends on the GPU architecture. Each SM has hardware limits on the number of warps, threads, registers, and shared memory it can support.

Key Factors:

Warp size: A warp consists of 32 threads on NVIDIA GPUs.
Maximum warps per SM: This depends on the specific GPU architecture (e.g., Volta, Ampere, Turing, etc.), as each generation has different hardware capabilities.

Typical Limits:

Volta (V100):

Maximum of 64 warps per SM.
This results in a total of 64 warps × 32 threads/warp = 2048 threads per SM.

Ampere (A100):

Maximum of 64 warps per SM, same as Volta.
This also allows up to 2048 threads per SM.

Example Calculations:

On an Ampere A100 GPU:

Each SM can accommodate up to 64 warps, or 2048 threads.
If each block has 256 threads (or 8 warps), you could have up to 8 blocks active concurrently in an SM, depending on resource usage.
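The calculation above can be sketched in Python for several block sizes. This is an upper bound only: it applies just the 64-warps-per-SM limit quoted above and ignores register use, shared memory use, and any per-SM cap on the number of blocks. The function name is hypothetical.

```python
WARP_SIZE = 32

def max_resident_blocks(threads_per_block: int, max_warps_per_sm: int = 64) -> int:
    # Warps are allocated whole, so round the block up to full warps.
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    return max_warps_per_sm // warps_per_block

for threads_per_block in (128, 256, 512, 1024):
    print(threads_per_block, max_resident_blocks(threads_per_block))
# 128 16
# 256 8
# 512 4
# 1024 2
```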

CUDA libraries

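On the software side, CUDA consists of a CUDA library, an application programming interface (API) with its runtime, and two higher-level general-purpose math libraries, CUFFT and CUBLAS.

The CUDA Toolkit includes the compiler and the C++ core, and the CUDA Driver drives the GPU and handles memory and graphics management. CUDA-X Libraries mainly provide acceleration libraries for machine learning, deep learning, and high-performance computing (HPC), while Apps & Frameworks mainly interface with frameworks such as TensorFlow and PyTorch.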

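CUDA Threads: thread hierarchy

The most basic execution unit in CUDA is the thread; each curve in the figure can be viewed as a single thread. The large grid is divided into smaller parts, namely blocks that each contain the same number of threads. The threads in a block execute independently and can exchange data and synchronize through local data sharing. CUDA can therefore partition a problem into sub-problems that independent thread blocks solve in parallel, and each sub-problem into pieces that the threads within a block solve cooperatively in parallel.

CUDA introduces the notions of host and device. A CUDA program contains both host code and device code, and the host and device can communicate in order to copy data. The host manages data and controls program flow, while the device executes the parallel computation. In CUDA programming, a kernel is a function executed in parallel on the GPU: developers write kernels to describe the parallel task, then launch them from the host to run on the GPU.

A kernel is declared with the __global__ qualifier, and its launch uses <<<grid, block>>> to specify the thread structure it executes with. The code cuda_device.cu implements a GPU computation in CUDA. It involves both host (CPU) and device (GPU) computation and uses the __global__ declaration to turn the add function into a kernel that the GPU can execute.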
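To make the launch configuration concrete, here is a sequential Python emulation of how a <<<grid, block>>> launch maps each thread to one array element. It is not the contents of cuda_device.cu, which is not reproduced in these notes. The names add_kernel and launch are hypothetical, and a real GPU would run these threads in parallel rather than in nested loops.

```python
def add_kernel(block_idx, block_dim, thread_idx, x, y, out):
    # Same global index a 1D CUDA kernel would compute:
    # blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the last block may have surplus threads
        out[i] = x[i] + y[i]

def launch(kernel, grid, block, *args):
    # Emulates kernel<<<grid, block>>>(args) by visiting every thread in turn.
    for block_idx in range(grid):
        for thread_idx in range(block):
            kernel(block_idx, block, thread_idx, *args)

n = 1000
x, y, out = [1.0] * n, [2.0] * n, [0.0] * n
block = 256                        # threads per block, a multiple of 32
grid = (n + block - 1) // block    # enough blocks to cover all n elements
launch(add_kernel, grid, block, x, y, out)
print(grid, out[0], out[-1])  # 4 3.0 3.0
```

The bounds check matters because 4 blocks of 256 threads give 1024 threads for only 1000 elements.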

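To realise this parallel computation on the GPU hardware, CUDA organizes threads into three levels: Grid, Block, and Thread:

Thread hierarchy I - Grid: when a kernel executes on the device, it actually launches many threads. All threads launched by one kernel form a grid, and threads in the same grid share the same global memory space. The grid is the first level of the thread structure.

Thread hierarchy II - Block: a grid is divided into multiple thread blocks, and each block contains many threads. Blocks execute in parallel, cannot communicate with one another, and have no defined execution order. Each block has shared memory, which is shared by the threads inside it.

Thread hierarchy III - Thread: a CUDA parallel program is actually executed by many threads, which are grouped into thread blocks. Threads in the same block can synchronize and can communicate through shared memory.

CUDA therefore maps onto NVIDIA hardware as follows. On the software side one sees threads executing, and on the hardware side each thread runs on a CUDA core. Threads are oversubscribed in software, whereas the number of CUDA cores in hardware is fixed. A thread block is scheduled on a single SM, warp by warp. Once a block has been assigned to an SM, it stays there until the kernel finishes. An SM can hold multiple thread blocks at once, and the TPCs and GPCs, each made up of multiple SMs, implement the GPU's parallel computation in hardware.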

3 concepts

threads	CUDA core	kernel
logical computing unit	hardware unit	defined function on GPU

YunchaoYang commented 1 month ago

original: https://github.com/chenzomi12/AISystem/blob/main/02Hardware/03GPUBase/03Concept.md

YunchaoYang / Blogs

GPU threads and cache #75
