Start with a simple C++ program that adds the elements of two arrays of a million elements each:

#include <iostream>
#include <cmath>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}
I want to get this computation running (in parallel) on the many cores of a GPU. First, I need to turn the `add` function into a function that the GPU can run, called a kernel in CUDA. To do this, I add the specifier `__global__` to the function, which tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.
// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
These `__global__` functions are known as kernels, and code that runs on the GPU is often called device code, while code that runs on the CPU is host code. CUDA provides three function execution space specifiers:
- `__global__` - Runs on the GPU, called from the CPU or the GPU*. Executed with `<<<dim3>>>` arguments. (*`__global__` functions can be called from other `__global__` functions.)
- `__device__` - Runs on the GPU, called from the GPU. Can be used with variables too.
- `__host__` - Runs on the CPU, called from the CPU.
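A minimal sketch illustrating the three specifiers together (the function names `square`, `square_all`, and `launch` are hypothetical, chosen just for this example):

```cpp
#include <cstdio>

// __device__: runs on the GPU, callable only from GPU code
__device__ float square(float v) { return v * v; }

// __global__: a kernel; runs on the GPU, launched from host code with <<< >>>
__global__ void square_all(int n, float *data)
{
    for (int i = 0; i < n; i++)
        data[i] = square(data[i]);  // kernels may call __device__ functions
}

// __host__: runs on the CPU (the default for unannotated functions)
__host__ void launch(int n, float *data)
{
    square_all<<<1, 1>>>(n, data);
    cudaDeviceSynchronize();
}
```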
To allocate data in Unified Memory, replace the calls to `new` in the code above with calls to `cudaMallocManaged()`, and replace the calls to `delete []` with calls to `cudaFree()`.

// Allocate Unified Memory -- accessible from CPU or GPU
float *x, *y;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
...
// Free memory
cudaFree(x);
cudaFree(y);
Side Note:
With CUDA 6, NVIDIA introduced Unified Memory. Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU through a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device, so that it looks like CPU memory to code running on the CPU and like GPU memory to code running on the GPU.
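The Unified Memory sections of the nvprof output below show exactly this migration happening (Host To Device transfers and GPU page fault groups). On systems that support it, the migration can also be hinted explicitly with `cudaMemPrefetchAsync()`; this is an optional sketch, not part of the original program, and it would go just before the kernel launch:

```cpp
// Optional: prefetch the managed arrays to the GPU before launching the
// kernel, so pages migrate up front rather than on demand via page faults.
// (x, y, and N are the variables from the program above.)
int device = 0;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device);
cudaMemPrefetchAsync(y, N*sizeof(float), device);
```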
Finally, I need to launch the `add()` kernel, which invokes it on the GPU. CUDA kernel launches are specified using the triple angle bracket syntax `<<< >>>`:

add<<<1, 1>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
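Putting the pieces together, here is the complete program as it stands at this point (my assembly of the fragments above into cuda_test.cu, the file name used by the nvcc command below):

```cpp
#include <iostream>
#include <cmath>

// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20; // 1M elements

    // Allocate Unified Memory -- accessible from CPU or GPU
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
```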
Compile with nvcc, the CUDA C++ compiler, and profile the run with nvprof:

~$ nvcc cuda_test.cu -o add_cuda
~$ nvprof ./add_cuda
==3260685== NVPROF is profiling process 3260685, command: ./add_cuda
Max error: 0
==3260685== Profiling application: ./add_cuda
==3260685== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 174.22ms 1 174.22ms 174.22ms 174.22ms add(int, float*, float*)
API calls: 41.92% 176.25ms 2 88.126ms 30.539us 176.22ms cudaMallocManaged
41.44% 174.22ms 1 174.22ms 174.22ms 174.22ms cudaDeviceSynchronize
16.51% 69.411ms 1 69.411ms 69.411ms 69.411ms cudaLaunchKernel
0.09% 378.67us 2 189.33us 175.03us 203.64us cudaFree
0.04% 166.34us 114 1.4590us 141ns 63.183us cuDeviceGetAttribute
0.00% 15.073us 1 15.073us 15.073us 15.073us cuDeviceGetName
0.00% 7.6820us 1 7.6820us 7.6820us 7.6820us cuDeviceGetPCIBusId
0.00% 1.4780us 3 492ns 191ns 1.0030us cuDeviceGetCount
0.00% 1.0420us 2 521ns 184ns 858ns cuDeviceGet
0.00% 506ns 1 506ns 506ns 506ns cuDeviceTotalMem
0.00% 447ns 1 447ns 447ns 447ns cuModuleGetLoadingMode
0.00% 277ns 1 277ns 277ns 277ns cuDeviceGetUuid
==3260685== Unified Memory profiling result:
Device "NVIDIA TITAN X (Pascal) (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 759.4160us Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 339.1450us Device To Host
12 - - - - 1.972004ms Gpu page fault groups
Total CPU Page faults: 36
The execution configuration is specified as `<<<numBlocks, blockSize>>>`. To launch the kernel with a single block of 256 threads:

add<<<1, 256>>>(N, x, y);

Launching more threads is not enough by itself; I need to modify the kernel to spread the computation across the parallel threads.
__global__
void add(int n, float *x, float *y)
{
int index = threadIdx.x;
int stride = blockDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
threadIdx.x contains the index of the current thread within its block, and blockDim.x contains the number of threads in the block.
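To see these built-ins concretely, here is a tiny hypothetical kernel (not part of the program) that just prints each thread's position within its block:

```cpp
#include <cstdio>

// Hypothetical illustration: each thread prints its index within the block.
__global__ void show_indices()
{
    printf("thread %d of %d\n", threadIdx.x, blockDim.x);
}

int main()
{
    show_indices<<<1, 4>>>();   // one block of 4 threads
    cudaDeviceSynchronize();    // prints four lines; ordering is not guaranteed
    return 0;
}
```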
==3271798== NVPROF is profiling process 3271798, command: ./add_cuda
Max error: 0
==3271798== Profiling application: ./add_cuda
==3271798== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.5162ms 1 2.5162ms 2.5162ms 2.5162ms add(int, float*, float*)
API calls: 89.35% 167.09ms 2 83.545ms 24.286us 167.07ms cudaMallocManaged
8.97% 16.782ms 1 16.782ms 16.782ms 16.782ms cudaLaunchKernel
1.34% 2.5153ms 1 2.5153ms 2.5153ms 2.5153ms cudaDeviceSynchronize
0.21% 399.97us 2 199.99us 182.70us 217.27us cudaFree
0.10% 191.04us 114 1.6750us 136ns 73.971us cuDeviceGetAttribute
0.01% 17.896us 1 17.896us 17.896us 17.896us cuDeviceGetName
0.00% 8.7150us 1 8.7150us 8.7150us 8.7150us cuDeviceGetPCIBusId
0.00% 1.4980us 3 499ns 197ns 927ns cuDeviceGetCount
0.00% 1.1800us 2 590ns 186ns 994ns cuDeviceGet
0.00% 799ns 1 799ns 799ns 799ns cuDeviceTotalMem
0.00% 460ns 1 460ns 460ns 460ns cuModuleGetLoadingMode
0.00% 328ns 1 328ns 328ns 328ns cuDeviceGetUuid
==3271798== Unified Memory profiling result:
Device "NVIDIA TITAN X (Pascal) (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 758.6180us Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 340.6860us Device To Host
12 - - - - 1.743612ms Gpu page fault groups
Total CPU Page faults: 36
CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Titan X GPU based on the Pascal GPU Architecture has 28 SMs, each capable of supporting up to 2048 active threads.
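These limits can be queried for the device you are actually running on; a small sketch using the CUDA runtime API:

```cpp
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("%s: %d SMs, up to %d threads per SM\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    // On the Titan X (Pascal) profiled here, this should report 28 SMs
    // and 2048 threads per SM.
    return 0;
}
```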
Together, the blocks of parallel threads make up what is known as the grid. Since there are N elements to process and 256 threads per block, I need enough blocks to get at least N threads, so I divide N by the block size, rounding up in case N is not a multiple of blockSize:

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
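As a quick sanity check of the round-up division (a standalone host-side snippet, not part of the CUDA program):

```cpp
#include <cassert>

int main()
{
    int N = 1 << 20;         // 1048576 elements
    int blockSize = 256;     // threads per block
    int numBlocks = (N + blockSize - 1) / blockSize;
    assert(numBlocks == 4096);           // 1048831 / 256 truncates to 4096
    assert(numBlocks * blockSize >= N);  // at least one thread per element
    return 0;
}
```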
__global__
void add(int n, float *x, float *y)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
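Since this launch already supplies at least one thread per element, a common alternative (my sketch, not the version profiled here; the nvprof run that follows uses the grid-stride kernel above) drops the loop and guards with a bounds check:

```cpp
// Alternative: one element per thread, guarded by a bounds check.
// Correct only if the launch supplies at least n threads in total.
__global__
void add(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + y[i];
}
```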
==3275921== NVPROF is profiling process 3275921, command: ./add_cuda
Max error: 0
==3275921== Profiling application: ./add_cuda
==3275921== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.0485ms 1 2.0485ms 2.0485ms 2.0485ms add(int, float*, float*)
API calls: 89.46% 155.38ms 2 77.692ms 24.574us 155.36ms cudaMallocManaged
9.01% 15.646ms 1 15.646ms 15.646ms 15.646ms cudaLaunchKernel
1.18% 2.0493ms 1 2.0493ms 2.0493ms 2.0493ms cudaDeviceSynchronize
0.23% 398.66us 2 199.33us 182.16us 216.50us cudaFree
0.10% 181.72us 114 1.5940us 180ns 72.099us cuDeviceGetAttribute
0.01% 16.007us 1 16.007us 16.007us 16.007us cuDeviceGetName
0.01% 9.8340us 1 9.8340us 9.8340us 9.8340us cuDeviceGetPCIBusId
0.00% 1.4360us 3 478ns 232ns 914ns cuDeviceGetCount
0.00% 1.0640us 2 532ns 192ns 872ns cuDeviceGet
0.00% 630ns 1 630ns 630ns 630ns cuDeviceTotalMem
0.00% 458ns 1 458ns 458ns 458ns cuModuleGetLoadingMode
0.00% 360ns 1 360ns 360ns 360ns cuDeviceGetUuid
==3275921== Unified Memory profiling result:
Device "NVIDIA TITAN X (Pascal) (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
97 84.453KB 4.0000KB 0.9922MB 8.000000MB 803.9990us Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 340.3340us Device To Host
16 - - - - 2.437333ms Gpu page fault groups
Total CPU Page faults: 36
In this kernel, `blockIdx.x` is the index of the current block within the grid and `gridDim.x` is the number of blocks in the grid, so each thread starts at its global index (`blockIdx.x * blockDim.x + threadIdx.x`) and strides by the total number of threads in the grid. This type of loop in a CUDA kernel is often called a grid-stride loop.
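One practical consequence is that the launch configuration no longer has to match the problem size. For example (a hypothetical launch, not one of the runs profiled above), a grid of only 32 blocks still computes every element; each thread just executes more loop iterations:

```cpp
// The grid-stride loop keeps the kernel correct for any grid size:
// 32 blocks * 256 threads = 8192 threads cover all 1M elements by striding.
add<<<32, 256>>>(N, x, y);
```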
Version | Time | Bandwidth |
---|---|---|
1 CUDA Thread | 174.22ms | |
1 CUDA Block | 2.5162ms | |
Many CUDA Blocks | 2.1255ms | |
Many GPUs | | |
A block consists of many threads. In our case, `block_dim == block_size == num_threads == 256`.

Similarly, a grid consists of many blocks. In our case, `grid_dim == grid_size == 4096`.
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
https://devblogs.nvidia.com/even-easier-introduction-cuda/