HigherOrderCO / HVM

A massively parallel, optimal functional runtime in Rust
https://higherorderco.com
Apache License 2.0

Bitonic Sort fails on CUDA with (error code an illegal memory access was encountered) #314

Open developedby opened 1 month ago

developedby commented 1 month ago

Originally from https://github.com/HigherOrderCO/Bend/issues/364 by user @ethanbarry

Description

When I run the compiled CUDA bitonic sorter example (linked in the README) I get this error:

Failed to launch kernels (error code an illegal memory access was encountered)!

To Reproduce

Steps to reproduce the behavior:

bend gen-cu sorter.bend > sorter.cu
nvcc sorter.cu -o sorter
prime-run ./sorter (launches it on the dGPU on Arch Linux)
Error received.

Expected behavior

The program runs on the GPU.

Desktop:

OS: Linux (Arch 6.9.1-arch1-1)
CPU: Intel i7-11800H
GPU: RTX 3050 Ti Mobile
GPU Driver: Nvidia open kernel modules v550.78
CUDA release 12.4, V12.4.131

Additional context

The program runs using the C codegen backend, but with the CUDA backend, it seems to fail regardless of what I do. If anyone is curious about the prime-run command, it's really just a script that forces the dGPU to handle a task - nothing fancy.

NotCyberLemon commented 1 month ago

I came here from the main Bend repo, from the issue "Bitonic Sort example failed with GPU kernel error".

I too am having a kernel memory issue:

$ ./sorter # The same as prime-run due to environment variables already being set.
| Failed to launch kernels (error code an illegal memory access was encountered)!

I am also hitting this issue on a mobile GPU.

Some GPU properties and info from exec:

--- General Information for device 0 ---
Name: NVIDIA GeForce RTX 3060 Laptop GPU
Compute capability: 8.6
Clock rate: 1425000
Device copy overlap: Enabled
Kernel execution timeout: Enabled

--- Memory Information for device 0 ---
Total global memory: 5996544000
Total constant memory: 65536
Max memory pitch: 2147483647
Texture alignment: 512

--- MP Information for device 0 ---
Multiprocessor count: 30
Shared memory per MP: 49152
Registers per MP: 65536
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (2147483647, 65535, 65535)

--- Memory Allocation Test ---
Memory allocation successful!

Specs:

OS: Arch Linux x86_64
Kernel: 6.9.1-zen1-1-zen
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q
GPU Driver: nvidia-open-dkms 550.78-4
CUDA Version: 12.4.1-4

On top of that, running it through bend run-cu ./sorter seems to hang indefinitely; after a while of testing, I was unable to find the cause or where execution is getting stuck.

2lian commented 1 month ago

I had the same issue. I cloned HVM and changed the LNet setting according to #283, but the current repo version (v2.0.14) does not work with Bend, and I did not know where to find v2.0.13 (the version Bend needs).

I have never used cargo, so excuse me if this is black magic, but this is how I fixed it for Bend:

mkdir ~/hvmtmp
cd ~/hvmtmp
cargo init
cargo add hvm@=2.0.13
cargo vendor vendor
cd vendor/hvm

You are now inside the source of hvm v2.0.13.

Open src/hvm.cu and edit line 334: reduce L_NODE_LEN and L_VARS_LEN, but do not reduce them too much. These values work on my GTX 1080 Ti:

// Local Net
const u32 L_NODE_LEN = 0x2000/4;
const u32 L_VARS_LEN = 0x2000/4;
struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};

Now go back to the hvm v2.0.13 you downloaded and install it:

cd ~/hvmtmp/vendor/hvm
cargo +nightly install --path .

This should work; you can now delete ~/hvmtmp.

VictorTaelin commented 1 month ago

I wonder why they needed /4 there - 0x1000 should be safe for every architecture, shouldn't it? AFAIK all devices support 48KB shared memory. Perhaps this is using a little bit more, due to the other shared structures?

2lian commented 1 month ago

I wonder why they needed /4 there - 0x1000 should be safe for every architecture

To report more on this: on my GTX 1080 Ti (WSL2, CUDA Toolkit 12.3), I tried several values. Only 0x2000/4 and 0x0500 worked.

gladmo commented 1 month ago

None of the suggested values worked for me on my GTX 1050 Ti.

OS: CentOS Linux release 7.9.2009 (Core)
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
GPU: GTX 1050 Ti
GPU Driver: Nvidia open kernel modules v550.78
CUDA release 12.4, V12.4.131

$ nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   58C    P8             N/A /   72W |       2MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

test example:

$ time bend run-c sorter.bend
Result: 16646144
bend run-c sorter.bend  47.63s user 0.32s system 435% cpu 11.001 total

$ time bend run-cu sorter.bend
Errors:
1.Failed to parse result from HVM.
Output from HVM was:
"Failed to launch kernels. Error code: an illegal memory access was encountered.\n""exit status: 1"""

bend run-cu sorter.bend  0.03s user 0.06s system 89% cpu 0.097 total

TimotejFasiang commented 1 month ago

Did anyone manage to find some L_NODE_LEN and L_VARS_LEN values that work for other GPUs?