developedby opened this issue 1 month ago
I am here from the main Bend repo, from the issue "Bitonic Sort example failed with GPU kernel error".
I too am having a kernel memory issue:
$ ./sorter # The same as prime-run due to environment variables already being set.
Failed to launch kernels (error code an illegal memory access was encountered)!
I am also running on a mobile GPU, which is where I am getting this issue.
Some GPU properties and info from the test executable:
--- General Information for device 0 ---
Name: NVIDIA GeForce RTX 3060 Laptop GPU
Compute capability: 8.6
Clock rate: 1425000
Device copy overlap: Enabled
Kernel execution timeout: Enabled
--- Memory Information for device 0 ---
Total global memory: 5996544000
Total constant memory: 65536
Max memory pitch: 2147483647
Texture alignment: 512
--- MP Information for device 0 ---
Multiprocessor count: 30
Shared memory per MP: 49152
Registers per MP: 65536
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (2147483647, 65535, 65535)
--- Memory Allocation Test ---
Memory allocation successful!
Specs:
OS: Arch Linux x86_64
Kernel: 6.9.1-zen1-1-zen
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q
GPU Driver: nvidia-open-dkms 550.78-4
CUDA Version: 12.4.1-4
On top of that, when running it through bend run-cu ./sorter, it seems to run indefinitely; after a while of testing, I have been unable to find what the cause is, or what the execution is getting stuck on.
I had the same issue. I cloned HVM and changed the LNet setting according to #283, but the current repo version (v2.0.14) does not work with Bend, and I did not know where to find v2.0.13 (the version Bend expects).
I have never used cargo, so excuse me if I am doing some black magic here, but this is how I fixed it for Bend:
mkdir ~/hvmtmp
cd ~/hvmtmp
cargo init                # create a throwaway crate to resolve dependencies in
cargo add hvm@=2.0.13     # pin the exact hvm version Bend expects
cargo vendor vendor       # download dependency sources into ./vendor
cd vendor/hvm
You are now inside the source of hvm v2.0.13. Open src/hvm.cu and, around line 334, reduce L_NODE_LEN and L_VARS_LEN, but do not reduce them too much. These values work on my GTX 1080 Ti:
// Local Net
const u32 L_NODE_LEN = 0x2000/4;
const u32 L_VARS_LEN = 0x2000/4;
struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};
Now go back to the hvm v2.0.13 you downloaded and install it:
cd ~/hvmtmp/vendor/hvm
cargo +nightly install --path .
This should work; you can now delete ~/hvmtmp.
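To sanity-check a given pair of values before rebuilding, you can compare the struct's size against your card's per-block shared-memory limit. A minimal sketch (my own, not part of HVM; it assumes an 8-byte Pair and a 4-byte Port, and that LNet has to fit in __shared__ memory, as the 48 KB discussion below suggests):

// check_lnet.cu - build with: nvcc check_lnet.cu -o check_lnet
#include <cstdio>

typedef unsigned int u32;
typedef unsigned long long Pair; // assumption: two 32-bit ports, 8 bytes
typedef u32 Port;                // assumption: one 32-bit port, 4 bytes

// Candidate values, same as in the patch above.
const u32 L_NODE_LEN = 0x2000/4;
const u32 L_VARS_LEN = 0x2000/4;

struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceProperties failed\n");
    return 1;
  }
  printf("shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
  printf("sizeof(LNet):            %zu bytes\n", sizeof(LNet));
  printf("fits: %s\n", sizeof(LNet) <= prop.sharedMemPerBlock ? "yes" : "no");
  return 0;
}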
I wonder why they needed /4 there - 0x1000 should be safe for every architecture, shouldn't it? AFAIK all devices support 48 KB of shared memory. Perhaps this is using a little bit more, due to the other shared structures?
> I wonder why they needed /4 there - 0x1000 should be safe for every architecture
To report more about this, on my GTX 1080 Ti (using WSL2, CUDA toolkit 12.3), I have tried:
- 0x2000
- 0x1000 (= 0x2000/2)
- 0x2000/3
- 0x2000/4
- 0x0500
- 0x0100
Only 0x2000/4 and 0x0500 worked.
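Translating those lengths into bytes suggests an answer to the question above. Quick arithmetic (my own, assuming an 8-byte Pair plus a 4-byte Port, i.e. 12 bytes per slot across the two buffers):

// lnet_sizes.c - bytes used by node_buf + vars_buf for each tried length.
#include <stdio.h>

int main(void) {
  const unsigned lens[] = {0x2000, 0x1000, 0x2000/3, 0x2000/4, 0x0500, 0x0100};
  for (unsigned i = 0; i < sizeof lens / sizeof *lens; i++) {
    unsigned long bytes = (unsigned long)lens[i] * 12; // 8 B Pair + 4 B Port
    printf("len 0x%04X -> %6lu bytes\n", lens[i], bytes);
  }
  return 0;
}
// Output:
// len 0x2000 ->  98304 bytes  (96 KB, well over the limit)
// len 0x1000 ->  49152 bytes  (exactly 48 KB: the two buffers alone fill the
//                              whole budget, leaving nothing for any other
//                              __shared__ data, so it overflows)
// len 0x0AAA ->  32760 bytes  (~32 KB; reportedly still failed, so size alone
//                              may not be the whole story)
// len 0x0800 ->  24576 bytes  (24 KB; worked)
// len 0x0500 ->  15360 bytes  (15 KB; worked)
// len 0x0100 ->   3072 bytes  (3 KB; failed, presumably too small rather than
//                              too big - consistent with "do not reduce too much")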
None of these values worked for me on my GTX 1050 Ti.
OS: CentOS Linux release 7.9.2009 (Core)
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
GPU: GTX 1050 Ti
GPU Driver: Nvidia open kernel modules v550.78
CUDA release 12.4, V12.4.131
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1050 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 58C P8 N/A / 72W | 2MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Test example:
$ time bend run-c sorter.bend
Result: 16646144
bend run-c sorter.bend 47.63s user 0.32s system 435% cpu 11.001 total
$ time bend run-cu sorter.bend
Errors:
1. Failed to parse result from HVM.
Output from HVM was:
"Failed to launch kernels. Error code: an illegal memory access was encountered.\n"
"exit status: 1"
bend run-cu sorter.bend 0.03s user 0.06s system 89% cpu 0.097 total
Did anyone manage to find L_NODE_LEN and L_VARS_LEN values that work for other GPUs?
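Not a definitive answer, but the numbers above suggest a rough starting point: query the card's per-block shared-memory limit, leave some headroom for HVM's other __shared__ data, and divide by the 12 bytes per slot assumed earlier. A sketch (my own heuristic, not from HVM; the 25% headroom figure is a guess):

// pick_len.cu - build with: nvcc pick_len.cu -o pick_len
#include <cstdio>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceProperties failed\n");
    return 1;
  }
  size_t budget = prop.sharedMemPerBlock * 3 / 4; // keep 25% headroom (guess)
  size_t len = budget / 12;                       // 8-byte Pair + 4-byte Port per slot
  while (len & (len - 1)) len &= len - 1;         // round down to a power of two
  printf("sharedMemPerBlock = %zu bytes\n", (size_t)prop.sharedMemPerBlock);
  printf("candidate L_NODE_LEN = L_VARS_LEN = 0x%zX\n", len);
  return 0;
}

On a 48 KB-per-block card this prints 0x800, i.e. exactly the 0x2000/4 that worked on the GTX 1080 Ti above. Any candidate still needs to be validated empirically, since too-small buffers also fail (see 0x0100 above).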
Originally from https://github.com/HigherOrderCO/Bend/issues/364 by user @ethanbarry
Description
When I run the compiled CUDA bitonic sorter example (linked in the README) I get this error:
Failed to launch kernels (error code an illegal memory access was encountered)!
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The program runs on the GPU.
Desktop (please complete the following information):
Additional context
The program runs using the C codegen backend, but with the CUDA backend, it seems to fail regardless of what I do. If anyone is curious about the prime-run command, it's really just a script that forces the dGPU to handle a task - nothing fancy.