enfiskutensykkel / ssd-gpu-dma

Build userspace NVMe drivers and storage applications with CUDA support

nvm-cuda-bench failed as "an illegal memory access was encountered" #32

Closed: sxzzsf closed this issue 11 months ago

sxzzsf commented 1 year ago
  1. Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
  2. Supermicro X12DPi-N6
  3. NVIDIA RTX A2000
  4. Samsung 980 Pro NVMe SSD
  5. Ubuntu 20.04.5 / 5.4.0-135-generic / cuda_12.0.0_525.60.13_linux

$ cmake .. -DCMAKE_BUILD_TYPE=Debug -Dnvidia_archs="86"
$ make identify module cuda-benchmark
$ sudo rmmod nvme
$ sudo make -C module load

$ deviceQuery
......
Device 0: "NVIDIA RTX A2000"
  CUDA Driver Version / Runtime Version        12.0 / 11.8
  CUDA Capability Major/Minor version number:  8.6
......

$ sudo ./bin/nvm-identify --ctrl=/dev/libnvm0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : 4d 14
PCI Subsystem Vendor ID : 4d 14
NVM Express version     : 1.3.0
Controller page size    : 4096
Max queue entries       : 16384
Serial Number           : S5GXNG0N905360M
Model Number            : Samsung SSD 980 PRO 1TB
Firmware revision       : 5B2QGXA7
Max data transfer size  : 524288
Max outstanding commands: 256
Max number of namespaces: 1
Current number of CQs   : 129
Current number of SQs   : 129

When running the CUDA benchmark, it fails with an illegal memory access:

$ sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
CUDA device           : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Unexpected error: Unexpected CUDA error: an illegal memory access was encountered
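For reference, the totals in that output follow directly from the settings: 32 threads × 32 chunks × 1 page per chunk gives 1024 pages, and each 4096-byte page spans eight 512-byte blocks. A small sketch of the arithmetic (variable names are illustrative, not the benchmark's own):

```cpp
#include <cstddef>

// Illustrative names mirroring the benchmark's printed settings above.
constexpr std::size_t threads         = 32;
constexpr std::size_t chunksPerThread = 32;
constexpr std::size_t pagesPerChunk   = 1;
constexpr std::size_t pageSize        = 4096; // controller page size (bytes)
constexpr std::size_t blockSize       = 512;  // namespace block size (bytes)

constexpr std::size_t totalPages  = threads * chunksPerThread * pagesPerChunk;
constexpr std::size_t totalBlocks = totalPages * (pageSize / blockSize);

static_assert(totalPages == 1024,  "matches 'Total number of pages'");
static_assert(totalBlocks == 8192, "matches 'Total number of blocks'");
```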

$ dmesg
[ 484.710982] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[ 484.710999] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Global Exception on (GPC 2, TPC 0, SM 0): Multiple Warp Errors
[ 484.711014] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics Exception: ESR 0x514730=0x201000e 0x514734=0x24 0x514728=0xc81eb60 0x51472c=0x1174
[ 484.711584] NVRM: Xid (PCI:0000:98:00): 43, pid=2037, name=nvm-cuda-bench, Ch 00000008

And when run under compute-sanitizer:

$ sudo /usr/local/cuda/bin/compute-sanitizer ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
========= COMPUTE-SANITIZER
CUDA device           : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
========= Invalid __local__ write of size 16 bytes
=========     at 0x3e0 in readSingleBuffered(QueuePair *, unsigned long, void *, void *, unsigned long, unsigned long, unsigned long *, CmdTime *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0xfffcd0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x302a52]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart798 [0x30e0b]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:cudaLaunchKernel [0x8cd0b]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/usr/local/cuda/include/cuda_runtime.h:216:cudaError cudaLaunchKernel(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x1fd21]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/tmp/tmpxft_00002ae3_00000000-6_main.cudafe1.stub.c:1:__device_stub__Z18readSingleBufferedP9QueuePairmPvS1_mmPmP7CmdTime(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fab2]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:306:readSingleBuffered(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fb2a]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:450:launchNvmKernel(Controller const&, std::shared_ptr, Settings const&, cudaDeviceProp const&) [0x1dd7f]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:698:main [0x1ee3a]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:../csu/libc-start.c:342:__libc_start_main [0x24083]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x1bf8e]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
<...... Same trace from GPU thread 1 to 31 ......>

========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaHostUnregister.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.
<...... host backtrace omitted ......>

Unexpected error: Unexpected CUDA error: unspecified launch failure
========= Target application returned an error
========= ERROR SUMMARY: 42 errors

enfiskutensykkel commented 1 year ago

Hi,

Thanks for trying this library. Which branch are you running?

I suspect it might be an issue with ringing the doorbell register on the NVMe. I'm not familiar with the A2000, but is it possible to verify that it is able to do GPUDirect Async? (Note that this feature is different from GPUDirect RDMA.)
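For what it's worth, one rough way to probe this is to check whether CUDA will accept a mapping of PCI IO memory at all, which is what the doorbell writes rely on. A minimal sketch, assuming a root shell; the BAR path and mapping size are chosen for illustration only, not taken from the library:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cuda_runtime.h>

int main()
{
    // Illustrative path: the NVMe controller's first BAR exposed by sysfs.
    const char* bar = "/sys/bus/pci/devices/0000:01:00.0/resource0";
    int fd = open(bar, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    void* io = mmap(nullptr, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (io == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // cudaHostRegisterIoMemory asks the driver to map IO memory into the
    // GPU's address space; failure here suggests the GPU/driver combination
    // cannot do GPUDirect Async-style doorbell writes.
    cudaError_t err = cudaHostRegister(io, 0x1000, cudaHostRegisterIoMemory);
    printf("cudaHostRegister(IoMemory): %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess) {
        void* dev = nullptr;
        err = cudaHostGetDevicePointer(&dev, io, 0);
        printf("device pointer: %p (%s)\n", dev, cudaGetErrorString(err));
        cudaHostUnregister(io);
    }
    munmap(io, 0x1000);
    close(fd);
    return 0;
}
```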

Can you print the pointer registered by the call to cudaHostRegister in main.cu and compare it to the illegal or out-of-bounds address reported by compute-sanitizer (specifically this line: `Address 0xfffcd0 is out of bounds`)? I assume the address may change between runs, so you might need to compare the printed address and the reported address from the same run. https://github.com/enfiskutensykkel/ssd-gpu-dma/blob/master/benchmarks/cuda/main.cu#L690
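The kind of debug print being asked for might look like the following (a sketch: `hostPtr` stands for whatever pointer main.cu registers around line 690, and cudaHostRegister itself only returns an error code, so the useful values are the registered host pointer and its device-side alias):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the registered host pointer and its device
// alias so both can be compared against the sanitizer's out-of-bounds
// address (e.g. "Address 0xfffcd0 is out of bounds") from the same run.
static void dumpRegisteredPointer(void* hostPtr)
{
    void* devPtr = nullptr;
    cudaError_t err = cudaHostGetDevicePointer(&devPtr, hostPtr, 0);
    fprintf(stderr, "host ptr %p -> device ptr %p (%s)\n",
            hostPtr, devPtr, cudaGetErrorString(err));
}
```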

Also, I think there are some calculation issues with regard to offsets that I never got around to fixing, so you can pass `--threads 1`, `--chunks 1` and `--pages 1` to simplify what's going on.

sxzzsf commented 1 year ago

Hi, following your comments, I tested the benchmark with various options and found two unexpected results:

  1. timeout (after waiting 10 seconds) when run with `--threads=1 --chunks=1 --pages=x` (x = 1...15, 17...31, 33...47, ..., i.e. x not a multiple of 16)

  2. illegal memory access when run with `--threads=1 --chunks=1 --pages=y` (y = 16, 32, 48, ..., i.e. multiples of 16). If the NVMe disk (Samsung 970 Pro 1TB) is filled with a 0xcc pattern, the qp structure is entirely overwritten with 0xcc after waitForIoCompletion() in readSingleBuffered() (see the sketch below).
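As an illustration of what that 0xcc observation amounts to, a host-side check along the following lines (a sketch: `devQueuePair`, the header path, and a host-copyable QueuePair layout are all assumptions) would quantify how much of the queue-pair state the DMA has clobbered:

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include "queue.h" // QueuePair from the benchmark sources (header name assumed)

// Hypothetical check: after the kernel returns, count how many bytes of
// the QueuePair now carry the 0xcc pattern the disk was filled with.
// devQueuePair is assumed to be the device-visible QueuePair the kernel used.
static void reportSentinelOverwrite(const QueuePair* devQueuePair)
{
    QueuePair hostCopy;
    cudaMemcpy(&hostCopy, devQueuePair, sizeof(QueuePair),
               cudaMemcpyDeviceToHost);

    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(&hostCopy);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < sizeof(QueuePair); ++i)
        hits += (bytes[i] == 0xcc);

    fprintf(stderr, "%zu / %zu QueuePair bytes match the 0xcc fill pattern\n",
            hits, sizeof(QueuePair));
}
```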

enfiskutensykkel commented 1 year ago

Hmm, this is kind of strange. Maybe it is related to #29? Can you please try the sisci-5.11 branch? I believe it is more accurate with regard to calculating offsets, and there may be some other fixes there as well; it's been a while since I worked on this.

Also, if this doesn't work then there might be a different issue. You can use the nvm-latency-bench benchmark (preferably from the sisci-5.11 branch) to do some debugging, by specifying --gpu, --infile and --outfile (and --verify). It's a bit tricky to use, since you need to specify queues, but something like `--queue="no=3,depth=1"` should work. The timeout issue may be caused by the thread waiting indefinitely for NVMe command completions that never arrive, i.e., the disk is not writing to GPU memory.

sxzzsf commented 1 year ago

Switched to the sisci-5.11 branch. nvm-cuda-bench gives the same test results as on the master branch. And the `./bin/nvm-latency-bench --ctrl /dev/libnvm2 --count 1 --queue no=1,cmds=1,prps=1 --infile ~/random2.dat --verify --gpu 0` test passed on host memory.

enfiskutensykkel commented 11 months ago

This should be solved by the fix from @angletzlh.