Closed: sxzzsf closed this issue 11 months ago.
Hi,
Thanks for trying this library. Which branch are you running?
I suspect that it might be an issue with ringing the doorbell register on the NVMe. I'm not familiar with the A2000, but is it possible to verify that it is able to do GPUDirect Async? (Note that this feature is different from GPUDirect RDMA.)
Can you print the pointer returned by the call to cudaHostRegister in main.cu and compare it to the illegal or out-of-bounds address reported by compute-sanitizer (specifically this line: Address 0xfffcd0 is out of bounds)? I assume the address might change between runs, so you might need to compare the printed address and the reported address from the same run.
https://github.com/enfiskutensykkel/ssd-gpu-dma/blob/master/benchmarks/cuda/main.cu#L690
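For reference, a minimal standalone sketch of that check (this is not the benchmark's code; the buffer, size, and flags below are placeholders): print both the registered host pointer and the device-side alias obtained with cudaHostGetDevicePointer, then compare the latter with the address compute-sanitizer reports from the same run.

```cuda
// Minimal sketch, not the benchmark's code: register a host buffer, obtain the
// device-side alias, and print both pointers for comparison with the
// "Address ... is out of bounds" value that compute-sanitizer reports.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t size = 64 * 1024;
    void* hostPtr = nullptr;

    // cudaHostRegister generally wants page-aligned memory.
    if (posix_memalign(&hostPtr, 4096, size) != 0) {
        return 1;
    }

    cudaError_t err = cudaHostRegister(hostPtr, size, cudaHostRegisterMapped);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    void* devPtr = nullptr;
    err = cudaHostGetDevicePointer(&devPtr, hostPtr, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host pointer   = %p\n", hostPtr);
    printf("device pointer = %p\n", devPtr);  // compare against the sanitizer's address

    cudaHostUnregister(hostPtr);
    free(hostPtr);
    return 0;
}
```

Comparing the two values from the same run shows whether the faulting address falls inside the registered range.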
Also, I think there are some calculation issues wrt offsets that I never got around to fixing, so you can provide --threads 1, --chunks 1 and --pages 1 to simplify what's going on.
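To illustrate why that helps (purely a sketch; the variable names below are assumptions, not the benchmark's), the starting block of each command is typically a product of these parameters, so setting them all to 1 collapses the offset arithmetic to zero:

```cuda
// Illustrative offset arithmetic only, not the benchmark's actual code.
// With numThreads = chunksPerThread = pagesPerChunk = 1 every start block
// is 0, taking the per-thread/per-chunk offset math out of the picture.
#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t pageSize = 4096, blockSize = 512;
    const uint64_t numThreads = 1, chunksPerThread = 1, pagesPerChunk = 1;

    const uint64_t blocksPerPage  = pageSize / blockSize;            // 8
    const uint64_t blocksPerChunk = pagesPerChunk * blocksPerPage;   // 8

    for (uint64_t thread = 0; thread < numThreads; ++thread) {
        for (uint64_t chunk = 0; chunk < chunksPerThread; ++chunk) {
            const uint64_t startBlock =
                (thread * chunksPerThread + chunk) * blocksPerChunk;
            printf("thread %llu chunk %llu -> start block %llu\n",
                   (unsigned long long) thread,
                   (unsigned long long) chunk,
                   (unsigned long long) startBlock);
        }
    }
    return 0;
}
```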
Hi, following your comments, I tested the benchmark with various options and found two unexpected results:
1. timeout (waits for 10 seconds) when run with --threads=1 --chunks=1 --pages=x (x = 1...15, 17...31, 33...47, ...)
2. illegal memory access when run with --threads=1 --chunks=1 --pages=y (y = 16, 32, 48, ...)
Also, if the NVMe disk (Samsung 970 Pro 1TB) is filled with a 0xcc pattern, the qp structure is entirely overwritten with 0xcc after waitForIoCompletion() in readSingleBuffered().
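One way to make that observation reproducible from the host side is sketched below; d_qp and QP_BYTES are stand-ins, not the benchmark's identifiers. The idea is to hex-dump the queue-pair memory after the kernel returns: if it contains the disk's 0xcc fill pattern, the read commands' data pointers are landing on the queue memory instead of the destination buffer.

```cuda
// Hedged debugging sketch: symbol names (d_qp, QP_BYTES) are illustrative.
// Copy a slice of the device-side queue-pair memory back to the host after
// the kernel has finished and hex-dump it.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

static void dumpDeviceMemory(const void* devicePtr, size_t bytes)
{
    uint8_t snapshot[64];
    if (bytes > sizeof(snapshot)) {
        bytes = sizeof(snapshot);
    }
    if (cudaMemcpy(snapshot, devicePtr, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed\n");
        return;
    }
    for (size_t i = 0; i < bytes; ++i) {
        printf("%02x%c", snapshot[i], (i % 16 == 15) ? '\n' : ' ');
    }
    printf("\n");
}

int main()
{
    // Stand-in for the queue-pair allocation: fill device memory with 0xcc
    // to show what an overwritten structure would look like in the dump.
    void* d_qp = nullptr;
    const size_t QP_BYTES = 64;
    cudaMalloc(&d_qp, QP_BYTES);
    cudaMemset(d_qp, 0xcc, QP_BYTES);

    dumpDeviceMemory(d_qp, QP_BYTES);  // in the benchmark: call right after cudaDeviceSynchronize()

    cudaFree(d_qp);
    return 0;
}
```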
Hmm, this is kinda strange. Maybe this is related to #29? Can you please try the sisci-5.11 branch? I believe it is more accurate with respect to calculating offsets, and there may be some other fixes there as well; it's been a while since I worked on this.
Also, if this doesn't work, then there might be a different issue. You can use the nvm-latency-bench benchmark (preferably from the sisci-5.11 branch) to do some debugging, by specifying --gpu, --infile and --outfile (and --verify). It's a bit tricky to use, since you need to specify queues, but something like `--queue="no=3,depth=1"` should work. The timeout issue may be caused by the thread waiting indefinitely for NVMe command completions that never arrive, i.e., the disk is not writing to GPU memory.
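For what it's worth, the kind of bounded wait that would surface this as an error rather than a hang can be sketched as follows; SimpleCompletion and its field layout are simplified stand-ins rather than libnvm's actual types.

```cuda
// Hedged sketch of a bounded completion wait. The point is only that capping
// the number of polls turns "the disk never posted a completion" into a
// reportable error instead of an indefinite hang.
#include <cstdint>
#include <cstdio>

struct SimpleCompletion {
    uint32_t dword[4];   // dword[3] bit 16 carries the NVMe phase tag
};

__host__ __device__ static bool
waitForCompletionBounded(const volatile SimpleCompletion* cq,
                         uint32_t head,
                         uint8_t expectedPhase,
                         uint64_t maxSpins)
{
    for (uint64_t spin = 0; spin < maxSpins; ++spin) {
        const uint32_t dw3 = cq[head].dword[3];
        const uint8_t phase = (dw3 >> 16) & 0x1;
        if (phase == expectedPhase) {
            return true;   // completion arrived
        }
    }
    return false;          // gave up: nothing was ever written to the queue
}

int main()
{
    SimpleCompletion cq[1] = {};                            // phase tag stays 0
    const bool ok = waitForCompletionBounded(cq, 0, 1, 1000);
    printf("completion observed: %s\n", ok ? "yes" : "no"); // prints "no"
    return 0;
}
```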
Switched to the sisci-5.11 branch. nvm-cuda-bench gives the same test result as on the master branch. And ./bin/nvm-latency-bench --ctrl /dev/libnvm2 --count 1 --queue no=1,cmds=1,prps=1 --infile ~/random2.dat --verify --gpu 0: the test passed on host memory.
This should be solved by the fix from @angletzlh.
$ cmake .. -DCMAKE_BUILD_TYPE=Debug -Dnvidia_archs="86"
$ make identify module cuda-benchmark
$ sudo rmmod nvme
$ sudo make -C module load
$ deviceQuery
......
Device 0: "NVIDIA RTX A2000"
  CUDA Driver Version / Runtime Version : 12.0 / 11.8
  CUDA Capability Major/Minor version number : 8.6
......
$ sudo ./bin/nvm-identify --ctrl=/dev/libnvm0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : 4d 14
PCI Subsystem Vendor ID : 4d 14
NVM Express version     : 1.3.0
Controller page size    : 4096
Max queue entries       : 16384
Serial Number           : S5GXNG0N905360M
Model Number            : Samsung SSD 980 PRO 1TB
Firmware revision       : 5B2QGXA7
Max data transfer size  : 524288
Max outstanding commands: 256
Max number of namespaces: 1
Current number of CQs   : 129
Current number of SQs   : 129
When run:
$ sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
CUDA device           : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Unexpected error: Unexpected CUDA error: an illegal memory access was encountered

dmesg
[  484.710982] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[  484.710999] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Global Exception on (GPC 2, TPC 0, SM 0): Multiple Warp Errors
[  484.711014] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics Exception: ESR 0x514730=0x201000e 0x514734=0x24 0x514728=0xc81eb60 0x51472c=0x1174
[  484.711584] NVRM: Xid (PCI:0000:98:00): 43, pid=2037, name=nvm-cuda-bench, Ch 00000008

And if run under compute-sanitizer:
$ sudo /usr/local/cuda/bin/compute-sanitizer ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
========= COMPUTE-SANITIZER
CUDA device           : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
========= Invalid __local__ write of size 16 bytes
=========     at 0x3e0 in readSingleBuffered(QueuePair *, unsigned long, void *, void *, unsigned long, unsigned long, unsigned long *, CmdTime *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0xfffcd0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x302a52]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart798 [0x30e0b]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:cudaLaunchKernel [0x8cd0b]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/usr/local/cuda/include/cuda_runtime.h:216:cudaError cudaLaunchKernel(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x1fd21]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/tmp/tmpxft_00002ae3_00000000-6_main.cudafe1.stub.c:1:__device_stub__Z18readSingleBufferedP9QueuePairmPvS1_mmPmP7CmdTime(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fab2]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:306:readSingleBuffered(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fb2a]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:450:launchNvmKernel(Controller const&, std::shared_ptr, Settings const&, cudaDeviceProp const&) [0x1dd7f]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:698:main [0x1ee3a]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
=========     Host Frame:../csu/libc-start.c:342:__libc_start_main [0x24083]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x1bf8e]
=========                in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
<...... Same trace from GPU thread 1 to 31 .......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaHostUnregister.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.
<...... host backtrace omitted ......>
Unexpected error: Unexpected CUDA error: unspecified launch failure
========= Target application returned an error
========= ERROR SUMMARY: 42 errors