NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

"gdrcopy_sanity" failed with 555-open driver on Grace-Hopper #299

Status: Closed (hanawa closed this issue 4 months ago)

hanawa commented 4 months ago
$ gdrcopy_sanity
Assertion "(gdr_pin_buffer(g, d_A[1], buffer_size, 0, 0, &A_mh[1])) == (0)" failed at sanity.cpp:446
Assertion "(gdr_pin_buffer(g, d_A, A_size, 0, 0, &A_mh)) == (0)" failed at sanity.cpp:354

At the same time, the following messages appear in syslog:

[ 1222.559226] NVRM: Invalid argument in nv_p2p_get_pages,address or length are not aligned address=0xfffd39e00000, length=0x10008
[ 1222.559230] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=fffd39e00000 len=65544 p2p_token=0 va_space=0 callback=ffffc080a8de0230) failed [ret = -22]
[ 1222.857639] NVRM: Invalid argument in nv_p2p_get_pages,address or length are not aligned address=0xfffd39e10000, length=0x20200
[ 1222.857642] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=fffd39e10000 len=131584 p2p_token=0 va_space=0 callback=ffffc080a8de0230) failed [ret = -22]
[ 1227.696901] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=fffd39e00000 len=65536 p2p_token=0 va_space=fe00 callback=ffffc080a8de0230) failed [ret = -22]

With the CUDA driver 550-open, gdrcopy_sanity passes, but NVRM logs assertion failures:

$ gdrcopy_sanity
Total: 28, Passed: 28, Failed: 0, Waived: 0
[  162.732145] NVRM: nvAssertFailedNoLog: Assertion failed: !(length & (NVRM_P2P_PAGESIZE_BIG_64K - 1)) @ p2p.c:654
[  163.030820] NVRM: nvAssertFailedNoLog: Assertion failed: !(length & (NVRM_P2P_PAGESIZE_BIG_64K - 1)) @ p2p.c:654
[  167.864091] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=fffd59e00000 len=65536 p2p_token=0 va_space=fe00 callback=ffffc060de4e0230) failed [ret = -22]
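
The NVRM messages above complain that the address or length is not aligned, and the 550-open assertion tests `length & (NVRM_P2P_PAGESIZE_BIG_64K - 1)`. The following is a minimal sketch of that check, applied to the addresses and lengths copied from the logs. It assumes a 64 KiB alignment, as the macro name suggests. The helper names `is_aligned` and `align_up` are hypothetical; they are not part of gdrcopy or the NVIDIA driver, and `align_up` is only an illustration of rounding, not the fix that was applied in gdrdrv.

```python
# Assumed alignment: 64 KiB, inferred from NVRM_P2P_PAGESIZE_BIG_64K in the log.
GPU_PAGE_SIZE = 64 * 1024


def is_aligned(value: int, alignment: int = GPU_PAGE_SIZE) -> bool:
    """Return True if value is a multiple of alignment (a power of two)."""
    return value & (alignment - 1) == 0


def align_up(value: int, alignment: int = GPU_PAGE_SIZE) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


# (va, len) pairs copied from the syslog lines of the 555-open run.
requests = [
    (0xFFFD39E00000, 0x10008),  # len=65544
    (0xFFFD39E10000, 0x20200),  # len=131584
    (0xFFFD39E00000, 0x10000),  # len=65536, the va_space=fe00 request
]

for va, length in requests:
    print(
        f"va={va:#x} len={length:#x} "
        f"va_aligned={is_aligned(va)} len_aligned={is_aligned(length)} "
        f"len_rounded_up={align_up(length):#x}"
    )
```

Under this assumption, both addresses are 64 KiB aligned, while the lengths 0x10008 and 0x20200 are not, which matches the "address or length are not aligned" message. The third request (len=65536) is aligned, so its -22 return is not explained by alignment.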

pakmarkthub commented 4 months ago

Hi @hanawa,

This is a known issue. We have already fixed it in the master branch. May I ask you to try it?

hanawa commented 4 months ago

Hi @pakmarkthub, I got this result with commit hash 'bb13928' on the latest master branch. Which version should I revert to?

hanawa commented 4 months ago

Please note that page size is 64K.

pakmarkthub commented 4 months ago

Thank you for your report. The fix was kept in the internal repository. I have just pushed it to the master branch. The current commit ID is 1366e20d140c5638fcaa6c72b373ac69f7ab2532. May I ask you to try again? You will need to recompile and reload gdrdrv.ko.

hanawa commented 4 months ago

It works without errors! However, an error message was still recorded in syslog:

[106887.190164] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=fffd59e00000 len=65536 p2p_token=0 va_space=fe00 callback=ffffc080cac60230) failed [ret = -22]

pakmarkthub commented 4 months ago

This is expected. It comes from one of our negative unit tests (https://github.com/NVIDIA/gdrcopy/blob/master/tests/sanity.cpp#L1812-L1889).