NVIDIA / gds-nvidia-fs

NVIDIA GPUDirect Storage Driver

insmod worked but dmesg shows (nvidia-fs:write IO failed :-512) #26

Closed singhsaluja closed 11 months ago

singhsaluja commented 1 year ago

After a lot of struggle, I was able to build gds-nvidia-fs-2.17.0 on RHEL 9.2 (kernel 5.14.0-284.11.1.el9_2.x86_64) with nvidia-driver 525.89.02. make completed without errors, and insmod nvidia-fs.ko didn't throw any errors either.

[192745.286125] nvidia_fs: Initializing nvfs driver module
[192745.286136] nvidia_fs: registered correctly with major number 510

But when writing a file with the gdsio utility to storage (VAST), which has an rpcrdma driver installed, the throughput was not what I expected, and dmesg shows:

[Sat Sep  9 20:24:26 2023] nvidia-fs:write IO failed :-512
[Sat Sep  9 20:24:26 2023] nvidia-fs:write IO failed :-512
[Sat Sep  9 20:24:26 2023] nvidia-fs:write IO failed :-512
[Sat Sep  9 20:24:26 2023] nvidia-fs:write IO failed :-512
[Sat Sep  9 20:24:58 2023] nvidia-fs:write IO failed :-512
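
For reference, 512 is ERESTARTSYS in the kernel's internal errno range (include/linux/errno.h). That code means a system call was interrupted by a signal and should be restarted, and it is not supposed to reach user space. This suggests the writes were interrupted rather than rejected outright, though that is an inference and not a diagnosis. Below is a minimal sketch for tallying these lines from captured dmesg text; the function name and the regular expression are assumptions based only on the log format shown above, not part of nvidia-fs or the GDS tools.

```python
import errno
import re
from collections import Counter

# Kernel-internal codes from include/linux/errno.h; these are not in Python's errno module.
KERNEL_INTERNAL_ERRNO = {
    512: "ERESTARTSYS",
    513: "ERESTARTNOINTR",
    514: "ERESTARTNOHAND",
    515: "ENOIOCTLCMD",
    516: "ERESTART_RESTARTBLOCK",
}

# Assumed pattern, modeled on lines like "nvidia-fs:write IO failed :-512".
FAILURE_RE = re.compile(r"nvidia-fs:(\w+) IO failed :(-?\d+)")


def summarize_failures(dmesg_text):
    """Count nvidia-fs IO failures per (operation, code, symbolic name)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = FAILURE_RE.search(line)
        if not match:
            continue
        op = match.group(1)
        code = abs(int(match.group(2)))
        # Prefer the kernel-internal table, then fall back to the regular errno names.
        name = KERNEL_INTERNAL_ERRNO.get(code) or errno.errorcode.get(code, "unknown")
        counts[(op, code, name)] += 1
    return counts
```

Passing the saved output of dmesg to summarize_failures gives one counter entry per distinct operation and error code, which makes it easier to see whether every failure is the same -512 or whether other codes are mixed in.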

FWIW, the gdscheck.py utility reports that NFS is supported:

./gdscheck.py -p
 GDS release version: 1.7.2.10
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64

NFS                : Supported

Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0

I am unsure how to debug this. Any leads would be really appreciated. Thank you!

wakaba-best commented 1 year ago

@singhsaluja Would this be helpful to you? The setup there is different, though (Ubuntu and local NVMe). https://github.com/NVIDIA/gds-nvidia-fs/issues/4#issuecomment-1537336047