I'm not sure to be honest. My initial suspicion is that there may be some issue with peer-to-peer. Does this only happen with the CUDA benchmark? Does the `nvm-latency-bench` program work when using a GPU buffer?
So when I run nvm-latency-bench like:
./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=1 --queue 'no=1,depth=32' --outer=1 --inner=5
it seems to be fine.
So I guess it only happens with the CUDA benchmark.
I just tested on a completely different system with a different SSD and it still doesn't work. Are you doing some sort of math with the number of chunks, the number of pages per chunk, and the number of threads for some NVMe setup in this benchmark that requires the total number of pages requested to be some specific value? Because I see the same issue when I run with chunks=1 pages=1 threads=5.
It seems that you ran `nvm-latency-bench` without a GPU argument. By default it will simply allocate the buffer in RAM. Please try adding `--gpu=0` to the command line arguments, and additionally the `--verify` option.
As for the calculations of chunks, pages and threads: yes, it definitely is a bit iffy. I haven't tested it properly and I suspect that there may be some bugs/issues with the offsets there. I've usually tested with powers of two for the number of threads; with the number of pages and chunks set to 1 it should be okay.
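To make this concrete, the kind of per-thread offset arithmetic in question looks roughly like the sketch below. The names and formula here are only illustrative (they are not the benchmark's actual code), but they show where an off-by-one or an overlapping range could sneak in when the parameters aren't powers of two:

```cuda
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t pageSize        = 4096; // controller page size (assumed)
    const uint64_t numThreads      = 5;    // e.g. the failing threads=5 case
    const uint64_t chunksPerThread = 1;
    const uint64_t pagesPerChunk   = 1;

    // Each thread should land on a disjoint page range. If the stride below
    // is rounded, truncated, or computed from a different variable than the
    // one used to size the buffer, ranges can overlap or run past the end.
    for (uint64_t t = 0; t < numThreads; ++t)
    {
        uint64_t firstPage = t * chunksPerThread * pagesPerChunk;
        uint64_t lastPage  = firstPage + chunksPerThread * pagesPerChunk;
        printf("thread %2" PRIu64 ": pages [%" PRIu64 ", %" PRIu64 ") "
               "bytes [%" PRIu64 ", %" PRIu64 ")\n",
               t, firstPage, lastPage,
               firstPage * pageSize, lastPage * pageSize);
    }
    return 0;
}
```

Printing the ranges like this for a failing parameter combination would show immediately whether two threads end up on the same pages or run past the end of the buffer.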
Is there anything showing up in the system log (`dmesg`) when you run the CUDA benchmark?
So when I run:
./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=1 --queue 'no=1,depth=32' --outer=1 --inner=5 --gpu=6 --verify --infile=Makefile
I get the following output:
Resetting controller... DONE
Preparing queues... DONE
Reading input file... DONE
Preparing buffers and transfer lists... DONE
Running latency benchmark (reading, sequential, 1 iterations)... DONE
Verifying buffers... DONE
Calculating percentiles...
Queue #01 read percentiles (1 samples)
bandwidth, adj iops, cmd latency, prp latency
max: 78.745, 19224.928, 52.016, 52.016
0.99: 0.000, 0.000, 0.000, 0.000
0.97: 0.000, 0.000, 0.000, 0.000
0.95: 0.000, 0.000, 0.000, 0.000
0.90: 0.000, 0.000, 0.000, 0.000
0.75: 0.000, 0.000, 0.000, 0.000
0.50: 0.000, 0.000, 0.000, 0.000
0.25: 0.000, 0.000, 0.000, 0.000
0.10: 0.000, 0.000, 0.000, 0.000
0.05: 0.000, 0.000, 0.000, 0.000
0.01: 0.000, 0.000, 0.000, 0.000
min: 78.745, 19224.928, 52.016, 52.016
End percentiles
OK!
If I remove the --verify and the --infile options I get the following output:
Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... DONE
Running latency benchmark (reading, sequential, 1 iterations)... DONE
Calculating percentiles...
Queue #01 read percentiles (1 samples)
bandwidth, adj iops, cmd latency, prp latency
max: 79.029, 19294.069, 51.829, 51.829
0.99 through 0.01: the same overflowed value (286523866955980473…419224100864.000) in all four columns, repeated for every percentile
min: 79.029, 19294.069, 51.829, 51.829
End percentiles
OK!
No messages in dmesg.
Have you been able to replicate the issue?
As for the calculations of chunks, pages and threads: yes, it definitely is a bit iffy. I haven't tested it properly and I suspect that there may be some bugs/issues with the offsets there.
Are you using these for the creation of the queue memory? For the PRP list memory?
By the way, I am using the code from the master branch. I have also tried the other branch, and it gives the same result.
Hi,
I am at my cabin right now so I don't have any access to hardware at the moment. I will be back at the office at the beginning of next week.
So when I run: ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=1 --queue 'no=1,depth=32' --outer=1 --inner=5 --gpu=6 --verify --infile=Makefile
I get the following output:
It's good that using a GPU with the verify option works; that rules out any issues with PCIe peer-to-peer. `nvm-latency-bench` defaults to using namespace NSID 1 as well.
Just to explain what's going on here:
- With the `--gpu` option, the memory buffer is hosted on the GPU, so the disk is DMAing directly to GPU memory.
- With the `--verify` option combined with the `--infile` option, the allocated memory is first set to 0 before reading from disk. After reading from disk, the program compares the content of the memory buffer to the contents of the input file.

So I'm fairly sure that at least that works.
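In pseudo-form, the verify path boils down to something like the sketch below. This is not the program's actual code; `readFromDisk()` is just a stub standing in for the libnvm read path, and the buffer size is arbitrary:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

// Stand-in for the NVMe read that DMAs directly into the GPU buffer.
// In the real benchmark the disk writes here via peer-to-peer; for this
// illustration we just fill the buffer with a known pattern.
static void readFromDisk(void* gpuBuffer, size_t size)
{
    cudaMemset(gpuBuffer, 0xAB, size);
}

int main()
{
    const size_t size = 4096; // one controller page, for illustration

    void* gpuBuffer = nullptr;
    cudaMalloc(&gpuBuffer, size);

    // 1. Zero the GPU buffer so stale data can't pass as a successful read.
    cudaMemset(gpuBuffer, 0, size);

    // 2. "Read" from disk straight into GPU memory.
    readFromDisk(gpuBuffer, size);

    // 3. Copy back to the host and compare against the expected contents
    //    (in the benchmark, this would be the contents of --infile).
    std::vector<unsigned char> fromGpu(size);
    cudaMemcpy(fromGpu.data(), gpuBuffer, size, cudaMemcpyDeviceToHost);

    std::vector<unsigned char> expected(size, 0xAB);
    printf("verify: %s\n",
           memcmp(fromGpu.data(), expected.data(), size) == 0 ? "OK" : "MISMATCH");

    cudaFree(gpuBuffer);
    return 0;
}
```

Since the buffer is zeroed first, a passing verification means the data really did come from the disk.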
If I remove the --verify and the --infile options I get the following output:
As for the weird percentiles output, that's a known bug. It's caused by an arithmetic overflow because the number of repetitions is too low. I'll fix this at some point, but it's just an annoyance so it's not a high priority.
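Just to illustrate the kind of thing that goes wrong with a single sample (this is not the benchmark's actual percentile code, only an example of how an unsigned wrap-around can end up reading garbage and printing values like the ones above):

```cuda
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> samples = { 52.016 }; // a single latency sample

    const double percentile = 0.99;

    // With one sample, n * p truncates to 0, and subtracting 1 from an
    // unsigned zero wraps around to SIZE_MAX: an out-of-range index.
    size_t index = static_cast<size_t>(samples.size() * percentile) - 1;
    printf("n = %zu, index for p=%.2f: %zu\n", samples.size(), percentile, index);

    // A bounds check avoids reading whatever happens to be in memory there:
    if (index >= samples.size())
        printf("not enough samples for this percentile\n");
    else
        printf("p%.2f = %f\n", percentile, samples[index]);

    return 0;
}
```

Running with more iterations, so that there is more than one sample, should make the percentile rows come out sane.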
Are you using these for the creation of the queue memory? For the PRP list memory?
Yes. There might be something buggy going on there. But with chunks and pages set to 1, the calculation should be fairly straightforward, so any calculation errors there shouldn't come into play. Have you tried setting the number of threads to 1?
Out of interest, what GPU and disk are you using? Maybe I can try to reproduce when I get back. Can you confirm that the GPU supports GPUDirect Async?
Is this issue by any chance related to your experience from #25 ?
I have tried 1 thread 1 chunk and 1 page and that's fine.
I guess it's the same issue as #25
I see. I'll try to reproduce the issue later this week then. Thank you for reporting it. At this point, I believe some offset calculation is very likely the culprit.
I believe I have confirmed that there is some issue with illegal memory accessing for some parameters. I will try to look into it as soon as I am able to.
If/when you know where this is happening (or with what structure) could you please let me know? Thanks.
Hi,
Any updates on this? Thanks.
I'm currently unable to reproduce this. I've tried the different combinations applied in #25, but it seems to work. However, I see that in my SISCI branch I've made a restriction that only allows thread counts that are a power of two. The bug I thought I was able to reproduce turned out to be something unrelated.
Ok thank you for looking into it. Could it maybe be a BIOS feature I need to enable/disable?
Other details that may be of use: I am running this on Linux kernel version 4.15. I have been able to reproduce this error on two machines (with two different SSDs with different controllers and two different GPUs). May I ask what SSD you use for testing?
I'll look into it some more; I haven't ruled out that there is some form of alignment/overlap issue that doesn't happen on my system but may happen on other systems. I'm not aware of any BIOS setting that might affect it.
In the past I've tested with the Samsung Evo 960 and 970, some non-Optane Intel disks whose model names I don't recall at the moment, the Intel Optane 900P and the Intel Optane 4800X. I've only used the two Optane disks when trying to reproduce this issue though, so I can try using one of the other disks.
I'll see what I can do in order to try to reproduce it, but I'm pretty swamped with other stuff for the next couple of weeks. Just out of curiosity, have you tried both branches? You may have to run `make clean` and even `cmake` again after switching to the other branch.
Yes I have tried both branches and I get the same result.
One more question: what distro, kernel version, CUDA version, and NVIDIA driver version do you use?
I've tested CentOS and Fedora in the past and with different CUDA versions (and it has worked), but I've tried replicating this issue using Ubuntu 18.04.2 with CUDA 10.1. The driver version is 418.40.04, as reported by `nvidia-smi`.
To clarify, it's just some combinations of arguments that hang, right? Or does everything hang now?
Yes, it is just some combinations; many others work. But I am concerned that since some combinations don't work, there is something fundamentally wrong going on (like some overlap or miscalculated indices).
I totally understand, and I will try to look into it when I have the time.
For those combinations that appear to work, it should be possible to verify the output from disk by using the `--output` (or maybe it's `--outfile`) option, which will write what is read to a file. You could then use another program such as `nvm-latency-bench`, also with the `--output` option, to read out a matching number of pages and write them to a file. Note that in the `sisci-5.11` branch, you need to comment back in the for-loop in the `moveBytes` function in `main.cu` (https://github.com/enfiskutensykkel/ssd-gpu-dma/blob/sisci-5.11/benchmarks/cuda/main.cu#L85).
While this does not guarantee that ranges aren't overlapping, it at least should provide some sort of confirmation that the entire range is covered and that all chunks are read at least once. Maybe not very reassuring, but it at least confirms that data is read from the disk.
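If you want something stronger than eyeballing the dumps, a byte-for-byte comparison of the two output files (equivalent to just running cmp on them; the file names below are only examples) confirms that they match:

```cuda
#include <cstdio>
#include <vector>

// Read an entire file into memory.
static std::vector<unsigned char> readAll(const char* path)
{
    std::vector<unsigned char> data;
    FILE* f = fopen(path, "rb");
    if (!f) { perror(path); return data; }
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        data.insert(data.end(), buf, buf + n);
    fclose(f);
    return data;
}

int main()
{
    auto a = readAll("cuda-bench.out");     // dump written by the CUDA benchmark
    auto b = readAll("latency-bench.out");  // dump written by nvm-latency-bench

    if (a.size() != b.size())
    {
        printf("size mismatch: %zu vs %zu bytes\n", a.size(), b.size());
        return 1;
    }
    for (size_t i = 0; i < a.size(); ++i)
    {
        if (a[i] != b[i])
        {
            printf("first mismatch at byte offset %zu\n", i);
            return 1;
        }
    }
    printf("files match (%zu bytes)\n", a.size());
    return 0;
}
```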
So that is what I have been doing: I write data to the disk and then use the CUDA benchmark with the output flag, and I get correct values, at least as far as I remember, for the combinations that work. The data read is correct. I will do some more extensive testing over the weekend.
I tested the output for configs that work and the output seems to be correct.
Is there a limit on how many DMA mappings can be created using GPUDirect?
There's no limitation on the number of mappings, but most GPUs have a limitation on how much memory can be "pinned" and exposed to third-party devices. This limitation is usually around 128-256 MB. Depending on the GPU, there is also the memory alignment requirement (pinned memory must be 64 KB aligned).
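As a rough sketch of what that 64 KB requirement means in practice (the page-size constants follow the GPUDirect RDMA documentation; everything else is only illustrative, not code from this library):

```cuda
#include <cuda.h>
#include <cstdio>

#define GPU_PAGE_SHIFT 16                       // GPUDirect pins memory in 64 KB pages
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

int main()
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr ptr;
    const size_t size = 4 << 20; // a 4 MB device buffer
    cuMemAlloc(&ptr, size);

    // Round the range down/up to 64 KB boundaries; this padded range is what
    // gets mapped and counted against the pinnable budget, not just `size`.
    CUdeviceptr start  = ptr & GPU_PAGE_MASK;
    size_t      padded = ((ptr + size + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK) - start;

    printf("buffer:       %llx (%zu bytes)\n", (unsigned long long) ptr, size);
    printf("pinned range: %llx (%zu bytes, 64 KB aligned)\n",
           (unsigned long long) start, padded);

    cuMemFree(ptr);
    cuCtxDestroy(ctx);
    return 0;
}
```

Because the mapping happens in these 64 KB units, a buffer can consume slightly more of the pinned/BAR budget than its nominal size.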
The limitation should be what is reported by the nvidia-smi -q command, according to the following link, right?: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#display-bar-space
That's correct. You can also see the BAR size using `lspci`.
I am using a Volta GPU for my testing, if that matters.
Also, when the SSD does a DMA into GPU memory, does it invalidate any cached lines (for the region being written to) in the GPU's caches?
I am running the CUDA benchmark from your codebase, with the following output for the controller and command line configuration:
The problem is that the thread never finishes polling for the first chunk. So I exit, reload the regular NVMe driver, and check the device's error log, where I see the following entry each time I try to run the benchmark:
The NVMe SSD has only 1 namespace (NSID: 1), and it's the one being used for all commands in the codebase. So what could be the issue? Any help in this matter would be appreciated.