enfiskutensykkel / ssd-gpu-dma

Build userspace NVMe drivers and storage applications with CUDA support
BSD 2-Clause "Simplified" License

Invalid NSID #26

Open ZaidQureshi opened 5 years ago

ZaidQureshi commented 5 years ago

I am running the CUDA benchmark from your codebase, which prints the following controller and command-line configuration:

Controller page size  : 4096 B
Namespace block size  : 4096 B
Number of threads     : 1
Chunks per thread     : 1
Pages per chunk       : 5
Total number of pages : 5
Total number of blocks: 5
Double buffering      : no

The problem is that the thread never finishes polling for the first chunk. So I exit, reload the regular nvme driver, and check the device's error log, where I see the following entry each time I try to run the benchmark:

sqid         : 1
cmdid        : 0
status_field : 0x4016(INVALID_NS)
parm_err_loc : 0xffff
lba          : 0
nsid         : 0x1
vs           : 0

The NVMe SSD has only one namespace (NSID: 1), and it's the one being used for all commands in the codebase. So what could be the issue? Any help in this matter would be appreciated.
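
(For reference, the error log above was read back after rebinding the stock nvme driver, with something like the following nvme-cli command; the exact device node may differ on your system:)

```
sudo nvme error-log /dev/nvme0
```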

enfiskutensykkel commented 5 years ago

I'm not sure to be honest. My initial suspicion is that there may be some issue with peer-to-peer. Does this only happen with the CUDA benchmark? Does the nvm-latency-bench program work when using a GPU buffer?

ZaidQureshi commented 5 years ago

When I run nvm-latency-bench like this: ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=1 --queue 'no=1,depth=32' --outer=1 --inner=5 it seems to be fine.

So I guess it only happens with the CUDA benchmark.

ZaidQureshi commented 5 years ago

I just tested on a completely different system with a different SSD, and it still doesn't work. Are you doing some sort of math with the number of chunks, pages per chunk, and threads for the NVMe setup in this benchmark that requires the total number of pages requested to be some specific value? I see the same issue when I run with chunks=1, pages=1, threads=5.

enfiskutensykkel commented 5 years ago

It seems that you ran nvm-latency-bench without a GPU argument; by default it will simply allocate the buffer in RAM. Please try adding --gpu=0 to the command line arguments, and additionally the --verify option.

As for the calculations of chunks, pages and threads: yes, that part is definitely a bit iffy. I haven't tested it properly, and I suspect there may be some bugs/issues with the offsets there. I've usually tested with a power of two for the number of threads; with the number of pages and chunks set to 1 it should be okay.
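
To illustrate the kind of arithmetic I mean (the names here are made up for illustration and are not the actual benchmark code), each thread is supposed to own a contiguous range of chunks, and the starting block of a chunk is derived roughly like this:

```
#include <cstdint>

// Hypothetical sketch of the offset arithmetic being discussed; parameter
// names are illustrative only.
__host__ __device__ inline uint64_t startBlock(uint32_t threadId,
                                               uint32_t chunk,
                                               uint32_t chunksPerThread,
                                               uint32_t pagesPerChunk,
                                               uint32_t blocksPerPage)
{
    // Thread t owns chunks [t*chunksPerThread, (t+1)*chunksPerThread);
    // each chunk covers pagesPerChunk pages, each page blocksPerPage LBAs.
    uint64_t pageIndex = ((uint64_t) threadId * chunksPerThread + chunk) * pagesPerChunk;
    return pageIndex * blocksPerPage;
}
```

An off-by-one or a misplaced factor anywhere in that chain would make a command target blocks (or PRP list entries) outside the intended range.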

Is there anything showing up in the system log (dmesg) when you run the CUDA benchmark?

ZaidQureshi commented 5 years ago

So when I run: ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=1 --queue 'no=1,depth=32' --outer=1 --inner=5 --gpu=6 --verify --infile=Makefile

I get the following output:

Resetting controller... DONE
Preparing queues... DONE
Reading input file... DONE
Preparing buffers and transfer lists... DONE
Running latency benchmark (reading, sequential, 1 iterations)... DONE
Verifying buffers... DONE
Calculating percentiles...
Queue #01 read percentiles (1 samples)
            bandwidth,       adj iops,    cmd latency,    prp latency
  max:         78.745,      19224.928,         52.016,         52.016
 0.99:          0.000,          0.000,          0.000,          0.000
 0.97:          0.000,          0.000,          0.000,          0.000
 0.95:          0.000,          0.000,          0.000,          0.000
 0.90:          0.000,          0.000,          0.000,          0.000
 0.75:          0.000,          0.000,          0.000,          0.000
 0.50:          0.000,          0.000,          0.000,          0.000
 0.25:          0.000,          0.000,          0.000,          0.000
 0.10:          0.000,          0.000,          0.000,          0.000
 0.05:          0.000,          0.000,          0.000,          0.000
 0.01:          0.000,          0.000,          0.000,          0.000
  min:         78.745,      19224.928,         52.016,         52.016
End percentiles
OK!

If I remove the --verify and the --infile options I get the following output:

Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... DONE
Running latency benchmark (reading, sequential, 1 iterations)... DONE
Calculating percentiles...
Queue #01 read percentiles (1 samples)
            bandwidth,       adj iops,    cmd latency,    prp latency
  max:         79.029,      19294.069,         51.829,         51.829
 0.99: 286523866955980473472416234029099106147699534057978681191611151248041351149397796686739190630049919580036652125108246636515974779229108363215505165150419224100864.000, 286523866955980473472416234029099106147699534057978681191611151248041351149397796686739190630049919580036652125108246636515974779229108363215505165150419224100864.000, 286523866955980473472416234029099106147699534057978681191611151248041351149397796686739190630049919580036652125108246636515974779229108363215505165150419224100864.000, 286523866955980473472416234029099106147699534057978681191611151248041351149397796686739190630049919580036652125108246636515974779229108363215505165150419224100864.000
 0.97 … 0.01:  (the rows from 0.97 down to 0.01 repeat the same overflowed value in every column)
  min:         79.029,      19294.069,         51.829,         51.829
End percentiles
OK!

No messages in dmesg.

ZaidQureshi commented 5 years ago

Have you been able to replicate the issue?

ZaidQureshi commented 5 years ago

As for the calculations of chunks, pages and threads: yes, that part is definitely a bit iffy. I haven't tested it properly, and I suspect there may be some bugs/issues with the offsets there.

Are you using these for the creation of the queue memory? For the PRP list memory?

By the way, I am using the code from the master branch, although I have tried the other branch as well; it gives the same result.

enfiskutensykkel commented 5 years ago

Hi,

I am at my cabin right now so I don't have any access to hardware at the moment. I will be back at the office at the beginning of next week.

So when I run: ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=1 --queue 'no=1,depth=32' --outer=1 --inner=5 --gpu=6 --verify --infile=Makefile

I get the following output:

It's good that using a GPU with the --verify option works; that rules out any issues with PCIe peer-to-peer. nvm-latency-bench defaults to using namespace NSID 1 as well.
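
For context (this is general NVMe, nothing specific to this library): the namespace ID is carried in command dword 1 of every submission queue entry, and the INVALID_NS status (the spec calls it "Invalid Namespace or Format") refers to the namespace ID the controller found in that field of the submitted command.

```
#include <cstdint>

// Layout of a 64-byte NVMe submission queue entry (per the NVMe spec);
// nsid is command dword 1, the field the INVALID_NS status refers to.
struct nvme_sqe {
    uint32_t cdw0;         // opcode, fused flags, command identifier
    uint32_t nsid;         // namespace identifier (dword 1)
    uint32_t rsvd[2];
    uint64_t mptr;         // metadata pointer
    uint64_t prp1;         // PRP entry 1
    uint64_t prp2;         // PRP entry 2 / PRP list pointer
    uint32_t cdw10_15[6];  // command-specific dwords (e.g. starting LBA, count)
};
```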

Just to explain what's going on here:

So I'm fairly sure that at least that works.

If I remove the --verify and the --infile options I get the following output:

As for the weird percentiles output, that's a known bug. It's caused by an arithmetic overflow because the number of repetitions is too low. I'll fix this at some point, but it's just an annoyance so it's not a high priority.
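
As a rough illustration of how a value like that can appear (this is not the actual benchmark code): with only a single sample, an unsigned percentile index can wrap around, so the lookup lands far outside the sample array and prints whatever happens to be in memory there.

```
#include <cstdio>
#include <cstddef>

// Illustrative only: shows how a percentile index goes wrong when the
// sample count is too small. (size_t)(1 * 0.99) truncates to 0, and
// subtracting 1 from an unsigned zero wraps to a huge index.
int main()
{
    std::size_t n = 1;     // only one measurement collected
    double p = 0.99;
    std::size_t idx = static_cast<std::size_t>(n * p) - 1;
    std::printf("index for n=%zu, p=%.2f: %zu\n", n, p, idx);
    return 0;
}
```

Running with more repetitions (a larger --inner/--outer count) sidesteps the degenerate case.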

Are you using these for the creation of the queue memory? For the PRP list memory?

Yes. There might be something buggy going on there. But with chunks and pages set to 1, the calculation is fairly straightforward, so any calculation errors shouldn't matter in that case. Have you tried setting the number of threads to 1?

Out of interest, what GPU and disk are you using? Maybe I can try to reproduce when I get back. Can you confirm that the GPU supports GPUDirect Async?

enfiskutensykkel commented 5 years ago

Is this issue by any chance related to your experience from #25 ?

ZaidQureshi commented 5 years ago

I have tried 1 thread, 1 chunk, and 1 page, and that's fine.

I guess it's the same issue as #25

enfiskutensykkel commented 5 years ago

I see. I'll try to reproduce the issue later this week then. Thank you for reporting it. At this point, I believe some offset calculation is very likely the culprit.

enfiskutensykkel commented 5 years ago

I believe I have confirmed that there is some issue with illegal memory accesses for certain parameters. I will try to look into it as soon as I am able to.

ZaidQureshi commented 5 years ago

If/when you know where this is happening (or with what structure) could you please let me know? Thanks.

ZaidQureshi commented 5 years ago

Hi,

Any updates on this? Thanks.

enfiskutensykkel commented 5 years ago

I'm currently unable to reproduce this; I've tried the different combinations used in #25, but it seems to work. However, I see that in my SISCI branch I've added a restriction that only allows the thread count to be a power of two. The bug I thought I had reproduced turned out to be something unrelated.

ZaidQureshi commented 5 years ago

OK, thank you for looking into it. Could it maybe be a BIOS feature I need to enable or disable?

ZaidQureshi commented 5 years ago

Other details that may be of use: I am running this on Linux kernel version 4.15, and I have been able to reproduce this error on two machines (with two different SSDs using different controllers, and two different GPUs). May I ask what SSD you use for testing?

enfiskutensykkel commented 5 years ago

I'll look into it some more; I haven't ruled out that there is some form of alignment/overlap issue that doesn't happen on my system but may happen on other systems. I'm not aware of any BIOS setting that might affect it.

In the past I've tested with a Samsung Evo 960 and 970, some non-Optane Intel disks whose model names I don't recall at the moment, an Intel Optane 900P, and an Intel Optane 4800X. I've only used the two Optane disks when trying to reproduce this issue, though, so I can try one of the other disks.

I'll see what I can do in order to try to reproduce it, but I'm pretty swamped with other stuff the next couple of weeks. Just out of curiosity, have you tried both branches? You may have to run make clean and even cmake again after switching to the other branch.

ZaidQureshi commented 5 years ago

Yes I have tried both branches and I get the same result.

ZaidQureshi commented 5 years ago

One more question: what distro, kernel version, CUDA version, and NVIDIA driver version do you use?

enfiskutensykkel commented 5 years ago

I've tested CentOS and Fedora in the past and with different CUDA versions (and it has worked), but I've tried replicating this issue using Ubuntu 18.04.2 with CUDA 10.1. The driver version is 418.40.04, as reported by nvidia-smi.

To clarify, it's just some combinations of arguments that hang, right? Or do they all hang now?

ZaidQureshi commented 5 years ago

Yes, it is just some combinations; many others work. But I am concerned that, since some combinations don't work, there is something fundamentally wrong going on (like some overlap or miscalculated indices).

enfiskutensykkel commented 5 years ago

I totally understand, and I will try to look into it when I have the time.

For those combinations that appear to work, it should be possible to verify the output from disk by using the --output (or maybe it's --outfile) option, which writes what is read to a file. You could then use another program such as nvm-latency-bench, also with the --output option, to read out a matching number of pages and write those to a file, and compare the two files. Note that in the sisci-5.11 branch, you need to comment back in the for-loop in the moveBytes function in main.cu (https://github.com/enfiskutensykkel/ssd-gpu-dma/blob/sisci-5.11/benchmarks/cuda/main.cu#L85)

While this does not guarantee that ranges aren't overlapping, it should at least provide some confirmation that the entire range is covered and that all chunks are read at least once. Maybe not very reassuring, but it at least confirms that data is read from the disk.
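
Roughly like this (the binary and flag names below are only indicative and may not match the actual options on your build):

```
# Read the same pages with both benchmarks, dump what was read to files,
# then compare the files. Exact flags (--output vs --outfile) may differ.
./bin/nvm-cuda-bench    --ctrl=/dev/libnvm0 --threads=1 --chunks=1 --pages=8 --output=cuda.bin
./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --count=1 --pages=8 --queue 'no=1,depth=32' --output=latency.bin
cmp cuda.bin latency.bin && echo "reads match"
```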

ZaidQureshi commented 5 years ago

That is what I have been doing: I write data to the disk and then use the CUDA benchmark with the output flag, and I get correct values, at least as far as I remember, for the combinations that work. The data read is correct. I will do some more extensive testing over the weekend.

ZaidQureshi commented 5 years ago

I tested the output for the configs that work, and it seems to be correct.

Is there a limit on how many DMA mappings can be created using GPUDirect?

enfiskutensykkel commented 5 years ago

There's no limitation on the number of mappings, but most GPUs have a limit on how much memory can be "pinned" and exposed to third-party devices. This limit is usually around 128-256 MB. Depending on the GPU, there is also a memory alignment requirement (pinned memory must be 64 KiB aligned).
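
As a minimal sketch of the alignment point (not code from this library): buffers intended for GPUDirect mappings are typically over-allocated and the pointer rounded up to the next 64 KiB boundary before being pinned.

```
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// GPUDirect RDMA requires the exposed region to start on a 64 KiB boundary.
static const std::uintptr_t GPU_PAGE_SIZE = 0x10000;

void* allocAligned(std::size_t size, void** raw)
{
    // Over-allocate, then round the pointer up to the next 64 KiB boundary.
    cudaMalloc(raw, size + GPU_PAGE_SIZE);
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(*raw);
    addr = (addr + GPU_PAGE_SIZE - 1) & ~(GPU_PAGE_SIZE - 1);
    return reinterpret_cast<void*>(addr);
}
```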

ZaidQureshi commented 5 years ago

The limit should be what is reported by the nvidia-smi -q command, according to the following link, right? https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#display-bar-space

enfiskutensykkel commented 5 years ago

That's correct. You can also see the BAR size using lspci.
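
For example (substitute your own GPU's PCI address in the lspci line):

```
# BAR1 size/usage according to the NVIDIA driver:
nvidia-smi -q | grep -A 3 "BAR1 Memory Usage"

# Raw BAR regions and sizes as seen by the OS:
lspci -v -s 01:00.0 | grep "Memory at"
```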

ZaidQureshi commented 5 years ago

I am using a Volta GPU for my testing, if that matters.

Also, when the SSD does a DMA write into GPU memory, does it invalidate any cached lines (for the region being written to) in the GPU's caches?