ZaidQureshi / bam

BSD 2-Clause "Simplified" License

BaM bandwidth stops increasing when the number of NVMe SSDs is more than 7 #17

Open · LiangZhou9527 opened 1 year ago

LiangZhou9527 commented 1 year ago

Hi there,

I'm doing benchmark testing on my machine, which is configured with some H800 GPUs and 8 NVMe SSDs dedicated to BaM.

The GPU is connected via PCIe Gen5 x16 and each NVMe SSD via PCIe Gen4 x4, which means that in theory the max GPU bandwidth is around 60 GB/s and the max bandwidth of a single NVMe SSD is around 7.5 GB/s.
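As a rough sanity check on those theoretical numbers (raw link rate with 128b/130b encoding, before any packet/protocol overhead):

# PCIe Gen5 x16: 32 GT/s per lane, 16 lanes, 128b/130b encoding
awk 'BEGIN { printf "Gen5 x16: %.1f GB/s\n", 32e9*16*128/130/8/1e9 }'
# PCIe Gen4 x4: 16 GT/s per lane, 4 lanes, 128b/130b encoding
awk 'BEGIN { printf "Gen4 x4:  %.2f GB/s\n", 16e9*4*128/130/8/1e9 }'

These print roughly 63.0 GB/s and 7.88 GB/s; the ~60 GB/s and ~7.5 GB/s figures above already allow for some protocol overhead.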

But according to my testing using "nvm-block-bench", the results are not as expected. I summarized the results here: https://raw.githubusercontent.com/LiangZhou9527/some_stuff/8b48038465858846f864e43cef6d0e6df787a2c2/BaM%20bandwidth%20and%20the%20number%20of%20NVMe.png

In the picture we can see that the bandwidth with 6 NVMe SSDs and 7 NVMe SSDs is almost the same, but when the number of SSDs reaches 8, the bandwidth drops a lot.

Any thoughts on what is happening here?

BTW, I didn't enable the IOMMU on my machine, and the benchmark command line is as below (I executed the command 8 times, each time with a different --n_ctrls value: 1, 2, ..., 8).

./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1
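(Equivalently, that sweep amounts to the following loop over --n_ctrls, with every other flag kept fixed:)

for n in 1 2 3 4 5 6 7 8; do
    ./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=$n
done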

msharmavikram commented 1 year ago

Wow. There are two pieces of awesome news here.

We have not tested the Hopper generation or a Gen5 CPU yet, and we are super excited to see the benchmark and bring-up working out of the box. Thanks for this first piece of awesome news!

We are delighted to see linear scaling up to 5 SSDs. Agreed, it is lower than expected and not scaling beyond that, but these are the first Gen5 platform results we are aware of, so thanks for the second piece of awesome news. Anyway, we faced similar trends when we moved from Gen3 to Gen4, and we want to help you debug this issue. We likely will not get access to a Gen5 platform immediately, so can we schedule a call to discuss what can be done (I believe you know my email address)?

We have a bunch of theories, and the only way to determine what may be going wrong is to validate each of them. The IOMMU is definitely one of the usual culprits here, but we also need to understand the PCIe topology and the capabilities of the Gen5 root complex. Previously we faced issues where the CPU was wrongly configured to handle such high throughput, and we need to confirm that is not the case here. There is a bit of debugging to be done for the Gen5 platform, and we want to help!
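For example, one quick thing to verify is that every SSD (and its upstream switch/root port) actually trained at the expected link speed and width. Something along these lines should show it (0108 is the NVMe PCI class code; this assumes a reasonably recent pciutils):

for dev in $(lspci -d ::0108 | awk '{print $1}'); do echo "== $dev =="; sudo lspci -vv -s "$dev" | grep -E 'LnkCap:|LnkSta:'; done

LnkSta should read 16 GT/s x4 for each drive; anything lower points at the slot or switch rather than BaM.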

Lastly, can you try the following:

./bin/nvm-block-bench --threads=$((1024*1024)) --blk_size=64 --reqs=1 --pages=$((1024*1024)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1

I'm curious to see if latency is an issue here.

LiangZhou9527 commented 1 year ago

Hi @msharmavikram ,

We likely will not get access to a Gen5 platform immediately, so can we schedule a call to discuss what can be done (I believe you know my email address)?

Much appreciated that you're lending me a hand with this issue. Yes, I know your email address, and I'm very happy to schedule a call once more information is available.

The IOMMU is definitely one of the usual culprits here, but we also need to understand the PCIe topology and the capabilities of the Gen5 root complex.

I didn't enable the IOMMU on my host; there is no output when I run "cat /proc/cmdline | grep iommu". I also attached the PCIe topology collected with "lspci -tv" and "lspci -vv", please refer to the attached "lspci -tv" and "lspci -vv" outputs.
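(For a more direct check than grepping the kernel command line, one could also look at, for example:)

# If an IOMMU is active, /sys/class/iommu is non-empty (dmar* on Intel, ivhd* on AMD)
ls /sys/class/iommu/
# The kernel log also records IOMMU/DMAR initialization
sudo dmesg | grep -iE 'iommu|dmar'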

Lastly, can you try the following: ./bin/nvm-block-bench --threads=$((1024*1024)) --blk_size=64 --reqs=1 --pages=$((1024*1024)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1

Please refer to the output below:

SQs: 135 CQs: 135 n_qps: 128 n_ranges_bits: 6 n_ranges_mask: 63 pages_dma: 0x7f9540010000 21a020410000
HEREN Cond1 100000 8 1 100000
Finish Making Page Cache
finished creating cache
0000:18:00.0 atlaunch kernel
Elapsed Time: 686253
Number of Ops: 1048576
Data Size (bytes): 4294967296
Ops/sec: 1.52797e+06
Effective Bandwidth(GB/S): 5.82876
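(As a quick cross-check of that figure, and assuming the benchmark reports GB as 2^30 bytes and the elapsed time in microseconds, the numbers are self-consistent:)

awk 'BEGIN { b=4294967296; us=686253; printf "%.3f GiB/s  (%.2f GB/s)\n", b/(us/1e6)/2^30, b/(us/1e6)/1e9 }'

which reproduces the reported Effective Bandwidth, i.e. about 6.26 GB/s from a single Gen4 x4 SSD.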

msharmavikram commented 1 year ago

Will look forward to your email.

Meanwhile, can you try one more command and increase the number of SSDs from 1 to 8 (the one below is for 8 SSDs)?

./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=135 --random=true -S 1 --n_ctrls=8

LiangZhou9527 commented 1 year ago

Hi @msharmavikram ,

Here's the log from 1 to 8 SSDs; the result is similar to what I summarized before.

https://raw.githubusercontent.com/LiangZhou9527/some_stuff/main/1-8.log

Please note that the line "in Controller::Controller, path = /dev/libnvm0" is for debugging only; it does not impact the performance results.

msharmavikram commented 1 year ago

I believe these are Intel SSDs; at least that's how it looks. What are the max IOPS for 4 KB and 512 B accesses?

The issue seems to come from the IOMMU, the PCIe switch, or the CPU. We want to determine whether the limit is bandwidth or IOPS. Let's try the 1 to 8 SSD configurations with page_size=512 instead of 4 KB.
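For reference, the earlier single-SSD run did about 1.53 M ops/sec at 4 KiB, and the same request rate moves far less data at 512 B; so if the 512 B sweep plateaus at a similar ops/sec, the limiter is request rate (IOPS) rather than raw bandwidth:

awk 'BEGIN { iops=1.53e6; printf "at %.2fM IOPS: 4 KiB -> %.2f GB/s, 512 B -> %.2f GB/s\n", iops/1e6, iops*4096/1e9, iops*512/1e9 }'

A sketch of the sweep, assuming only --page_size needs to change from the earlier runs:

for n in 1 2 3 4 5 6 7 8; do ./bin/nvm-block-bench --threads=$((1024*1024)) --blk_size=64 --reqs=1 --pages=$((1024*1024)) --queue_depth=1024 --page_size=512 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=$n; done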

Let's see what it shows.

(Reach out over email, as we might require additional support from vendors here: Broadcom, Intel.)