Microsemi / switchtec-kernel

A kernel module for the Microsemi PCIe switch
GNU General Public License v2.0

Slow Data Rate #113

Open · turjanica-AS opened this issue 2 years ago

turjanica-AS commented 2 years ago

So I have a two-NT-partition setup and am trying to send data using ntb_transport/ntb_netdev. While running qperf, or even an FTP transfer, I am only getting around 15-16 MB/s. I'm not sure what is causing such a slow data rate. I'm on Ubuntu 20.04, kernel 5.13.

lsgunth commented 2 years ago

That's an order of magnitude slower than I'd expect, even at its slowest.

Are you using a DMA engine, assuming your hardware supports one? Without that, the software has to copy over I/O memory with memcpy, and there are lots of things that can slow that down.

Also, I know that setting up and using ntb_msi improves the speed significantly, but you should be able to get much higher throughput even without it.

turjanica-AS commented 2 years ago

I did not have the dma drivers running previously. But I loaded them and still got 16 MB/s.

I ran the ntb_perf tool and got ~60Gb/s (non DMA) and ~113Gb/s (DMA).

So it's something to do with the ntb_netdev driver when trying to do TCP/IP?

lsgunth commented 2 years ago

Yeah, I'm not sure. I remember it being a bit sensitive to message sizes, especially without the ntb_msi improvement. I never used qperf, but is it possible it's running with very small messages? The testing I did was with iperf. I do remember doing some tuning to increase the MTU and send larger packets, which helped somewhat.

epilmore commented 2 years ago


I would suggest looking at the ntb_transport module parameter "copy_bytes" in /sys/module/ntb_transport/parameters/copy_bytes.

You may also want to look at /sys/module/ntb_transport/parameters/transport_mtu.

The "copy_bytes" parameter defines a threshold: below it, data is moved with a simple memcpy; at or above it, the DMA engine is used. The ntb_netdev module utilizes ntb_transport to implement the QPs used for communication.

Eric

epilmore commented 2 years ago


I should also point out that ntb_perf does NOT utilize ntb_transport, and so is governed by its own parameters with respect to memcpy vs DMA when moving data.

Eric

jborz27 commented 2 years ago


Did you also enable DMA on ntb_transport? Either way, in my experience I found ntb_perf more useful since it doesn't involve the TCP/IP layer, which has inefficiencies of its own that need to be accounted for. For example, the TCP window size and options such as zero-copy can make a difference in performance.
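As a userspace illustration of the TCP-window point (the helper name and the 4 MB value are just examples, not something either benchmark does for you):

#include <stdio.h>
#include <sys/socket.h>

/*
 * Illustration only: enlarge the socket send/receive buffers so the
 * TCP window is less likely to be the bottleneck over a fast NTB
 * link. Real tuning depends on the bandwidth-delay product and on
 * sysctl limits such as net.core.rmem_max / net.core.wmem_max.
 */
int tune_socket_buffers(int fd)
{
        int bufsz = 4 * 1024 * 1024;

        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)) ||
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz))) {
                perror("setsockopt");
                return -1;
        }
        return 0;
}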

turjanica-AS commented 2 years ago

Thanks, that's all very helpful.

I have run both qperf and iperf3 with the same ~16 MB/s results.

I updated my parameters:

use_dma: N -> Y
copy_bytes: 1024 -> 0
max_mw_size: 0 -> 0x80000000000000000

But no change in speed; still getting 16 MB/s.

Just as a note, I am trying to use ntb_netdev for the TCP/IP stack because we want that capability so existing applications will not have to be changed. And yes, I realize ntb_perf uses different parameters than ntb_transport.

lsgunth commented 2 years ago

What do you see in cat /sys/kernel/debug/ntb_transport/<device>/qp0/stats?

epilmore commented 2 years ago


I'm assuming that value for max_mw_size is a typo, since it is 68 bits long!

Maybe try running traceroute on your IP interface for the NTB.

Eric

turjanica-AS commented 2 years ago


Yes, sorry, a typo; it was only supposed to be 15 zeros, not 16.

Traceroute shows:

Server1: traceroute to 192.1.1.11 (192.1.1.11), 30 hops max, 60 byte packets
1 192.1.1.11 (192.1.1.11) 0.972 ms 0.957 ms 8.229 ms

Server2: traceroute to 192.1.1.10 (192.1.1.10), 30 hops max, 60 byte packets
1 192.1.1.10 (192.1.1.10) 0.715 ms 0.699 ms 6.648 ms

turjanica-AS commented 2 years ago


Below is the output of the qp0 stats you asked about.

root@carson-server1:/sys/kernel/debug/ntb_transport/0000:01:00.0/qp0# cat stats

NTB QP stats:

rx_bytes - 1473583413
rx_pkts - 23085
rx_memcpy - 23085
rx_async - 0
rx_ring_empty - 28521
rx_err_no_buf - 0
rx_err_oflow - 0
rx_err_ver - 0
rx_buff - 0x00000000ed26dcee
rx_index - 6
rx_max_entry - 7
rx_alloc_entry - 100

tx_bytes - 790495
tx_pkts - 11850
tx_memcpy - 11850
tx_async - 0
tx_ring_full - 0
tx_err_no_buf - 0
tx_mw - 0x00000000218a732e
tx_index (H) - 6
RRI (T) - 5
tx_max_entry - 7
free tx - 6

Using TX DMA - No
Using RX DMA - No
QP Link - Up

lsgunth commented 2 years ago

So, you're still not using a DMA engine; it's all being memcpy'd. Perhaps there is no DMA engine? Do you see anything in /sys/class/dma? What kind of CPU do you have?

turjanica-AS commented 2 years ago

So yes, I had forgotten to load the DMA engine. Although with it loaded, the stats still show No.

root@carson-server1:/sys/module/ntb_transport/parameters# cat /sys/kernel/debug/ntb_transport/0000\:01\:00.0/qp0/stats

NTB QP stats:

rx_bytes - 19618
rx_pkts - 128
rx_memcpy - 128
rx_async - 0
rx_ring_empty - 256
rx_err_no_buf - 0
rx_err_oflow - 0
rx_err_ver - 0
rx_buff - 0x0000000063c68030
rx_index - 2
rx_max_entry - 7
rx_alloc_entry - 100

tx_bytes - 21965
tx_pkts - 130
tx_memcpy - 130
tx_async - 0
tx_ring_full - 0
tx_err_no_buf - 0
tx_mw - 0x00000000409f3664
tx_index (H) - 4
RRI (T) - 3
tx_max_entry - 7
free tx - 6

Using TX DMA - No
Using RX DMA - No
QP Link - Up

root@carson-server1:/sys/module/ntb_transport/parameters# ls /sys/class/dma
dma0chan0 dma0chan1 dma1chan0 dma1chan1 dma2chan0 dma2chan1 dma2chan2 dma2chan3

root@carson-server2:/sys/module/ntb_transport/parameters# ls /sys/class/dma
dma0chan0 dma0chan1 dma1chan0 dma1chan1

epilmore commented 2 years ago


The DMA engine is acquired when the ntb_transport QP is created. You'll want to bring down the network interface, and possibly just unload and reload the ntb_netdev module, on both ends. This will cause the QP to get recreated and it should pick up the now-present DMA engine.
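Roughly speaking, the channel request happens at QP-creation time, along these lines (a simplified sketch with a made-up function name, not the exact ntb_transport source):

#include <linux/dmaengine.h>

/*
 * Sketch only: when use_dma is set, a memcpy-capable DMA channel is
 * requested while the QP is being created. A DMA driver loaded
 * afterwards is therefore only picked up once the QP is torn down
 * and recreated, e.g. by reloading ntb_netdev on both hosts.
 */
static struct dma_chan *qp_request_dma_chan(void)
{
        dma_cap_mask_t mask;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);

        /* NULL filter: accept any memcpy-capable channel */
        return dma_request_channel(mask, NULL, NULL);
}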

Eric

turjanica-AS commented 2 years ago


So I just went back and took everything down so I could bring it back up in the proper order. One server looks good; dmesg shows "switchtec 0000:01:00.0: Using DMA memcpy for TX" and the same for RX. But when I try to load ntb_transport on the second server, it gets stuck and shows a dmesg error:

Software Queue-Pair Transport over NTB, version 4
BUG: unable to handle page fault for address: ffffbc8dc1e37074
#PF: supervisor read access in kernel mode
#PF: error code (0x0000) - not-present page
PGD 100000067 P4D 1000000067 PUD 1001d6067 PMD 119c53067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 3 PID: 191 Comm: modprobe Tainted: G OE 5.13.0-30-generic #33~20.04.1-Ubuntu

There is also a scenario where the error still appears but I can bring up the network interface; I just can't send pings. In that case the error comes right after "switchtec: eth0 created", and it happens when I load ntb_netdev on the other server.

CPU: 7 PID: 172 Comm: kworker/7:1 Tainted: G OE

turjanica-AS commented 2 years ago


The CPU is an i7-11700.

Yes, I was able to see entries in /sys/class/dma. My response to epilmore above shows more about what I'm seeing in dmesg now.

lsgunth commented 2 years ago

What device provides the DMA engines? Is it part of the CPU, or is it something else? I thought only Xeon CPUs had DMA engines in them, but I could be wrong about that.

turjanica-AS commented 2 years ago


I'm using a Switchtec PFX chip, which has a hardware DMA engine, and the switchtec-dma kernel module exposes that DMA engine to the upper-layer host software.

jborz27 commented 2 years ago

Yeah, it should be the Switchtec DMA. That's what ntb_perf uses when the Switchtec DMA driver is loaded.

turjanica-AS commented 2 years ago

Yeah, which makes it strange that ntb_perf has no issues but ntb_netdev/ntb_transport does.

Another wrinkle to the error I showed above: one server shows that error, and the other server's dmesg keeps repeating...

switchtec 0000:01:00.0: Remote version = 0

Hundreds of them. So is something messed up with switchtec-kernel?

jborz27 commented 2 years ago

That log originates from ntb_transport.c:

/* Query the remote side for its info */
val = ntb_spad_read(ndev, VERSION);
dev_dbg(&pdev->dev, "Remote version = %d\n", val);
if (val != NTB_TRANSPORT_VERSION)
        goto out;
lsgunth commented 2 years ago

Oh, hmm. Rather sounds like a bug in the switchtec-dma module. I'm not all that familiar with it. Maybe post the full BUG, including the full traceback?

epilmore commented 2 years ago


I don't think the Switchtec DMA will be generally compatible with ntb_transport usage. The Switchtec DMA requires that the Source and Destination addresses have the exact same word-byte alignment. If your Source/Dest are always on (at least) word boundaries, then you should be good to go, but if not, then the DMA engine will not be happy. In our usage of Switchtec DMA, we've only been able to leverage it for usages where there are "block" transfers happening, where the Source/Dest addresses are definitely at least word aligned. In ntb_netdev/ntb_transport usage, you don't really have a guarantee that both Source and Dest addresses will be aligned equally, i.e. (S=0x0,D=0x0), (S=0x1,D=0x1), (S=0x2,D=0x2), (S=0x3,D=0x3).
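As a rough illustration of that restriction (a hypothetical helper for this discussion, not actual switchtec-dma code):

#include <linux/types.h>

/*
 * Hypothetical helper, for illustration only: the Switchtec DMA can
 * only be used when source and destination share the same byte
 * offset within a word, which skb payloads generally don't guarantee.
 */
static bool switchtec_dma_alignment_ok(dma_addr_t src, dma_addr_t dst)
{
        return (src & 0x3) == (dst & 0x3);
}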

Eric

turjanica-AS commented 2 years ago


Hmm, then is there something you would suggest for moving data over TCP/IP, if ntb_netdev/ntb_transport doesn't play well with the Switchtec DMA?

Not sure if this also answers the question of why I'm getting 16 MB/s using non-DMA ntb_netdev/ntb_transport.

If you would like, here are the outputs and what I was doing before the error. I did the full setup on Host 1 first, then started Host 2. Host 1 eventually freezes; the mouse moves but it won't take any input. Netdev_DMAChannel_Setup_Remote_Version_Host2.txt Netdev_DMAChannel_walkthrough_Host1.txt

epilmore commented 2 years ago


Across NTB, ntb_netdev/ntb_transport is your only option short of writing your own version, although that won't solve your problem anyway. The issue is not so much the fault of ntb_netdev/ntb_transport, but rather the nature of data going through the general Linux netdev (TCP/IP) stack, i.e. skbs. Most DMA engines nowadays don't have the alignment restriction that the Switchtec DMA does.

If your CPU is an Intel, you might have IOAT available; if it is an AMD, it comes with some embedded DMA engines as part of its crypto engine. I presume you have a host bus adapter card that connects your host to the Switchtec switch? Are there possibly any DMA engines available on that card?


As for the 16 MB/s: if a DMA engine is not available, the data is moved by the CPU with good old-fashioned memcpy(). Depending on the platform, memcpy() could be optimized to leverage vector registers that effectively move more data per CPU operation than a simple 4- or 8-byte store. I don't recall what CPU you are using. Even for a CPU memcpy, 16 MB/s seems low, but I haven't measured it lately to know what a reasonable range is to expect.

Eric

lsgunth commented 2 years ago

memcpy can perform much worse when dealing with uncached I/O memory, as in this application. I remember, a long time ago, having abysmal performance because the kernel had been built with optimize-for-size set, and memcpy was therefore copying one byte at a time, which meant one TLP on the PCIe bus per byte. Not efficient.
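For illustration only (these are toy routines, not the actual kernel memcpy path), the difference in copy width over uncached MMIO looks roughly like this:

#include <linux/io.h>
#include <linux/types.h>

/* One MMIO write per byte: roughly one PCIe write TLP per byte moved. */
static void copy_bytewise(void __iomem *dst, const u8 *src, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++)
                iowrite8(src[i], dst + i);
}

/*
 * 32-bit writes move four bytes per TLP; wider stores, memcpy_toio(),
 * or a DMA engine do even better.
 */
static void copy_wordwise(void __iomem *dst, const u32 *src, size_t nwords)
{
        size_t i;

        for (i = 0; i < nwords; i++)
                iowrite32(src[i], dst + i * 4);
}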

turjanica-AS commented 2 years ago


So would the solution be to modify switchtec-dma so that it acts like most current DMA engines? That seems like less hassle than writing a new ntb_netdev/ntb_transport.

Our CPU is an i7-11700, and we are using the Switchtec development board with the ADP_EDGE adapters, so there is no DMA engine on the NICs. This dev board is being used until our HW NICs are built, but those NICs will just have a PFX on them.

epilmore commented 2 years ago


Modifying switchtec-dma will not help because it is a hardware limitation.

Rewriting ntb_netdev/ntb_transport will not help because the issue is generally in the Linux Netdev stack, and rewriting that is not practical. Furthermore, the necessary changes to conform to the Switchtec DMA alignment requirements would likely hamper overall performance anyway.

BTW, I do NOT claim to be an expert on the intimate details of the Linux Netdev stack. It may be possible that there is a knob that might force a data alignment on SKBuff's such that they could satisfy the Switchtec DMA hardware restrictions, but I'm not aware of what that knob is or whether one even exists.

Bottom line: I think you may be screwed. Since your server does not have built-in DMA engines, and if everything you're doing is PCIe Gen3, you could consider the Dolphin PXH832 host bus adapter. You could probably enable the PLX DMA engines on that device and use those (the plx_dma driver is in Linux courtesy of Logan!). Sorry, the NTB stuff is cool and interesting, but to really derive the benefit you need something to actually push the data down the wire!

Unless somebody else has some bright ideas!

Eric