mayureshw closed this issue 7 months ago
Yeah, makes sense. The first issue is TLP overhead: likely the Nitefury and your system are using the smallest TLP size, and thus incur significant overhead there. And since you are looping back in the device and waiting for the responses, you pay this overhead twice.
Second, PCIe is a "high latency, high throughput" type of device, so the main issue is really the latency: since your test does a loop through the device, the throughput is primarily limited by the latency of initiating a transaction.
The best way to measure throughput is to measure the time in each direction separately.
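For example, assuming the stock XDMA character devices created by the Xilinx dma_ip_drivers (/dev/xdma0_h2c_0 and /dev/xdma0_c2h_0; the actual names depend on how the driver enumerated your card), something like the following pair of dd transfers exercises one direction each. This is only a sketch; note that with an AXI-Stream loopback the two have to run concurrently, since the C2H data only exists while H2C is feeding it:

```bash
# Card-to-host reader in the background, host-to-card writer in the foreground;
# with a stream loopback the two must run at the same time. ~4 GB per direction.
dd if=/dev/xdma0_c2h_0 of=/dev/null        bs=4096 count=1000000 &
dd if=/dev/zero        of=/dev/xdma0_h2c_0 bs=4096 count=1000000
wait
```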
There is one dd command in each direction. Their combined output looks like this:
1000000+0 records in
1000000+0 records out
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 59.7542 s, 68.5 MB/s
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 59.7542 s, 68.5 MB/s
BS = 4096: 68.5 MB/s
BS = 512:  9.2 MB/s
BS = 8:    147 kB/s
And if I continue increasing the block size, the throughput peaks at about 1.2 GB/s around a block size of 1 MB. Any further increase in the block size brings the throughput slightly down.
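Something along these lines reproduces such a sweep (a sketch only; the device names assume the default Xilinx XDMA driver naming, and the block sizes and total are arbitrary):

```bash
# Sweep the dd block size, 1 GiB per direction at each step.
# Assumes the default XDMA stream devices and a loopback design,
# so both directions run concurrently.
total=$((1 << 30))
for bs in 4096 65536 1048576 4194304; do
    count=$(( total / bs ))
    echo "=== bs=$bs ==="
    dd if=/dev/xdma0_c2h_0 of=/dev/null        bs=$bs count=$count &
    dd if=/dev/zero        of=/dev/xdma0_h2c_0 bs=$bs count=$count
    wait
done
```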
The device MaxPayload is 512 bytes. I wonder why it takes a block size of 1 MB to peak the bandwidth.
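For reference, the advertised and negotiated values can be read from the endpoint's PCIe capability with lspci. This is a sketch that filters on the Xilinx vendor ID; the exact device to inspect depends on the system:

```bash
# Show the advertised (DevCap) and negotiated (DevCtl) MaxPayload, plus
# MaxReadReq, for any Xilinx endpoint (vendor ID 10ee).
sudo lspci -d 10ee: -vv | grep -E 'MaxPayload|MaxReadReq'
```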
Closing this with the learning that a block size of 1 MB peaks the PCIe bandwidth. This probably depends on the host machine.
On the Nitefury device, using the XDMA IP, I have created a loopback design where PCIe in loops back to PCIe out via an AXI loopback.
Observed speeds seem to vary greatly depending on the block size of the transfer, but even using the 4K-byte MaxPayload size mentioned in the XDMA IP as the block size, it is still way below the theoretical speed of a Gen 2 link.
For a script like the above, I get the following outputs:
BS = 4096: 68.5 MB/s
BS = 512:  9.2 MB/s
BS = 8:    147 kB/s
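For scale, a rough ceiling for the "theoretical speed of a Gen 2 link" can be worked out as below. This assumes the design trains at Gen 2 x4 (an assumption; the trained speed and width show up in the LnkSta line of lspci -vv): 5 GT/s per lane with 8b/10b encoding leaves 4 Gbit/s of data per lane, before any TLP/DLLP overhead.

```bash
# Back-of-the-envelope raw link rate, assuming a Gen 2 x4 link (confirm the
# trained speed/width with: sudo lspci -d 10ee: -vv | grep LnkSta).
# Gen 2 is 5 GT/s per lane; 8b/10b encoding leaves 4 Gbit/s of data per lane.
lanes=4
echo "raw: $(( 4 * lanes )) Gbit/s = $(( 4 * lanes / 8 )) GB/s before TLP/DLLP overhead"
```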