Xilinx / xup_vitis_network_example

VNx: Vitis Network Examples

Query: (1) why throughput higher than theoretical, (2) why lower throughput with 128B payload. #81

Open 108anup opened 2 years ago

108anup commented 2 years ago

I am not sure what the best place to ask this question is, so I am asking here as an issue. Please let me know if another venue is preferred.

Q1: I see cases where the observed throughput is higher than the theoretical maximum, e.g., 97 Gbps with a 1472B payload where the theoretical maximum is around 95 Gbps. A 2 Gbps difference is too large to explain with a measurement mismatch of a few cycles. I noticed the recently changed Jupyter notebook also shows this, but there the mismatch is only about 0.0005 Gbps. What might cause a higher-than-theoretical throughput measurement? If we under-measure the time to receive the packets by even 100 cycles, that might explain a 0.0005 Gbps mismatch but not a 2 Gbps one.
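For scale, here is a rough back-of-the-envelope sketch of that claim (not from the notebook; the ~300 MHz measurement clock and the 10^6 packet count are assumptions purely for illustration):

# Back-of-the-envelope: effect of a 100-cycle timing error on the reported
# throughput. Clock frequency and packet count below are illustrative assumptions.
KERNEL_CLOCK_HZ = 300e6          # hypothetical measurement clock
payload = 1472
wire_bytes = payload + 8 + 20 + 14 + 4 + 12 + 7 + 1   # UDP, IP, Eth, FCS, IFG, preamble, SFD
num_packets = 1_000_000

true_time = num_packets * wire_bytes * 8 / 100e9       # receive time at 100 Gbps line rate
short_time = true_time - 100 / KERNEL_CLOCK_HZ         # same time, under-measured by 100 cycles

true_thr = num_packets * payload * 8 / true_time / 1e9
short_thr = num_packets * payload * 8 / short_time / 1e9
print(short_thr - true_thr)   # on the order of 1e-4 Gbps, nowhere near 2 Gbps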

Q2: I observe relatively lower throughput with a 128B payload. Why might that be? Is it related to the number of bytes transmitted per flit? If so, a 64B payload should have a similar issue (55 bytes per flit), yet only the 128B payload shows a ~5 Gbps gap to the theoretical maximum, while other payload sizes show almost no gap.

Calculation: a flit is one AXI transfer of the data width (64B). For a 128B payload, frame size = 128 + UDP (8) + IP (20) + Ethernet (14) + FCS (4) = 174B, which requires 3 flits (3 * 64B >= 174B), i.e., 174/3 = 58 bytes sent per flit. Similarly, for a 64B payload the bytes per flit is 55. For other payload sizes, bytes per flit is >= 60.
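A minimal sketch of that calculation (flit width and header sizes as stated above):

FLIT_BYTES = 64                 # AXI transfer data width
HDR_BYTES = 8 + 20 + 14 + 4     # UDP + IP + Ethernet + FCS

def bytes_per_flit(payload: int):
    """Frame size, number of flits, and average bytes carried per flit."""
    frame = payload + HDR_BYTES
    flits = -(-frame // FLIT_BYTES)   # ceiling division
    return frame, flits, frame / flits

print(bytes_per_flit(64))    # (110, 2, 55.0)
print(bytes_per_flit(128))   # (174, 3, 58.0)
print(bytes_per_flit(1472))  # (1518, 24, 63.25)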

mariodruiz commented 2 years ago

Hi,

  1. Under what conditions are you getting this large mismatch? The CMAC should limit the maximum tx/rx bandwidth. However, if the burst is short, the FIFOs in the datapath can absorb this difference.
  2. There's an overhead in the HLS IP which significantly impacts segments in the range of 65 to 160 bytes.

In the throughput notebooks the overhead is shown:

# overhead is UDP (8), IP (20), Ethernet(14), FCS (4), IFG (12), preamble (7), start frame delimiter (1)
overhead = 8 + 20 + 14 + 4 + 12 + 7 + 1
108anup commented 2 years ago

  1. I am using the provided notebook. I change the MAC/IP address configuration and the kernel configuration, but I don't make any changes to the FPGA code, i.e., I use the vanilla xclbin. The TX/RX measurement is done on a single FPGA instead of 2 different FPGAs: the benchmarking FPGA sends packets to a device-under-test (DUT, running OpenNIC), and the DUT forwards the packets back to the benchmarking FPGA. The benchmarking FPGA has 2 kernels started (one in PRODUCER and the other in CONSUMER mode). The PRODUCER kernel sends packets continuously (not in bursts).
     1.1. For 64B UDP payload (174B frame or packet), the TX throughput itself is around 1.163 Gbps above theoretical. This difference is at the frame level. At the application (payload) level, the difference is 0.67 Gbps above theoretical.
     1.2. I.e., for a 64B payload, the throughput reported by the notebook = 49.907 Gbps and the frame_level_throughput = 85.777 Gbps, while the theoretical maximum is 49.231 Gbps at the application level and 84.651 Gbps at the frame level.
     1.3. (frame_level_throughput = application_level_throughput * (payload + 46) / payload.)
  2. A 64-byte payload (frame size = 110B) should also be impacted by this overhead, right? But only the 128-byte payload (frame size = 174B) is impacted. From the notebook in the upstream code, the throughput difference to the theoretical maximum is 4.6 Gbps for a 128B payload (174B segment or frame) while it is 0 for a 64B payload (110B segment or frame). Segment or frame refers to the bits transferred in one AXI stream transaction.
     2.1. From what I understand, based on the reasoning you give, the impact should be seen for the 64B payload rather than the 128B payload, as the 64B payload creates segments of size 110B, which lie in the range of 65 to 160 bytes.

108anup commented 2 years ago

To double check, I can run the vanilla, unmodified VNx notebook to see if the difference is still there. Since the difference is in the TX throughput itself, I would expect the same difference to remain, as TX should not be affected much by the DUT. An increase in RX throughput could perhaps come from packet duplication (though I have no reason to believe that packets are being duplicated).

mariodruiz commented 2 years ago

1.1. For 64B UDP payload (174B frame or packet)

How do you get the 174B in here?

Each of the individual IPs that compose the network layer needs one or two extra clock cycles to process each packet/segment. I suppose that for 128-byte payloads the extra cycles stack up, impacting the throughput. This is something I haven't profiled, because for bulk data transfer you will not use small packet sizes.

The low performance for small packets is known. Given the current design, this is unavoidable.

Mario

108anup commented 2 years ago

Sorry, typo there. For 64B payload (64 + 46 = 110B frame).

108anup commented 2 years ago

I understand that performance will be low for small packets. What I don't understand is why the performance for a 64B payload is better than for a 128B payload. I get better throughput with 64B payloads.

The payload-level throughput I measure using the provided notebook is 49.907 Gbps for 64B and 60.21 Gbps for 128B. If we translate these to the frame level, using frame_level_throughput = payload_level_throughput * (payload size + header size) / (payload size), we get 85.77 Gbps for the 64B payload (110B frame) and 81.84 Gbps for the 128B payload, i.e., the frame-level throughput is higher for 64B payloads.
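For reference, a small sketch of that conversion using the numbers quoted above:

HDR_BYTES = 46   # UDP (8) + IP (20) + Ethernet (14) + FCS (4)

def frame_level(payload_gbps: float, payload_size: int) -> float:
    """Translate payload-level throughput to frame-level throughput."""
    return payload_gbps * (payload_size + HDR_BYTES) / payload_size

print(frame_level(49.907, 64))    # ~85.8 Gbps, 64B payload (110B frame)
print(frame_level(60.21, 128))    # ~81.8 Gbps, 128B payload (174B frame)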

Even in the provided notebook, the throughput with 128B payloads is 5 Gbps lower than theoretical, while for 64B payloads it is close to theoretical, i.e., the efficiency for 64B payloads (the smaller payload) is better than for 128B payloads (the larger payload).

108anup commented 2 years ago

Also any thoughts on why TX throughput might be higher than theoretical?

mariodruiz commented 2 years ago

This is how I compute these throughputs

# Per-packet overheads in bytes
udp = 8
ip = 20
eth = 14
fcs = 4
ifg = 12        # inter-frame gap
pp_amble = 8    # preamble (7) + start frame delimiter (1)

def thr(payload_size: int):
    """Return (payload-level, frame-level) efficiency as a percentage of line rate."""
    total_bytes = payload_size + udp + ip + eth + fcs + ifg + pp_amble
    payload_thr = payload_size / total_bytes
    frame_thr = (payload_size + ip + udp + eth) / total_bytes
    return payload_thr * 100, frame_thr * 100.0

So, thr(64) = (49.23076923076923, 81.53846153846153) and thr(128) = (65.97938144329896, 87.62886597938144)

I think your theoretical equation does not look right: as the payload (segment) size increases, the relative overhead decreases (the efficiency increases).

Correct me if I am missing something.

mariodruiz commented 2 years ago

Someone already asked about this here. I suppose your theoretical throughput is what this person calls naked cmac; it still does not match the numbers I showed above. But I believe the Python snippet is the correct way to compute this.

108anup commented 2 years ago

My definitions are the same except: frame_thr = (payload_size + ip + udp + eth + *fcs*) / total_bytes. The frame is what goes into tdata of the AXIS interfaces, i.e., payload + 46 bytes. This is also why tkeep in the LATENCY kernel is 18 = 64 - 46, i.e., an 18B payload gives a frame of exactly 1 flit (the TDATA width).
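A minimal sketch of that variant, plus the tkeep sanity check (overhead constants as in the earlier snippet):

udp, ip, eth, fcs, ifg, pp_amble = 8, 20, 14, 4, 12, 8
FLIT_BYTES = 64   # TDATA width

def frame_thr_with_fcs(payload_size: int) -> float:
    """Frame-level efficiency, counting the FCS as part of the frame."""
    total_bytes = payload_size + udp + ip + eth + fcs + ifg + pp_amble
    return 100.0 * (payload_size + udp + ip + eth + fcs) / total_bytes

print(frame_thr_with_fcs(64))   # ~84.6, vs ~81.5 when the FCS is excluded

# An 18B payload plus the 46B of headers fills exactly one 64B flit,
# which is why tkeep is 18 in the LATENCY kernel.
assert 18 + udp + ip + eth + fcs == FLIT_BYTES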

  1. For the small packets, the issue is not in the theoretical computation. The issue is that the measured frame throughput for the 128B payload is smaller than for the 64B payload (in my setup). Even in the upstream notebook, the measured payload throughput for the 128B payload is further from theoretical (4 Gbps away: 65 Gbps theoretical, 61 Gbps measured) than for the 64B payload (0 Gbps difference: measured and theoretical = 49.231 Gbps). Even here, the measured L2 bandwidth is higher for the 110B packet (95.45) than for the 174B packet (93.84).

  2. For the theoretical calculations (at least for theoretical throughput at the payload level), I use the exact same calculation as yours. I am seeing measured throughput around 0.6 to 1.3 Gbps higher than the theoretical throughput at the payload level.

Payload | Theoretical Gbps at payload level | Measured Gbps at payload level | Difference (theoretical - measured, Gbps)
-- | -- | -- | --
64 | 49.23076923 | 49.90756385 | -0.6767946191
128 | 65.97938144 | 60.21115183 | 5.768229612
192 | 74.41860465 | 75.26383693 | -0.8452322772
256 | 79.50310559 | 80.59648509 | -1.093379504
320 | 82.9015544 | 84.04169256 | -1.140138154
384 | 85.33333333 | 86.50693924 | -1.173605905
448 | 87.15953307 | 88.35825842 | -1.198725344
512 | 88.58131488 | 89.79962376 | -1.218308883
576 | 89.71962617 | 90.95361477 | -1.233988597
640 | 90.65155807 | 91.89833917 | -1.246781098
704 | 91.42857143 | 92.68604922 | -1.257477787
768 | 92.08633094 | 93.35285902 | -1.266528087
832 | 92.65033408 | 93.92464052 | -1.274306441
896 | 93.13929314 | 94.42033049 | -1.281037355
960 | 93.56725146 | 94.85417209 | -1.286920625
1024 | 93.94495413 | 95.23706712 | -1.292112994
1088 | 94.28076256 | 95.57749555 | -1.296732985
1152 | 94.58128079 | 95.88214486 | -1.300864075
1216 | 94.85179407 | 96.15638323 | -1.304589155
1280 | 95.09658247 | 96.40453013 | -1.307947665
1344 | 95.31914894 | 96.630162 | -1.311013068
1408 | 95.52238806 | 96.83620318 | -1.313815124
1472 | 95.70871261 | 97.02508574 | -1.316373129

In the above table, the measured throughput at the payload level is reported directly by the notebook, i.e., from ol_w0_tg.compute_app_throughput('tx'). The theoretical value is computed as you described, i.e., payload_thr = 100 * payload_size / (payload_size + udp + ip + eth + fcs + ifg + pp_amble).
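For reproducibility, a short sketch that regenerates the theoretical column of the table with that formula:

udp, ip, eth, fcs, ifg, pp_amble = 8, 20, 14, 4, 12, 8

# Theoretical payload-level throughput (Gbps on a 100G link) for each payload size
for payload_size in range(64, 1473, 64):
    payload_thr = 100 * payload_size / (payload_size + udp + ip + eth + fcs + ifg + pp_amble)
    print(payload_size, round(payload_thr, 8))
# 64 -> 49.23076923, 128 -> 65.97938144, ..., 1472 -> 95.70871261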

In this table, there are 2 odd things:

  1. The behavior for the 128B payload differs from both smaller and larger payloads (difference = +5.76 Gbps vs -0.67 Gbps for 64B and -0.84 Gbps for 192B).
  2. The measured throughput is significantly larger than theoretical (e.g., 97.02 Gbps measured vs 95.708 Gbps theoretical for the 1472B payload).

mariodruiz commented 2 years ago

How many packets are you sending for each payload size?

  1. 128-byte payloads are always going to give the worst efficiency. Without redesigning the whole network layer, this cannot be solved.
  2. There's a small chance that the CMAC is overclocked. This could only happen when connected to another Alveo card. You can try running the same experiments with VNx connected to different network equipment (no Alveo).

108anup commented 2 years ago

For the above table, I sent 1 billion packets (basically the PRODUCER is configured as in the notebook). The results are similar for 1 million packets as well.

  1. Could you please elaborate on why 128B is worse than a 64B payload?
  2. Hmm, so the QSFP and wire support more than 100 Gbps? In my setup, the FPGAs are connected to a 100G switch.

mariodruiz commented 2 years ago

  1. As I mentioned above, each IP needs at least one more cycle to process each segment. For 128-byte payloads, these extra cycles stack up, creating the highest overhead.
  2. The QSFP28 has four lanes that can each run at up to 28 Gbps.