108anup opened this issue 2 years ago
Hi,
In the throughput notebooks, the overhead is computed as:
# overhead is UDP (8), IP (20), Ethernet(14), FCS (4), IFG (12), preamble (7), start frame delimiter (1)
overhead = 8 + 20 + 14 + 4 + 12 + 7 + 1
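This sums to 66 bytes of per-packet overhead on the wire. As a worked example (assuming the 100 Gbps line rate the notebook numbers imply): a 1472B payload occupies 1472 + 66 = 1538B on the wire, so the theoretical payload-level throughput is 100 * 1472 / 1538 ≈ 95.71 Gbps.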
To double check, I can run the vanilla, unmodified VNx notebook to see if the difference is still there. Since the difference is in TX throughput itself, I would expect the same difference to persist, as TX should not be affected much by the DUT. The increase in RX throughput could perhaps come from packet duplication (though I have no reason to believe that packets would be duplicated).
1.1. For 64B UDP payload (174B frame or packet)
How do you get 174B here?
Each of the individual IPs that compose the network layer needs one or two extra clock cycles to process each packet/segment. I suppose that for 128-byte payloads the extra cycles stack up, impacting throughput. This is something I haven't profiled, because for bulk data transfer you will not use small packet sizes.
The low performance for small packets is known. Given the current design, this is unavoidable.
Mario
Sorry, typo there. For 64B payload (64+46 = 110B frame).
I understand that performance will be low for small packets. What I don't understand is why performance for a 64B payload is better than for a 128B payload: I get better throughput with 64B payloads.
The payload-level throughput I measure using the provided notebook is 49.907 Gbps for 64B and 60.21 Gbps for 128B. If we translate these to the frame level, using frame_level_throughput = payload_level_throughput * (payload_size + header_size) / payload_size, we get 85.77 Gbps for the 64B payload (110B frame) and 81.84 Gbps for the 128B payload, i.e., the frame-level throughput is higher for 64B payloads.
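For reference, a minimal sketch of that translation (assuming header_size = 46, i.e., UDP + IP + Ethernet + FCS, which matches the 110B and 174B frame sizes above):

def frame_level_throughput(payload_thr_gbps: float, payload_size: int, header_size: int = 46) -> float:
    # scale the measured payload-level rate by the frame/payload size ratio
    return payload_thr_gbps * (payload_size + header_size) / payload_size

print(frame_level_throughput(49.907, 64))   # ~85.78 Gbps (110B frame)
print(frame_level_throughput(60.21, 128))   # ~81.85 Gbps (174B frame)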
Even in the provided notebook, the throughput with 128B payloads is 5 Gbps lower than theoretical, while for 64B payloads it is close to theoretical, i.e., the efficiency for 64B payloads (the smaller payload) is better than for 128B payloads (the larger payload).
Also, any thoughts on why the TX throughput might be higher than theoretical?
This is how I compute these throughputs:
# Per-packet overhead components, in bytes
udp = 8        # UDP header
ip = 20        # IP header
eth = 14       # Ethernet header
fcs = 4        # frame check sequence
ifg = 12       # inter-frame gap
pp_amble = 8   # preamble (7) + start frame delimiter (1)

def thr(payload_size: int):
    # Theoretical efficiency in % of line rate (i.e., Gbps on a 100 Gbps link),
    # at the payload level and at the frame level.
    total_bytes = payload_size + udp + ip + eth + fcs + ifg + pp_amble
    payload_thr = payload_size / total_bytes
    frame_thr = (payload_size + ip + udp + eth) / total_bytes
    return payload_thr * 100.0, frame_thr * 100.0
So, thr(64) = (49.23076923076923, 81.53846153846153)
and thr(128) = (65.97938144329896, 87.62886597938144)
I think your theoretical equation does not look right: as the payload (segment) size increases, the overhead decreases (the efficiency increases).
Correct me if I am missing something.
Someone already asked about this here. I suppose your theoretical throughput is what this person calls "naked cmac"; it still does not match the numbers I showed above. But I believe the Python snippet is the correct way to compute this.
My definitions are the same, except: frame_thr = (payload_size + ip + udp + eth + *fcs*) / total_bytes
The frame is what goes into tdata of the AXI-Stream interfaces, i.e., payload + 46 bytes. This is also why tkeep in the LATENCY kernel is 18 = 64 - 46: an 18B payload gives a frame of 18 + 46 = 64B, i.e., exactly 1 flit (the TDATA width).
For the small packets, the issue is not in the theoretical computation. The issue is that the measured frame throughput for the 128B payload is smaller than for the 64B payload (in my setup). Even in the upstream notebook, the measured payload throughput for the 128B payload is further from theoretical (4 Gbps away: 65 Gbps theoretical, 61 Gbps measured) than for the 64B payload (no difference: measured and theoretical are both 49.231 Gbps). Even here, the measured L2 bandwidth is higher for the 110B packet (95.45) than for the 174B packet (93.84).
For the theoretical calculations (at least for theoretical throughput at the payload level), I use the exact same calculation as yours. I am seeing measured throughput around 0.6 to 1.3 Gbps higher than theoretical at the payload level.
In the above table, the measured throughput at the payload level is directly reported by the notebook, i.e., from ol_w0_tg.compute_app_throughput('tx'). The theoretical value is taken from what you described here, i.e., payload_thr = 100 * payload_size / (payload_size + udp + ip + eth + fcs + ifg + pp_amble).
In this table, there are two weird things.
How many packets are you sending for each payload size?
For the above table, I sent 1 billion packets (basically, the PRODUCER is set as in the notebook). Results are similar for 1 million packets as well.
I am not sure what the best place to ask this question is, hence I am asking here as an issue. Please let me know if some other place is preferred.
Q1: I get cases where the observed throughput is higher than the theoretical maximum, e.g., 97 Gbps with a 1472B payload where the theoretical maximum is around 95 Gbps. 2 Gbps is a large enough difference that I can't explain it by a few cycles of measurement mismatch. I noticed the recently changed Jupyter notebook also shows this, but there the mismatch is only around 0.0005 Gbps. What might be the reasons for a higher-than-theoretical throughput measurement? If we under-measure the time to receive packets by even 100 cycles, that may explain a 0.0005 Gbps mismatch but not a 2 Gbps mismatch.
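To make that last point concrete, here is a rough sketch of how a timing under-measurement maps into a throughput over-estimate. The 300 MHz kernel clock is my assumption, not a number from the notebook:

CLOCK_HZ = 300e6   # assumed kernel clock frequency (not confirmed)
LINE_RATE = 100e9  # line rate in bits per second

def throughput_inflation(n_packets: int, wire_bytes: int, missing_cycles: int) -> float:
    # Gbps of over-estimate if the receive duration is under-measured by
    # missing_cycles while the link actually runs at exactly line rate.
    bits = n_packets * wire_bytes * 8
    true_time = bits / LINE_RATE
    measured_time = true_time - missing_cycles / CLOCK_HZ
    return (bits / measured_time - bits / true_time) / 1e9

print(throughput_inflation(10**6, 1538, 100))  # ~0.0003 Gbps over-estimate
print(throughput_inflation(10**9, 1538, 100))  # ~3e-7 Gbps: negligible

Under these assumptions, even a 100-cycle error is orders of magnitude too small to produce a 2 Gbps inflation at these packet counts.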
Q2: I observe relatively lower throughput with a 128B payload. Why might that be? Is it to do with the bytes transmitted per flit? If so, a 64B payload should have a similar issue (55 bytes per flit). But we see that only the 128B payload has a ~5 Gbps gap to the theoretical maximum, while other payload sizes don't have much of a gap.
Calculation: a flit is one AXI transfer of the data width (64B). For a 128B payload, frame size = 128 + UDP (8) + IP (20) + Ethernet (14) + FCS (4) = 174B, which requires 3 flits (3 * 64B >= 174B), or 174/3 = 58 bytes sent per flit. Similarly, for a 64B payload, bytes per flit is 110/2 = 55. For other payload sizes, bytes per flit is >= 60.
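A minimal sketch of that bytes-per-flit calculation (the payload sizes in the loop are just examples):

import math

FLIT = 64    # AXI-Stream TDATA width in bytes
HEADER = 46  # UDP (8) + IP (20) + Ethernet (14) + FCS (4)

def bytes_per_flit(payload_size: int) -> float:
    frame = payload_size + HEADER
    n_flits = math.ceil(frame / FLIT)   # flits needed to carry the frame
    return frame / n_flits

for p in (64, 128, 256, 1472):
    print(p, bytes_per_flit(p))
# 64 -> 55.0, 128 -> 58.0, 256 -> 60.4, 1472 -> 63.25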